0:21

One of the interesting things about this RNA

Â is that it forms what we call secondary structure.

Â So if we think about RNA it has a sequence of letters over some alphabet.

Â And one of the interesting things it does inside the cell is that it

Â forms a secondary structure where it moves onto itself or it falls onto itself.

Â And one of the interesting questions to ask, or

Â that biologists are very interested in,

Â is trying to predict if someone gives me the sequence of the RNA, to try tp predict

Â how that sequence is going to fold into itself and form that secondary structure.

Â So this is a very interesting computational problem in biology.

Â That is, that is,

Â is of interest to the computational community because we can come up with

Â algorithms for predicting the structure from primary sequence or from a sequence.

Â So now we are going to go through this algorithmic thinking exercise to get

Â from this biological problem all the way to our,

Â to our computationally cleanly defined problem, and then.

Â Think about an algorithm to solve it for the biologist.

Â So the first question when, when the biologist comes to us and

Â says I want you to predict the constructor for me.

Â The first question is okay, what is this problem?

Â What is the input?

Â What is the output?

Â He says or she says that the, the input, what I have, the data I have.

Â Are RNA sequences, okay?

Â What is an RNA sequence?

Â 1:58

This is an RNA sequence, this is another RNA sequence and so on.

Â So, okay, it seems like words, like strings.

Â What we say in computer science.

Â And it seems that the biologist is using special char letters here.

Â A, U, C, G.

Â I am not seeing, I am not seeing any other letters.

Â The biologist tells me yes.

Â An RNAs, an RNA sequence has only one,

Â each one of the letters in that sequence is just one of four letters A, U, C, G.

Â 2:41

What is a string?

Â What is a string?

Â We can define that recursively.

Â Okay?

Â So I can, if I,

Â if this is a string, I can, add another letter to it, it will be a string.

Â I can add another letter, another letter and so on to it and it will be a string.

Â So if I, if, if this is my segment.

Â Okay? Big Sigma which is the alphabet.

Â I can define the set of strings recursively if I denote by Sigma* the set

Â of all strings over an alphabet, I can think about the base case.

Â 3:47

W sigma is a string where sigma is in sigma.

Â Alright?

Â So if I want to apply this definition to this, I say okay, the,

Â if I want to start listing the elements in this, I say okay, the first one.

Â Base case tells me that the empty string is there.

Â Then I can apply the recursive step that states if you look,

Â if you take a string from here and start appending letters to it from sigma,

Â you will get, you'll get a string there.

Â So this notation here is what we call concatenation, and

Â I don't need to go into formal definition here.

Â Think about it as if I give you two strings, you just.

Â Glue them together.

Â Or here I'm talking about one symbol.

Â I append it at the end to the string.

Â So if I take a string and ap, if empty string.

Â A special case is that if you take the empty string an append any letter to it,

Â you just get that letter, okay?

Â So I can take the empty string and append A to it, I get A.

Â I append U, I get U.

Â I append C or G.

Â Then I can take these and I start appending letters to them.

Â AA, AU, AC, AG, and so on.

Â This is an infinite set.

Â I can create an infinite set.

Â But this is the finite representation of that set.

Â It is a recursive definition of this set, or

Â how the elements in this set are formed.

Â Okay?

Â So now I know that what the biologist is talking about is the set

Â of strings over this alphabet that I can define mathematically like this.

Â Okay?

Â So now I am starting to zoom in on what the biologist has in terms of data.

Â The biologist is telling me that the data they have are these kind of sequences.

Â 5:30

Then I say okay, I understand now what, what the input is.

Â You are going to give me a sequence like this, okay?

Â You are going to give me a sequence and I understand what it is,

Â I understand what the letters are, I understand how it, how it is formed, okay?

Â Then say what is the output, what is it that you want from me?

Â He says I want you to predict how this sequence is going to fold.

Â On itself.

Â So what does it mean to fold?

Â It said you take take letters from here and basically match them, or

Â you can take this sequence and draw it such that you can,

Â for example, match this with this, this with this, this with this, this with this.

Â So in this case here, what happened is that we have A,A.

Â 6:12

A ,A, U.

Â U, A, C, C, G.

Â Okay, so the sequence followed like this.

Â So this letter went with this letter, this went with this,

Â this with this and this with this.

Â So I go to the biologist.

Â I say.

Â Okay, so this is what you mean, the biologist says something similar to

Â this but not exactly, what don't I then go back to the biologist, what is it th at

Â you don't like about this, he says, the biologist says, for one thing, for one

Â thing, when I look at the letters that you matched together, A has to go with a U.

Â And C has to go with the G.

Â Nothing else.

Â I cannot put A with C, A with G, U with C, U with G, and so on.

Â Either A goes wi, A with U, or C with G, nothing else.

Â So if I look at this, the biologist doesn't like it.

Â A went with G, A went with C, U with C, this is only one actually,

Â that the biologist might like, U with A.

Â 7:19

One other thing is that the biologist said that, I have a sharp turn here.

Â That, you know it's like I'm breaking the sequence and folding it under itself.

Â The biologist says that this doesn't happen,

Â they haven't observed that in biology.

Â What they've observed is when a sequence folds in on itself.

Â It actually forms a loop here.

Â So some letters do not match with each other.

Â Okay?

Â So, and I say okay what is the size of this loop or

Â what is the minimum size of this loop?

Â The biologist says that if you match if you

Â match a symbol with another symbol here.

Â They have to be separated by at least four positions, so

Â if I match this with this, I need to count 1, 2, 3, 4, so this is fine,

Â matching this the second with the seventh here is fine, okay?

Â So this is 1, 2, 3, 4, 5, 6, 7, 8.

Â It is fine to match 2 with 7,

Â it is not fine to match 4 with 5 because these two are not separated.

Â By at least four positions so basically the biologist says if we match

Â positions i and j,

Â then i has to be smaller than j minus 4, right?

Â So if I look at this is 2 and 7, 7 minus 7 minus 4 is, is 3.

Â So this is 2.

Â This is 3.

Â So it is true.

Â If the, if, if j was 6.

Â Okay?

Â So 2 smaller than which is 4.

Â So I cannot match 2 with 6.

Â So if I match i with j, i has to be smaller than j minus 4.

Â Okay?

Â So we have this which the biologist is calling [SOUND] no sharp turns.

Â [SOUND] And then here, this is complementarity.

Â [SOUND] Okay?

Â [SOUND] I say all of these are only conditions, the biologist says no.

Â The third one is [SOUND].

Â That this needs to be some sort of a matching.

Â The biologist tells me I cannot match the same letter more than once.

Â I cannot say this A is going to match to this G and

Â this A is going to match with C.

Â No. If I match this A to

Â this G I am done with this A.

Â i cannot match it with anything else.

Â What this means that.

Â If IJ is a matching pair of

Â positions, then INJ do not participate.

Â 9:56

In any other pair.

Â Okay?

Â So this is condition number 3.

Â So biologists has told me A with U, C with G.

Â Nothing else.

Â They say that each two positions I put in the matching,

Â they have to be separated by at least four positions.

Â The third one that the biologist tells me is that if I choose to

Â match this a with this g.

Â This a and this g are taken, I cannot use them again.

Â Okay is this, is this all?

Â 10:48

So this, the biologist says this cannot happen.

Â Or in other words, if I have I, J, if I tell the biologist I'm

Â going to match position I with position J in the secondary structure.

Â And I match k with with m, but

Â I will just say that it cannot be that i is smaller than k is smaller than j, okay?

Â They have to be either by smaller than j, something like this, okay?

Â 11:20

This is fine, or it can be i smaller than k smaller than l smaller than

Â j which is i j and we have k l, okay?

Â So now the biologist has told me four conditions.

Â That A goes with U.

Â C with G.

Â The biologist told me that i and j.

Â If I tell the biologist here is your secondary structure.

Â I am giving you the pairs of positions that match.

Â The i and j should be separated by at least four positions.

Â The third one that if i match, if i no element,

Â no position appears in more than one pair.

Â It can either appear in one pair or none.

Â And the, the fourth one is that, I cannot have this crossing situation, okay?

Â It's either ij encompasses kl, or they are disjoint, okay?

Â So now, the biologist has defined for

Â me, the feasibility conditions, that when I take as an input, a sequence.

Â Which I understand what it is,

Â a sequence that is an RNA, a string over the alphabet A,U,C,G.

Â The biologists want back from me a set of pairs, you know,

Â IJ, KL, MN and so on, where each pair here is, is, two positions in the sequence.

Â To positions in the sequence, such that the letters at these twp positions for

Â every pf, the letters at the positions have to be either e n u or c n g,

Â nothing else, such that, such that for every pl, if I look at i,

Â n, j, k, l m and n, each pair is separated by at least four, so i is

Â smaller than j minus 4, k is smaller than l minus 4, m is smaller than n minus 4.

Â So the third condition is that if I look at these pairs, they don't share

Â any element in common, and the fourth one is that I have this property, that for

Â every two pairs, they are either disjoint like this or they are one with,

Â one is contained within the other.

Â Okay?

Â So now I understand what is the input.

Â Which is a, a string over A,U,C,G.

Â And I understand what is the output, it's a set of pairs with some conditions.

Â Okay? With some constraints.

Â So now we have defined the problem in terms of feasibility.

Â 13:49

Then, is this sufficient for

Â me to answer, to answer the question that the biologist want?

Â The answer is no, because I can find, it might be the case that there will be

Â tons of ways, tons of such sets that will satisfy this property.

Â For example.

Â The most trivial answer, that I can give to the biologist is, the empty set.

Â If I give the biologist the empty set, it is satisfying this, it has no pairs, so

Â I am not going to violate this.

Â It has no pairs, I'm not going to violate this.

Â It has no pairs, I'm not going to violate this.

Â It has no pairs, I'm not going to violate, violate this property.

Â So I can tell the biologist I have a trivial solution for you.

Â 14:27

If you give me a pri, you give me an RNA sequence.

Â I am going to give you an empty set.

Â It satisfies all the conditions.

Â Then the biologist revised their answer.

Â They said, oh I would want of all these possible solutions that you

Â can give to me.

Â I would want the one that has the maximum numbers of pairs.

Â Okay?

Â So if I give the empty set the biologist says I'm not happy with that because I

Â can find another set that has at least one pair.

Â And then someone else can come with another solution that satisfies all these

Â four conditions, but has two pairs.

Â And someone can come to 3, and 4, and 5, and so on.

Â So now, I understand what is the problem exactly that the biologist wants to solve.

Â The biologist wants to give me an RNA string, a string over A,U,C,G.

Â The biologists want a set of pairs, where each pair is a,

Â a pair of indices from that string, such that I satisfy these properties.

Â These are feasibility conditions.

Â And the biologist has also some optimality criterion in mind.

Â That optimality criterion is that.

Â The biologist would want the set with a maximum possible number of such

Â peers/g; that via, that, that, satisfy this constraint.

Â And this is our [INAUDIBLE] structure problem in our case okay?

Â So it is a problem where we're given a, a, s,

Â string over A,U,C,G we want to get back the output is is a set of these peers.

Â This set of pairs, we have to satisfy feasibility conditions,

Â the four conditions we listed here, and it has also to satisfy some of the criterion,

Â that over all possible sets of pairs that satisfy these conditions,

Â we would like to give a set that has the maximum number of pairs.

Â Okay? This is the RNA secondary

Â structure problem.

Â