0:02

We've seen dynamic programming algorithms for finding the edit distance between two strings and also for finding approximate occurrences of one string inside another. In this lecture, we'll look at a couple of variations on the theme of dynamic programming for edit distance, and we're going to use them to solve two different problems called global alignment and local alignment. These are two very commonly solved problems in lots of biosequence analysis tools, and they're both relevant for our target problems of read alignment and genome assembly.

0:37

Edit distance penalized all the different kinds of edits that we might find the same amount, right? Whether it was a mismatch, an insertion, or a deletion, edit distance would penalize it by 1. Each of those events incurred a penalty of 1. So there was no difference, for example, between the penalty for an insertion versus a substitution, and no difference between the penalty for a substitution that turned an A into a C versus a substitution that turned an A into a T.

1:09

It doesn't always make sense, though, to penalize all these differences the exact same amount. What if it turned out, for example, that in practice gaps were a lot less frequent than substitutions? Then it might make sense to penalize the gaps more than the substitutions. Or what if it turned out that certain base-to-base substitutions were more likely than others? Then we might want to penalize those less than the others.

1:39

So in fact, both of these things are the case. Some DNA substitutions are more likely than others. For example, we can divide all possible DNA substitutions into two categories called transitions and transversions. To understand what these are, we first have to know that the bases A and G, adenine and guanine, both belong to a category called purines, and the bases C and T both belong to a category called pyrimidines. Substitutions that change a purine to another purine, or a pyrimidine to another pyrimidine, are called transitions. All other substitutions are called transversions, and if you enumerate the possibilities, you'll see that there are twice as many kinds of transversions as there are kinds of transitions. So you might think that transversions would be something like twice as frequent as transitions. But it turns out that, in reality, if you look at the substitutions that differentiate, say, the genomes of two unrelated humans, transitions are actually about twice as frequent as transversions. It's the other way around from what you would expect. So if transitions are so much more frequent than transversions, it seems like in our penalty scheme we might want to penalize the transversions more than we penalize the transitions.
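This classification is easy to check in code. Here's a minimal sketch (the function name and structure are my own, for illustration) that labels a substitution and enumerates all twelve possibilities:

```python
from itertools import permutations

PURINES = {'A', 'G'}        # adenine, guanine
PYRIMIDINES = {'C', 'T'}    # cytosine, thymine

def substitution_type(a, b):
    """Return 'transition' if a -> b stays within the purines or within
    the pyrimidines, 'transversion' otherwise.  a and b must differ."""
    assert a != b
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return 'transition' if same_class else 'transversion'

# Enumerate all 12 possible substitutions: 4 kinds of transition,
# 8 kinds of transversion -- twice as many kinds of transversion.
counts = {'transition': 0, 'transversion': 0}
for a, b in permutations('ACGT', 2):
    counts[substitution_type(a, b)] += 1
print(counts)  # {'transition': 4, 'transversion': 8}
```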

3:07

Another example is small gaps, that is, small insertions and deletions in the genome. It turns out that if you take the genomes of two unrelated humans and count up how many differences are substitutions versus how many are small insertions or deletions (insertions or deletions of just a few bases), you find that the substitution rate is something like 1 in 1,000 bases, whereas the indel rate ("indel" meaning insertion or deletion) is something like 1 in 3,000 bases or so. So indels are less frequent than substitutions, and we might want to penalize indels more than substitutions. So we want to move beyond edit distance, where every kind of difference receives the same penalty, and instead assign different penalties to different kinds of mutations, different kinds of events. Here, for example, is what's called a penalty matrix.

4:05

This matrix has an element for every kind of penalty that might be incurred in an approximate match. Some of the elements correspond to substitutions, like these yellow and green elements, and some correspond to insertions and deletions, like these pink elements.

4:21

But a key point is that we can set the elements of this matrix in whatever way is appropriate for our biological context. So if we're working with DNA and comparing the DNA of two different humans, we might use a substitution matrix like this one, where transitions receive a lower penalty than transversions, and transitions and transversions both receive a lower penalty than gaps.
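As a concrete sketch, such a penalty matrix might look like the following. The specific values here are my own illustrative assumptions, chosen only so that transition < transversion < gap; they are not the lecture's exact numbers. `'-'` stands for a gap:

```python
# Hypothetical DNA penalty matrix: match 0, transition 2, transversion 4,
# gap 8.  penalty[a][b] is the cost of aligning character a with b.
penalty = {
    #       A       C       G       T       -
    'A': {'A': 0, 'C': 4, 'G': 2, 'T': 4, '-': 8},
    'C': {'A': 4, 'C': 0, 'G': 4, 'T': 2, '-': 8},
    'G': {'A': 2, 'C': 4, 'G': 0, 'T': 4, '-': 8},
    'T': {'A': 4, 'C': 2, 'G': 4, 'T': 0, '-': 8},
    '-': {'A': 8, 'C': 8, 'G': 8, 'T': 8},   # insertions into x
}

# An A->G transition costs less than an A->C transversion:
print(penalty['A']['G'], penalty['A']['C'])  # 2 4
```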

4:45

Remarkably, we barely have to change our edit distance algorithm at all in order to accommodate this new penalty matrix. When it comes to filling in our dynamic programming matrix, all we really have to change is the value that gets added on when we calculate the three contributions: the diagonal, vertical, and horizontal contributions. In the case of edit distance, for an insertion or a deletion, that is, for a vertical or a horizontal move through the matrix, we just added 1 onto the edit distance of the shorter prefixes. Now we add a value that we look up in our penalty matrix. Also, before, when we were considering a diagonal move, we used our delta function to figure out whether the characters from X and Y were equal, in which case the delta function evaluates to 0, or unequal, in which case it evaluates to 1. Now we'll again just do a lookup in our penalty matrix to determine what the penalty should be.

5:45

Apart from this, the algorithm is pretty much identical to the algorithm for edit distance, both in terms of how we fill in the matrix and in terms of how we find the traceback. So that's global alignment. Here's a global alignment matrix filled in, and here's a traceback. It's all very much analogous to what we did with edit distance.
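The matrix fill described above can be sketched as follows. This is a hedged illustration, not the lecture's exact code: the penalty values (match 0, transition 2, transversion 4, gap 8) are assumptions chosen to match the shape of the scheme discussed, and `'-'` denotes a gap:

```python
PURINES, PYRIMIDINES = {'A', 'G'}, {'C', 'T'}

def penalty(a, b):
    """Illustrative penalty lookup: match 0, transition 2, transversion 4,
    gap 8.  This plays the role of the penalty matrix."""
    if a == b:
        return 0
    if a == '-' or b == '-':
        return 8
    transition = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return 2 if transition else 4

def global_alignment(x, y):
    # D[i][j] = minimum total penalty for aligning x[:i] with y[:j]
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):      # first column: prefix of x vs gaps
        D[i][0] = D[i-1][0] + penalty(x[i-1], '-')
    for j in range(1, len(y) + 1):      # first row: prefix of y vs gaps
        D[0][j] = D[0][j-1] + penalty('-', y[j-1])
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(D[i-1][j-1] + penalty(x[i-1], y[j-1]),  # diagonal
                          D[i-1][j]   + penalty(x[i-1], '-'),     # vertical
                          D[i][j-1]   + penalty('-', y[j-1]))     # horizontal
    return D[len(x)][len(y)]

print(global_alignment('AC', 'AG'))      # one C->G transversion: 4
print(global_alignment('ACGT', 'ACGT'))  # identical strings: 0
```

Note that the only difference from the edit distance fill is the `penalty` lookup replacing the fixed cost of 1 and the 0/1 delta function.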

6:18

Local alignment, on the other hand, solves a somewhat different kind of problem. Although, surprisingly, it will turn out that the solution is still quite similar to the solution for edit distance.

6:32

In the local alignment problem, we're not trying to find the distance between two strings, and we're not trying to find occurrences of one string inside another. Here, the problem we're trying to solve is: given two strings X and Y, identify the substring of X and the substring of Y that are most similar to each other. We'll define more clearly later what we mean by "most similar," but first let me show a quick example to motivate this local alignment problem.

7:00

So here are two strings. They're in English so that the similarities are more obvious to us. The question is: can you find a pair of substrings, one from X and one from Y, that are the most similar?

7:14

Looking at this example, it seems like these substrings in the middle here are the most similar. They're at a Hamming distance of 1 from each other, but they have nine matching positions with each other. So this is approximately the kind of problem we're solving with local alignment.

7:32

Now, in a sense this seems pretty difficult compared to edit distance, because we have to somehow consider all possible pairs of substrings from X and Y. If you just count how many possible pairs there are, the number is related to the product of the squares of the lengths of X and Y. That's a lot of pairs to sift through, but it will turn out that the amount of work we do is actually no different from a global alignment problem of the same size. So here's the recurrence for local alignment.

8:16

Instead of using a penalty matrix, we're going to use something similar, which we'll call a scoring matrix, shown at lower right. An important difference between the scoring matrix and the penalty matrix is that in the scoring matrix we give a positive bonus for a match, and a negative penalty for all the different kinds of differences. In this case, substitutions have a penalty of -4 and gaps have a penalty of -6.

8:55

Intuitively, this is because the goal of local alignment is to find parts of X and Y that match closely enough that they pop out from the background of 0s, from the background of dissimilarity. You'll see, for example, that the values that are non-0 are mostly over here, in this region.

9:16

If we want to find the local alignment that gives the best similarity between two substrings, we first look for the element in the matrix that has the maximal value. In this case, that's this 12 right here; that 12 is the maximal value.

9:33

That corresponds to the optimal local alignment. If we want to know which substrings are involved in that alignment, and exactly what the shape of that alignment is, we can use our usual traceback procedure. We start in the cell labeled 12 and do our usual traceback. The only modification to the typical traceback procedure is that we stop as soon as we reach an element whose value is 0.

10:01

Once we've done that, we've figured out, first of all, which substrings from X and Y are the most similar; they're highlighted here in green. Here's the substring from Y, and here's the substring from X, and we also know the shape of the alignment between those two substrings, shown here.
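The fill-plus-traceback procedure just described can be sketched like this. The scoring values are partly assumptions: the lecture gives substitution -4 and gap -6, while the match bonus of +2 and the example strings are mine, for illustration:

```python
def score(a, b):
    """Scoring function: match bonus (assumed +2), substitution -4,
    gap -6.  '-' denotes a gap."""
    if a == b:
        return 2
    if a == '-' or b == '-':
        return -6
    return -4

def local_alignment(x, y):
    # V[i][j] = best score of any alignment ending at x[i-1], y[j-1], or 0
    V = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, where = 0, (0, 0)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            V[i][j] = max(0,  # the empty alignment: never drop below 0
                          V[i-1][j-1] + score(x[i-1], y[j-1]),  # diagonal
                          V[i-1][j]   + score(x[i-1], '-'),     # vertical
                          V[i][j-1]   + score('-', y[j-1]))     # horizontal
            if V[i][j] > best:
                best, where = V[i][j], (i, j)
    return best, where, V

def traceback(V, x, y, i, j):
    """Walk back from cell (i, j), stopping at the first 0, and return
    the start coordinates of the aligned substrings."""
    while V[i][j] != 0:
        if i > 0 and j > 0 and V[i][j] == V[i-1][j-1] + score(x[i-1], y[j-1]):
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i-1][j] + score(x[i-1], '-'):
            i -= 1
        else:
            j -= 1
    return i, j

x, y = 'AAATGCATCC', 'GGGTGCATGG'
best, (i, j), V = local_alignment(x, y)
i0, j0 = traceback(V, x, y, i, j)
print(best, x[i0:i], y[j0:j])  # shared substring TGCAT scores 5 * 2 = 10
```

Note that, as promised, the fill is the same amount of work as global alignment on strings of the same size; the only changes are the max with 0 and starting the traceback at the maximal cell instead of the corner.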

10:18

So this is a very useful tool. Both global alignment and local alignment are very useful tools, and they're used in many different kinds of biosequence analysis tools, including read alignment tools and assembly tools.
