我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We just discussed edit distance which allows us to find the edit distance

between two strings using a powerful dynamic programming algorithm, so

now we want to apply this same sort of idea to the approximate matching problem,

to the problem of finding approximate occurrences of a pattern P in a text T.

So shown here is the sort of matrix we talked about in a previous lecture for

finding the edit distance between x and y, and here the characters of x are labeling

the rows of the matrix and the characters of y are labeling the columns.

But now we're going to imagine a new kind of matrix like this

where the rows are labeled with characters of our pattern P, and

the columns are labeled with characters from our text T.

And, one change that we'll make to our edit distance algorithm,

is that we'll initialize this matrix a little bit differently,

so before we initialized the first row and the first column with ascending integers,

zero, one, two, three, etc.

This made sense because if we're talking about the edit distance between,

on the one hand, the empty string and on the other hand a string of length three,

for example, then that edit distance has to be three.

For approximate matching we're instead going to initialize the first row with

all zeroes.

The first column we'll initialize as we did before

with ascending integers like this.

So why do we initialize the first row with all 0s?

This is because we don't know ahead of time where P is going to occur within T,

so every offset here is equally likely, and so

by filling the first row with all 0s we're not biasing ourselves toward

any particular offset where P might occur within T.

If this point isn't obvious right now,

it'll probably become clearer after the description of the algorithm.

So once the matrix is initialized,

we simply fill in the rest of the matrix in exactly the same way as we did for

the edit distance problem, and now that we've filled in the matrix,

we'd like to identify where are these approximate matches of P within T?

So, in this case, by looking at the matrix,

we learn that there must be a two edit occurrence of P within T,

that that is the occurrence with the fewest edits.

So we noticed by looking in the final row of this matrix, and

in the final row we find a 2.

And that tells us that there must be some substring of T

that matches our pattern P with two edits.

2 is the minimal value in the final row, so

this corresponds to the closest match between P and some sub string of T.

So, that tells us that P matches somewhere with two edits.

What we don't know yet is we don't know how to figure out exactly where that

substring of T is that it matches.

And to answer that, we first have to answer the question, how did we get here?

How did we get down to that value in the bottom row of 2?

So in the first step, we can wonder something very specific.

So how did we put this 2 in this particular element

when we calculated the minimum over those three terms?

So we ended up with a 2, but

in this case it must've been that we came diagonally from above and to the left.

So had we come either horizontally or vertically, the value would've been 4,

because the values above and to the left are both 3, and

we have to add 1 in both cases to account for that final gap.

So if we had come horizontally or vertically, the answer would be 4, but

because it's not, we can conclude that we must have come diagonally.

So that means that we got to the blue cell from this red cell.

So now we can ask this same question of that red cell.

How did I get here?

And again, we can look at the vertical and horizontal and diagonal contributions.

The vertical and horizontal contributions both would

be 3 because both the cell above and the cell to the left have a 2 in them.

We would have to add 1 for

adding on that final gap, so they would both contribute a 3.

But the movement from diagonally from above and

to the left would only add one, so we must've come diagonally.

And so we can repeat this kind of procedure to go cell by cell backwards,

recreating the way that we got the value in each cell.

Once we get to this cell here, we'll ask the same question, and

instead of concluding that we came diagonally,

this time we'll conclude that we came vertically.

So we'll look again at those three contributions.

This time the diagonal contribution actually is

greater than the vertical contribution.

The diagonal contribution would be 2, because there's a 1 in the cell diagonally

above and to the left, and the delta function would evaluate to 1 in this case

because the C in the text mismatches the G in the pattern if you look

at the character labeling that column and the character labeling that row.

And yet the cell that's vertically above has a 0 in it, and so

we would add 1 to account for the final gap.

And that would be the minimal value.

So in this case we came from above, but

we can then repeat this procedure to get all the way back up into the very top row.

So now we know the full answer to the question,

how did we get to this 2 here in the bottom row?

We followed basically this path shown here with the purple arrow.

So this is called, this path, is called the trace back, and

the procedure that we followed to find this path is called tracing back.

And it's as though we're sort of retracing our steps to figure out how we got

to the value that was placed in that cell in the bottom row.

Once we have the trace back,

we know exactly which substring of T matches with P with two edits.

It's the substring that begins at offset 5 with respect to T.

We know this because this element here is the first one

where we've aligned a character from T to a character from P.

So in this case we've aligned this G,

the G at offset 5 in the text T with the very first G of P.

The other thing it tells us is the shape of the alignment, specifically,

it tells us where the differences are between P and T.

So for example, the vertical part, of this trace back here,

corresponds to the only gap in the alignment, and so we can

write down the alignment that corresponds to the trace back, like you see up here.

And here I've highlighted the matches in green, and the differences in red.

And so here is the gap that corresponds to the vertical move in the trace back,

the vertical step in the trace back.

So the overall shape of this alignment

follows the overall shape of this trace back here.

So this is how we use edit distance to solve the approximate matching problem.

We fill in a matrix just like this, and we can look in the final row to detect

occurrences of P in T with some number of edits, and

then we can use the trace back to identify the exact location

where that match occurs and what the shape of the alignment is.

So, a big problem with methods like this is that they can be quite slow.

So, the amount of work that we had to do to solve this problem was proportional to

the number of elements in the matrix which in turn is proportional to the number of

characters in P times the number of characters in T.

So this can add up to quite a lot of work, and if you compare the amount of work

we're doing here to the amount of work we were doing with Boyer-Moore,

the amount of work we did there was proportional to the length of the text or

approximately, it was just proportional to the length of T.

So in fact for some variations of Boyer-Moore, the amount of work we do is

strictly, at worst, linearly proportional to the length of the text, and

then furthermore, the longer the length of the pattern with Boyer-Moore,

the more skipping that we could do.

So, this is not really a fair comparison because Boyer-Moore is only capable of

exact matching, at whereas what we're doing here,

this algorithm that we just learned is capable of approximate matching, but

this makes this algorithm potentially an impractical thing to use on its own for

solving problems like for example the read alignment problem.

In practice, edit distance-based approximate matching tends to be used in

combination with other techniques like techniques that we've learned about using

an index, using the pigeon hole principle, things like this.

So, by combining those two kinds of techniques we can get the flexibility of

this edit distance algorithm, which naturally handles both mismatches and

gaps for example, and we can combine that with the speed that comes

from using an index or otherwise using the pigeon hole principle to quickly filter

out a large number of alignments that won't end up with a good edit distance.