我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

At the end of the previous lecture, we wrote down this recursive function for

finding the edit distance between two strings.

But there's a problem.

It's very, very slow.

So here for example,

is some Python code that runs in our function as you can see in red.

To find the added distance between two strings.

And then this sort of gray code on either side, sort of above here and

below here, is just there to measure how much time it takes to run our function.

So at the end, you see the output of the function, which is three.

Which is the added distance between those two strings.

It's not hard to see that.

And then the number you see here is the number of seconds that it took

to run the function.

So it took about 31.5 seconds to run this function, and that's a very long time

to wait to try to find the edit distance between these two pretty short strings.

So why would it be taking that long?

So here's a little hint.

Let's say that this is the initial call that we make to the recursive function,

to the edit distance function.

We want to find the edit distance between these two strings.

And so this initial call will make three recursive calls.

Each with different arguments, like you can see here.

Now this recursive call, this one over here,

will make three more recursive calls, right like this, etc, etc, etc.

So for simplicity, I've only really drawn a tiny little bit

of the overall tree of recursive calls that would happen.

But just looking at these seven calls that are visible right now,

you'll notice that two of them are the same.

So they're the same in the sense that two arguments to the function are the same.

And so, of course, the result returned by those two calls will also be the same.

And that's rather wasteful, right?

If we're going to through the trouble of running the function, and

it seems like we should at least remember what the answer is.

So that if later on we make the same exact call, we call the function

with the same arguments, then we can just remember what the answer is.

Instead of running the function all over again, which might take a lot of time.

So here's another demonstration that gives a clearer idea of just how

big the problem is.

Here's our editDistance function again.

I'm writing it out again.

But there's a little bit of added code, which I'm showing here in red.

And all that added code is doing is it's just keeping track of

how many times the function gets called with a particular set of arguments.

So a particular pair of arguments, this pair of arguments here, right.

So we're essentially just adding up how many times this function is called for

that particular pair of arguments.

And it's storing the result in this variable called n.

So if we find that n is large, that means that we're being very wasteful, right.

We're doing the same call over and over again.

So if we call this new function and we give it our two arguments that we

gave it before, we get the answer three, which is correct.

But then now we can also print out the value of this variable n,

which tells us how many times it was called with those particular arguments.

And we get this large number, right.

It's almost 9,000 this number here.

So each time, for each of those 9,000 times or

so, we had to call and run the function all over again which was very wasteful.

We only really needed to run it once.

And then for all of those subsequent times when we wanted to know that edit distance,

we could have just returned the value that we remembered from before.

So let's rewrite the edit distance function,

in a way that avoids doing any of this redundant work.

So the way that we're going to rewrite the function is in terms of

a matrix that looks like this.

So in this matrix, the characters of the string x

are labeling the rows of the matrix, like you see here.

And the characters of the string y are labeling the columns of the matrix,

like you see here.

And each element of the matrix corresponds to a particular prefix of x and

a particular prefix of y.

The first row and the first column are both labeled with epsilon,

which just stands for the empty string.

So this is because we want to consider all prefixes, including the empty prefix,

the prefix that has length zero.

Over the course of our algorithm, our edit distance algorithm that we're going to

write down, we're going to fill in every element of this matrix.

And we're going to do it, we're going to fill in each element

with the edit distance between the corresponding prefixes.

So for example, the value that will go in this element here is going to be the edit

distance between the corresponding prefix of x, which is GCG, and

the corresponding prefix of y which is GCT.

So this element will contain one after we're finished filling in this matrix.

And the element that's all the way down here in the lower right hand corner,

that element is special because we will eventually write into that

cell into that element.

The edist distance between the entire string X and

the entire string Y, which is what we're after.

That's the edist distance the edist distance between X and Y.

So we can think of each element of the matrix as a different

pair of arguments to our recursive edist distance function.

And every possible pair of arguments is only going to be represented

once in this matrix.

We're only going to calculate it once.

And that's how we're going to avoid the problem of redundant work

that we saw before.

So we're not done yet of course.

We need a way of figuring out what are the edist distances

that we should put in the elements of this matrix.

So we're going to do so with the help of that same expression

that we wrote down in the previous lecture, this expression down here.

And, we'll see in a moment exactly how we use it.

But when we fill the matrix, we'll start by filling the elements that correspond to

short prefixes of X and Y up here.

Fill in these first.

And ultimately and finally we'll fill in the elements that correspond to

the long prefixes of X and Y down here.

So let's think about how to use this expression

to fill in a particular element of the matrix.

So let's pick an element.

Let's pick this element right here, the one highlighted in blue.

And in this expression, the three cases,

these three things that we're taking the minimum of, the three terms.

They correspond to three other elements of the matrix.

The first case course responds to the element that's above and

to the left, diagonally from the element.

Because we're recursively calling the edist distance a function

with a prefix of the first string that's one character shorter.

And a prefix of the second string that's one character shorter.

So that term corresponds to shorter prefixes of X and

Y, shorter by one character.

And that corresponds to the element of the matrix that's highlighted in red.

The second term here,

corresponds to the element of the matrix that's highlighted in green.

So that is taking the edit distance of the entire string X and

then it is a shorter prefix of the string Y.

So that's the element that's to the left.

And then this term corresponds to the entire string y, but

a prefix of the string x.

So that corresponds to the cell above.

So this color coding that you see here, shows you the correspondents between

terms in the expression and elements of the matrix.

So say that the matrix has already been partly filled in with edist distance for

all possible prefixes shorter than the ones that correspond to our blue

element of the matrix here.

In other words, we've already filled in all these elements like this and

now we'd like to calculate the value of the blue element right here.

So we can start by taking the value that's in the cell diagonally above and

to the left, which is highlighted in red.

And then we're going to add the delta function of the corresponding characters

from x and y.

So that's going to be the delta function evaluated of t and g.

Now T and G are not equal, so the delta function is going to evaluate to 1.

So that diagonal term is going to equal the edist distance in the diagonal cell,

which is 0, plus the delta function, which is 1.

So that equals 1.

So this second term tells us to take the value of

the element to the left, and to add one.

And the value of the element to the left is one.

So when we add one we get two.

And likewise, this third term is telling us to take the value of the element above,

and add 1, which again is 2.

So when we take the minimum of these three things, we're going to get 1.

And that 1 corresponds to if we had gone diagonally.

In other words, if we had taken the edit distance between the prefixes and then

added 1, corresponding to the substitution that we would need to turn T into G.

So, that's how we fill in the cells of the matrix, except I haven't yet

told you how to fill in the first row and the first column.

But that's not too hard, because the first row and the first column correspond

to the edist distance between the empty string and something.

And the edist distance between the empty string and

some other string is always equal to the length of the other string.

So in other words,

the way we would initialize the first column would be with ascending integers.

0, 1, 2, 3, 4, 5, etc.

And same thing for the first row.

0, 1, 2, 3, 4, 5, etc.

So here's the matrix with all of the elements filled in.

And this matrix is now telling us what the edist distance between the two strings is.

It's just the value that's all the way in the lower right hand corner,

which in this case is 2.

So the edist distance between x and y is 2.

So how fast is our new algorithm for finding the edit distance?

The recursive algorithm was slow.

How fast is this algorithm?

It's very very fast.

So the run time is just going to be a fraction of a second,

way faster than the recursive formulation that we were using at the beginning which

took over 30 seconds to do the same thing.

And why is it so much faster?

Well, it's really because our new method for

calculating edit distance avoids the problem we had before.

Which was that the function might be called over and over again for

the exact same arguments.

And it would run the function all over again,

even though we already knew the answer for that particular pair of arguments.

So with our new method, for any given pair of prefixes of x and

y, we will calculate the edit distance of those prefixes once and only once.

And we'll record the answer in the matrix and

we'll never recalculate that same edist distance.

So this kind of algorithm,

where we decompose the overall problem into smaller problems, while also

avoiding ever redundantly recalculating one of those smaller problems.

This kind of algorithm is called a dynamic programming algorithm.

And dynamic programming is a common algorithm design technique.

But this dynamic programming paradigm is actually going to be very useful.

It's useful for more than just finding the edist distance between two strings.

It's also useful in other biosequence analysis applications,

as we'll discuss in future lectures.