我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

279 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In this practical we'll be implementing edit distance using dynamic programming.

So I've already pasted here the recursive edit distance function

which was covered in lecture.

However, this function is pretty slow so

we're going to rewrite it using dynamic programming.

So I'm going to create a new function

at a distance which will take two strings, x and y.

We're going to compute the edit distance between x and y.

So the basic idea of dynamic programming is we're going to create a matrix that

we're going to fill in with the edit distances for the substrings that we

already computed to save time, so we don't have to recompute them multiple times.

So I'm going to create a matrix D,

and I'm going to initialize this as an x by y array of zeroes.

Or, sorry, an x + 1 by y + 1 array of zeroes,

because we want to start from the empty string.

So, this will initialize D as an x + 1 by y + 1 array of zeroes.

And now, we want the first column of that array to be zero, one,

two, three, four, and so on.

And we want the first row similarly to be zero, one, two, three, four and so on.

So I'm going to say for i in range up to length of x.

Should be x + 1.

>> These are initialized this way because any prefix of say length x and

an empty prefix.

The added distance between those two is just going to be x.

So that's why we initialize the first row on the first column in that way.

>> Right.

So now we want to step through and fill in the rest of our matrix.

So I'm going to go row by row.

So for each row n for each column.

Now we need to compute the value that will go at this location in D.

So we can get to this location either coming from the cell directly to the left,

which would correspond to a skip in y.

The cell directly above which would correspond to skipped character in x,

or the character above and to the left.

So, if we move horizontally from the cell to the left, this would be,

the edit distance of this cell would be value of the cell right to its left.

So i j + 1, + 1 since the penalty for a skip character is one.

And similarly, we can calculate the vertical distance that would be coming

from the cell above it.

This would be the edit distance of the cell above it, + 1.

For that skipped character.

And now, if we come diagonally, the penalty for

the edit distance depends on whether the characters match.

If the two characters right here match, then the edit distance won't increase.

So I can say if x(i- 1) equals

y(j- 1), then the diagonal

at a distance is just going to be the same as the value above and to the left.

But if these characters don't match, then we have to add one for

the penalty for that.

So now that we have three possible edit distances for this cell.

We want to minimize this, so

we're just going to take the minimum value for this cell.

So I'm going to say min of the horizontal,

vertical, and diagonal at a distance.

So we'll finish this loop.

This will populate the entire matrix with edit distances, and

then we want to get the value in the very bottom right, get the edit distance for

the entire strings against each other.

So I can just do that.

That will return the very bottom right value of that array.

And this is our edit distance function.

So let's test this out.

I'm going to start by running the edit distance recursive function.

And we'll see how long that takes and

then we'll compare the time against our new dynamic programming version.

So this line will just tell iPython to print out

the time that it takes to run this.

And say x equals.

Just give it two strings that are somewhat similar here And then I'm going to

Calculate the recursive edit distance for x and y.

And this is running.

You can see it's already taking a bit longer than we would expect.

These strings are only about ten characters long.

Takes about five seconds, and the edit distance is three.

You can see we have a upper and lower case character here.

We have a s matched up against the space here, or, that would be an inserted

space here, and then an inserted r here, for an edit distance of three.

So now let's do the same thing, but use our dynamic programming version.

When I run this, it computes in about 200 microseconds,

I think, so that's really fast.

>> That's better.

>> Edit distance is the same so

it works, so that's the dynamic programming version of edit distance.