我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

280 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In the last practical, we implemented the brute force

method of calculating the shortest common superstring.

The problem with this method is that we have to compute the superstring for

every permutation of reads, and, as our list of reads grows,

this will become exponentially large and it will take forever to run.

A better way of doing this is using the Greedy algorithm,

which is going to be approximate, it will get somethings wrong.

But it will be much faster, and much more feasible to run in practice.

So, I've already pasted in here the overlap function,

which we've seen a lot of times.

I've pasted the shortest common superstring

function we wrote in the last practical, and

now we're going to implement the Greedy, shortest common superstring function.

So the way this is going to work is we're going to to look our set of reads, and

we're going to first find the two reads with the maximum overlap.

And then we're going to combine those two reads, and we're just going to repeat

that until we've combined all the reads into a single thing.

So a helper function I'm going to write is pick maximal

overlap, and this will take a set of reads and a minimum overlap, K, and

it will return the two reads with the maximum overlap along with that overlap.

So, to do this I'm going to define read A and

read B to be none, and

then I'll keep checking my best overlap length, which initially will be zero.

And I'll say for each pair of reads A, B,

in intertools.permutations, two.

So I'm going to take pairs of reads.

So for every pair of reads, I'm going to calculate the overlap length.

And then if this overlap length is larger than the best overlap length that

we've seen so far, store these reads as our best ones that we found.

So I'll say if overlap length is greater than best overlap length.

And best overlap length equals overlap length.

And when we're done all this, we'll just return the best read A,

read B, and overlap length that we found.

>> So one of the things that might happen is there might be multiple pairs that

are tied for the longest overlap and the way we've written this function basically,

just the first one we encounter,

the first one encountered by the loop is the one that we're going to import.

Right.

So now let's write our Greedy shortest common superstring function.

This one will also take a set of reads and minimum of overlap K.

Like I mentioned, we're going to start off by just calculating the two reads,

A and B, with the best overlap, the maximum overlap that we can find.

So I'll use my pickMaximalOverlap function.

And now, I'm going to put a while loop,

as long as the overlap length is greater than zero.

We're going to replace reads A and B, in the set of reads, with their overlap.

So I'll do reads.remove(read_a).

Remove read_b.

And now we'll append to this list the combination of these reads.

So that will be read A plus the suffix of read B,

since we don't want to contaminate the overlap region.

And so what we've done is we've found the two reads in our set that have the best

overlap, and we replaced those two reads with the overlap read.

And we're going to keep doing this until we have one read left,

which will be our shortest common superstring.

So now I just need to recalculate the new AB and

overlap length with the greatest overlap.

And when this is done, I just want to.

So the remaining reads, when we're done with this,

will be all of the reads that don't have any.

So in this case we can just join them all together,

it doesn't matter and this will be our shortest common superstring.

>> Like we said in the lecture, as we merge and merge and

merge we might get to the point where there are no more edges, no more overlaps

left in the overlap graph, but there are still multiple nodes left, in which

case we can just concatenate the node labels together to get our super screen.

>> Yeah.

So now let's try running this on simple set.

So I'll just give it a few.

So all these strings overlap by two, so we should be able to get a nice,

short string from this.

So you're going to have a minimum overlap of two, and like you

would expect those strings all overlap by two so it works out very nicely.

All right, so the problem with this is it's not always going to give us

the correct answer, right?

Mm-hm. >> So let's see if we can find an example

where the Greedy method does not give us the shortest string.

So I'm going to get us three strings here,

minimum overlap of one.

>> Okay.

>> Okay, so this is the Greedy method.

And let's run it with the brute force method and

see what the best one for that is.

So you can see on this that the shortest common superstring is actually shorter

than the Greedy superstring, and the reason for this is to

get the shortest common superstring, you can just combine these reads in order, and

each one overlaps by two with the one after it.

>> Mm-hm.

>> But the first string and the third string actually overlap by three.

So the Greedy algorithm will initially merge those two together

into a larger string, and then it has no option but

to just append the middle string to the end since it doesn't have any overlap.

>> Mm-hm.

>> So it actually does a bit worse by doing the Greedy method here.

>> Mm-hm.

Right.

>> But that trade-off in terms of accuracy, in most cases,

will be worth the time you save by doing the Greedy method right?

>> Mm-hm.