我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In the previous lecture and practical, we introduced a computational problem called

the shortest common superstring problem where given a set of input reads,

we looked for the shortest string, which is called a superstring,

that contains all of the input strings and sub-strings.

So this formulation had a downside though, which is that there were really no

efficient solutions to solving this problem.

So we saw a solution but

it was very slow because it involved us trying every possible ordering for

n different input strings and their n factorial such orderings.

So in this lecture we'll see an alternative that's actually much faster.

Its called greedy shortest common superstring.

And we call this algorithm greedy, because the algorithm will make

a series of decisions and at each decision point, it will choose the option that

reduces the length of the eventual superstring the most.

This seems like a good strategy the shorter the superstring the closer

we are to the shortest common superstring.

However as we'll see making the greedy decision at each point in the algorithm

does not necessarily mean that we'll get to an optimal solution.

We can visualize the greedy shortest common superstring algorithm

using an overlap graph.

So let's start with this overlap graph here.

So remember that the nodes of the overlap graph correspond to reads and

the edges of the overlap graph correspond to overlaps or

suffix prefix matches between pairs of reads.

Each edge here is labeled with a number that gets the length of the overlap

The length of the suffix prefix match.

So, the principle behind the greedy shortest common superstring algorithm

is that we're going to proceed in rounds.

And in each round, we're going to pick the edge that represents

the longest remaining overlap in the graph.

In other words,

we're going to pick the edge that has the greatest number as it's label.

And then we're going to merge the nodes on either side of that edge.

Picking the longest overlap seems to make sense because the longer the overlaps

between the strings, the shorter the final string will be.

So that's what we mean by greedy algorithm in this case.

We're always going to pick the longest overlap,

trying to get to the shortest superstring.

So let's see this principle in action.

So here we have an overlap graph and so to apply our principle,

we must first locate the edge that corresponds to the longest overlap and

well, in this case, there's a tie, right?

Because there are actually several edges that have an overlap length of two,

which is the maximal overlap length in this case.

So when this happens we just pick one of the edges at random.

So let's say that we pick this edge here, highlight it in red.

So now we're going to merge the nodes on either side of this edge.

We're going to merge AAA and AAB.

So when we merge nodes we replace them with one new node that corresponds to

the two labels of the original nodes glued together according to the overlap.

In other words if we took the strings and these nodes and

made a superstring based on the longest overlap between them,

that's the label that we'll put in the new node.

So this merge, there, we just did it.

So this merge has made the graph a little bit simpler, a little bit smaller.

It has one less node in it and it has at least one edge in it,

and actually we might remove several edges when we merge two nodes.

So now what do we do next?

We do the same thing again.

So again we're going to pick the edge with maximal overlap length.

Again there's a tie, so we're going to pick one at random.

So let's say we pick this one, highlight it in red,

and then again we'll merge these two nodes.

So we'll merge ABB and BBB to get ABBB.

So here's our new graph.

And we'll do it again, so now let's pick another node with maximal overlap.

Or, sorry, edge with maximal overlap.

Let's say we pick this one, merge, okay then, pick this edge, then merge, and

now we're done.

So the answer is sitting right in front of us.

Once we've done all that merging, there's just one node left.

And this node's label is the superstring that we've obtained from

greedy shortest common superstring.

If there happened to be multiple nodes at the end of this merging process,

in other words, if after one merge we've run out of edges, but

there are still multiple nodes, then we can create the superstring just by

concatenating the labels of those nodes together.

So this process is much faster than the algorithm we discussed in the previous

lecture and in the practical.

For that algorithm we had to try every possible ordering of the input strings.

But for this one, the amount of work we do is greatly reduced.

So we'll see examples in the upcoming practical session

where our greedy algorithm can finish very quickly,

much more quickly than our original algorithm that tried all orderings.

But this speed comes at a cost and

the cost is that the greedy algorithm doesn't always find the correct answer.

So that is the greedy algorithm can sometimes report a superstring,

a string that does contain all of the input strings, but

it's not necessarily the shortest common superstring.

So let's look at an example where the result of the greedy shortest

common superstring algorithm returns a superstring but not the shortest one.

All right, so this time rather

then drawing the overlap graph, we'll visualize what happens in this way.

So we'll simply write out the labels of all the nodes on a line like you see here.

So this first line here has the labels of all five of the nodes that we saw

in the previous overlap graph.

And then for each round of the algorithm we'll write a new line.

And then the new line will reflect

the merging that happened in the previous step.

So, for example, the line that you see here

is the result of having merged these two nodes from the previous step.

And the nodes that get merged I'm always going to highlight in red.

So if we repeat this process again and again until we get to the end of

the algorithm, we get to this answer that you see down here.

This is the result of the greedy shortest common superstring algorithm in this case.

There were ties along the way.

There were cases where we had a tie in terms of which

nodes we could glue together, in terms of which overlaps had the greatest weight.

And we broke them in a certain way, and

it's a little bit different from the way we broke the ties before.

And as a result, we actually ended up with a different superstring.

And in fact, in this case, if you look at this line here,

here are two strings that have no overlap with each other.

There's no suffix, prefix match between these two strings.

So, this was a case where if we'd drawn out the graph,

we would have seen that the graph had two nodes left and

zero edges left at the end of the repeated merging of nodes.

So we concatenated them together to get the final superstring.

So is this the shortest common superstring?

It turns out that no it's not, right.

So we can show this easily enough because the superstring that we got from

the previous run of the algorithm, where we made slightly different choices about

which edges to merge, this superstring is actually shorter

than the one that we obtained when we ran the algorithm this time.

So we did not get the shortest common superstring

running the greedy shortest common superstring algorithm.

So now we've seen the greedy algorithm and we've seen that while it's faster than

the slow algorithm that we tried before, it has the problem that it doesn't

necessarily return the shortest common superstring.

It might return a super string that's longer.

In the next lecture we'll examine another down side of both common super sting and

the greedy version of that algorithm

which will bring us to the third law of assembly.