我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

280 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

The shortest common superstring problem was our first attempt to formulate

a computational problem that when solved would solve the genome assembly problem.

But we have a lingering issue with the shortest common superstring formulation,

which is that if the genome is repetitive, it will tend to over collapse the repeats.

It will assemble fewer copies of the repeats than are actually there.

So in this lecture we're going to start to consider an alternative algorithm.

So whereas before we used something called an overlap graph,

this time we'll use a slightly different graph called a De Bruijn graph.

Like the overlap graph, the De Bruijn graph is also a directed graph.

That is, it has edges and those edges have a direction.

So let me start with an example that's not really a De Bruijn graph yet, but

it gives you some idea of what we're going to be doing here.

So, say that we have the following phrase.

Tomorrow and tomorrow and tomorrow.

And we want to turn it into a particular kind of graph.

So first we'll make one node for

every distinct word that's present in our phrase.

And this is easy because there are only two distinct words in our phrase,

the word tomorrow and the word and.

And then for every pair of adjacent words in our phrase,

we'll add the corresponding directed edge to our graph.

So for the first two words, tomorrow and,

we'll add an edge that goes from tomorrow to and, like this.

And then for the next two words, and

tomorrow, we'll add an edge that goes from and to tomorrow, and so on.

So we'll end up with a graph that looks like this.

So note that this graph can have more than one edge pointing from one node to

another.

So for example, from tomorrow to and there are two edges.

So this is sometimes called a multigraph.

So our De Bruijn graph will be a multigraph.

Okay, so let's switch to DNA, and let's build our first De Bruijn graph.

So first we have to assume something about the sequencing reads that we're getting.

We have to assume that the sequencing reads that we get consist of each

of the substrings of the genome of length k.

In other words, consists of each of the k-mers from the genome.

And that each k-mer is sequenced once.

So for every offset in the genome,

we're going to extract that k-mer and then our sequencing

output is just going to be basically a list of all the k-mers in the genome.

So this is the genome that we're sequencing.

We're assuming that our reads look like this.

Here are all the 3-mers from the genome.

And this is not a very realistic assumption,

but we'll revisit this later on.

So we'll stick with this assumption for now, but

then later we'll talk about why it's not really a great assumption to make.

So now to build the De Bruijn graph.

We're going to examine each of these k-mers,

each of these sequencing reads in turn, and then for each k-mer we're

going to make some corresponding additions to our De Bruijn graph.

For each k-mer we're going to extract its left and right k-minus-1-mers.

So by k-minus-1-mer I just mean a substring of length k minus 1.

So in this example, k equals 3, so a k-mer is a 3-mer, and

then a k-minus-1-mer we'll call a 2-mer.

So here's a 3-mer and here are its left and right 2-mers.

So to update the De Bruijn graph, first we'll add a node for

the left 2-mer, which is AA.

And then we'll do the same for the right 2-mer, except for

that the right 2-mer is also AA.

So we don't need to add a new node, because AA is already there.

Then we add a directed edge that points from the left to the right 2-mer.

But in this case both the left and the right 2-mers are AA, so

we're going to make an edge that points from AA back to AA.

So that's the first k-mer.

Now let's add the second k-mer.

So again, we're going to look at its left and

right 2-mers, which in this case are AA and AB.

We'll add a node for the left 2-mer if needed, but

in this case it's not needed because AA is already there.

And then we'll add a node for the right 2-mer, which is AB.

And then we'll add a directed edge from the left to the right 2-mer,

so from AA to AB.

Let's do the same for the third k-mer, AB BB.

So we're going to have to add a new node for BB.

And then we're going to have to add an edge from AB to BB.

Let's do the same for this next one.

This one's BB BB, so we're going to add a edge that goes from BB back to BB.

Same thing again for this k-mer.

And then for the final k-mer we have BB and BA, so we're going to have to

add a node for BA, and then we'll add a directed edge from BB to BA.

So there we go.

So a few points about this graph that we just built.

So first of all, each k-mer in the genome corresponds to one edge in this graph.

That's how we built it.

For each k-mer we added its left and right k-minus-1-mers,

if they weren't already there, and then we drew one additional directed edge.

So for each k-mer in the original genome,

there's a corresponding edge in our De Bruijn graph.

And also, I'll note that there's a node for every distinct k-minus-1-mer.

So when we were adding the nodes and

edges, if a k-minus-1-mer already had a node in the graph, we did nothing.

But if it didn't have a node in the graph, then we added it.

So at the end of the day, the nodes are going to correspond to all the distinct

k-minus-1-mers from the genome.

So here's a question.

Let's say I hadn't shown you the genome sequence, and

all I showed you was this De Bruijn graph, which was built over the genome's 3-mers.

Could you then go and

reconstruct the sequence of the original genome from this graph?

So just think about that for a few moments.

So it turns out the answer is yes.

So a reconstruction of the original genome corresponds to a walk through the graph.

In other words, we can start at some node in this graph and

then move through the graph from node to node,

following the edges as we go, respecting the direction of each edge.

So we're always going to walk in the direction of an edge.

And we can reconstruct the genome as we go.

So the walk here that allows us to reconstruct the genome

is the walk that crosses each edge exactly once.

Another way of thinking of that is that it uses each k-mer exactly once.

So in the case of this graph here, the walk goes like this.

And it will look familiar because this is also the order we added the edges.

So we'll start here in this AA node.

And we're going to follow this first edge here which takes us from AA back to AA.

And at this point we've reconstructed the first three characters of the genome,

the first 3-mer of the genome.

Now let's follow the second edge.

That allows us to tack on a B to the reconstruction of the genome.

Now let's follow this edge.

We're going to tack on another B, and then we'll follow these two edges.

Both of the edges that go from BB back to BB,

and that causes two more Bs to be appended on.

And then we'll follow this last edge, and append one last A.

So we successfully reconstructed the original genome sequence here.

So here's a bit of vocabulary.

I'll walk through a directed graph that visits each edge,

that crosses each of the edges in the graph exactly once.

It has a special name.

It's called an Eulerian walk.

So not every graph that you can possibly write has an Eulerian walk,

but a graph that does is called an Eulerian graph.

And as it turns out, using the assumptions that we made about the sequencer,

saying that we get each k-mer of the genome exactly once.

If we go ahead and construct the graph according to this procedure,

the graph will always be Eulerian.

It will always have an Eulerian walk,

which is our way of reconstructing the genome sequence.

So this actually gives us a new algorithm for solving the assembly problem.

We take our reads.

We build the corresponding De Bruijn graph,

which is Eulerian as long as we had one read for every k-mer.

And an Eulerian walk through the graph gives us a reconstruction of our original

genome.

So we'll see this idea in action in the following lecture.