我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 Johns Hopkins University 的课程

DNA 测序算法

368 个评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We just saw how to create a de Bruijn graph, and

we also learned what an Eulerian walk is.

It's a walk through the graph that goes from node to node and

that crosses each of the edges exactly once.

And we said that this corresponds to a reconstruction of the original genome, and

so it gave us a new algorithm for

reconstructing the sequence of a genome from a collection of sequencing reads.

So let's apply this idea in a couple of scenarios, and

let's see how it works in a couple of more complicated examples.

And we also want to see whether it actually solved the issues that we

had with the shortest common superstring formulation.

For example, the shortest common superstring had a tendency

to overcollapse repeats, so we're going to see if we have the same problem here.

Here's a de Bruijn graph, for a familiar example.

This is the genome that the shortest common superstring had trouble with.

So we collapsed the three copies of the word long into just two copies.

Well, when we draw the de Bruijn graph for this string, for this genome,

using a camera length of five, and assuming that the sequencer sequences each

camera exactly once, then we end up with the picture that you see here.

We said previously that de Bruijn graphs produced in this way are Eulerian,

so they have an Eulerian walk, and so if we look at this graph we ought to be able

to find a walk through it that reconstructs the original sequence.

And we can.

So here it is.

We start here, and we spell out

a long, long, long time.

So there you go, that's the Eulerian walk through the graph.

So again, the Eulerian walk did what we said it should.

It gives us the reconstructed genome sequence back again.

But the point is that it, actually in this case, did not overcollapse that repeat.

It did not overcollapse the repeat long_long_long.

It gave us the original sequence back, and that's an improvement over the shortest

common superstring formulation of the problem.

So that's great.

But let's not be lulled into thinking that we've solved all our problems.

So we actually can't escape from the third law of assembly which states that

repetitive portions of the genome make assembly difficult.

And let's see exactly how they make assembly difficult when

we use an algorithm that's based on the de Bruijn graph, like we are here.

So here's another example, and this time things won't be quite so straightforward.

So the de Bruijn graph that you see on the right is the graph that corresponds to

the genome on the left using a camera length of three, and

as always we're assuming that each camera is given to us exactly once.

And this graph on the right is Eulerian like we said before because there's a way

to walk along the nodes of the graph such that we cross each edge exactly once.

However, there is more than one way to do so,

so there's more than one Eulerian path for this graph.

So one of those ways looks like this.

Let's start in ZA, and we're going to start by going down to AB.

And then we can go up to BE, and then along here,

and then down here and along here, and then down to BY.

And that's one of the Eulerian paths.

The other one goes like this.

We start from ZA, we go to AB, and we start by going down to BC and

going around like this.

Then we go up to BE and go around like this.

Then we go down to BY.

So here are those two walks that we just took,

the series of nodes that we visited along those walks.

So we found two different ways of

walking along our de Bruijn graph in order to try to reconstruct the original genome.

So there's ambiguity.

We haven't escaped the third law of assembly here.

Instead, the way that it manifests itself is, it means that the de Bruijn graph for

repetitive genomes will, in general, have many different Eulerian walks, and only

one of those walks actually corresponds to the original genome sequence.

All the rest of the walks correspond to incorrect reshufflings

of the genome where portions of the genome that occur between repetitive sequences

are going to get shuffled around and put in the wrong order.

So in this case, again, repeats made assembly difficult, and

specifically the repeat that gave us trouble was the fact that this,

the tumor AB appeared in three different places in our genome sequence.

Okay. So let's see a slightly larger example of

essentially the same idea.

So, here I'm taking, I'm showing some Python code that takes a genome.

The genome's shown at the top in blue.

It's in English.

And then we build a de Bruijn graph using a camera length of four, and

then we call a function of the de Bruijn graph class here which is called

eulerianWalk which simply finds the Eulerian walk and

just returns basically a list of all the nodes visited during that walk.

So, we won't have time to talk about it here, but

the implementation for eulerianWalk is actually very simple.

So eulerianWalk is a very simple and efficient function.

So, this next line concatenates together the labels of all the nodes visited

over the eulerianWalk, which is a very simple thing to do.

It respects the overlaps between them so

it actually starts by appending the first node label, and

then it appends one more character for each subsequent node visited.

And then it simply prints out the superstring that it's put together.

So in this case,

you can see that the answer that we get is actually the real genome.

So this is the genome that we started with.

So we get the correct answer back.

So that's great.

So now let's repeat this entire experiment but

using a camera length of three instead of a camera length of four.

And by decreasing the camera length, we're increasing the chance of being

affected by repeats since the smaller k means there's more likely to be

multiple occurrences of any given k-mer in the genome which is going to increase

the chance that we have multiple different Eulerian walks for the same graph.

And indeed, as you can see all the way at the bottom here,

our reconstruction of the genome is incorrect, so

it mixes up the order of these words that start with T right here.

All right?

So this is the curse of the repetitive genome which

in the case of the de Bruijn graph it results in a spurious reshuffling

of portions of the genome like you can see here.

So now back to one other issue that we've been putting off,

our assumption about the sequencing data.

So we've been assuming that the sequencer is kind enough

to give us exactly one read per k-mer, and

without accidentally leaving out any k-mers, or giving us any k-mers twice, or

anything like that, or without sequencing errors, and things like that.

But of course none of these assumptions are actually true in practice.

The sequencer gives us sequencing reads, which are usually going

to be much longer than the value of k that we pick, by the way.

So typical values of k are around 30 to 50 or so, whereas sequencing reads

are usually around 100 to 200 or tp 300 or so bases long.

But of course, the sequencer does make mistakes.

There are sequencing errors.

The coverage is uneven.

It's not necessarily going to give you the desired number of copies of any particular

bit of the genome.

So given this reality,

we can certainly still use our procedure for building the de Bruijn graph.

In other words, we can take each read and chop each of the reads up into k-mers,

and then for each k-mer split it into it's two k minus 1-mers and

add the corresponding nodes and edges to the graph.

We can still do that,

but what we end up with is a graph that's not necessarily Eulerian.

In fact, it'll never be Eulerian in practice.

So, for example, if we have this genome, and

the sequencing reads reported by the sequencer are the three

reads that you can see here, and we're going to use a k-mer length of eight.

So when we chop those reads up into k-mers we get the k-mers that you see down here.

So we can still build a de Bruijn graph out of these k-mers.

Right? We can build a de bruijn graph out of any

set of k-mers.

Then we can do that, and this is the graph that we get.

And if you look at this hard enough,

you can figure out that this graph is not Eulerian.

There's no Eulerian walk of this graph, but it's almost Eulerian.

I mean, if you go like this, if you go a long, long, long,

time, you'll notice that it was almost Eulerian.

The problem here was right here,

the number of edges going from this node to this node.

So finding Eulerian walks will not necessarily solve the assembly

problem for us.

Now that doesn't mean what we just investigated was pointless.

So for one thing, it was an instructive example of how repeats make assembly

different no matter what algorithm we use, and we solved two different algorithms in

two different ways that the repeats made the assembly problem difficult.

But, another reason is because this de Bruijn that graph we studied,

the de Bruijn graph representation is actually a very common and useful way to

represent assemblies, to represent the assembly problem, and in fact,

many modern software tools for assembly use de Bruijn graphs as the internal

representation of the relationships between the sequencing reads.

So while shortest common superstring and Eulerian walk are both flawed

formulations of the assembly problem, the overlap graph and the de Bruijn graph

are both going to continue to be very useful to us in practice.