我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

279 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We discussed two computational formulations of the assembly problem and

algorithms for solving it, but

we had to admit that those algorithms had some pretty serious problems with them.

So now, let's talk about how real software tools for

solving the assembly problem work.

And how they deal with repetitive genomes and other issues in practice.

So we talked about two different kinds of graphs, overlap graphs and

De Bruijn graphs.

These were used in different algorithms.

The overlap graph was used in shortest common superstring, and

the De Bruijn was used in the Eulerian walk algorithm.

And so even though we won't be using shortest common superstring or

Eulerian walk, we're still going to use these graphs, and

the different graphs correspond to two different categories of assembly programs.

So assembly programs on the one hand that use the overlap graph

are sometimes called overlap-layout-consensus programs.

And on the other hand, programs that use the De Bruijn graph are sometimes called

De Bruijn graph-based programs.

So in both cases, though, the first step is to build the graph, and

we've already seen how to do this for these two types of graph.

Once we have the graph,

well, we're not as close to having the final answer as you might hope.

The graph is going to be big and messy and

here's an example of a big, messy De Bruiijn graph.

This graph is going to be a,

a lot messier than all of the tiny examples that we've seen so far.

And the walk through the graph that corresponds to the correct

assembly is not going to be obvious at all.

So why is the graph so messy?

Well, there's a few reasons, so first of all we have sequencing errors and

sequencing errors can introduce sort of spurious little diversions in the graph,

little dead ends.

And one of our jobs, if we want to know the true reconstruction of the genome,

is to get rid of or otherwise ignore these dead ends.

So here, for example, is a dead end that's caused by one sequencing error.

And then also these little disconnected bits of the graph off floating around,

those are also due to sequencing errors.

So another reason the graph can be messy, and this point is specific to the overlap

graph, is that it can contain edges that, while they're not incorrect,

they don't actually tell us anything that we didn't know already.

What do I mean by that?

For example,

here we have a small overlap graph, and now we'll highlight a few edges here.

These three edges, three overlaps here at the top, and

two are in blue and one is in green.

And now this is an overlap graph so the notes correspond to reeds and

the edges correspond to overlaps between the reeds and

each edge is labeled with the length of the overlap.

So if you look at these three edges, and

stop to think about what they mean, you'll realize that the presence of

those two blue edges implies the presence of the green edge.

In other words, if you just told me that those two blue edges were there,

I could figure out that the green edge was also there.

The blue edges are saying that there's an overlap of six between these two reads,

and an overlap of six between these two reeds.

And all the reads are of length seven.

And when you put those facts together,

you realize that there must be an overlap of length five between these two reads.

So the green edges, in some sense, implied by the two blue edges.

Or another way you can say that is that the green edge is transitively inferable

from the blue edges.

And then you can see another example down here.

This green edge with the label 4 is transitively inferable by the three

edges that are labeled 6.

Another reason the graph can be messy is because of polyploidy.

So an example of polyploidy that we're familiar with is the human genome.

And so for each chromosome of our genome, we know that we inherit two copies,

two versions of that chromosome.

One from our mother, and one from our father.

And for some positions in the genome,

some bases in the genome, you'll find that the version inherited from the mother and

the version inherited from the father are not the same.

There's a different base inherited from the mother and the father.

Like in this example here, the blue and the red base.

So now let's picture what the corresponding De Bruijn graph would

look like.

It's going to look like this.

And they key point here is that the two bases

present at this position in the middle of this stretch of DNA.

The base that varies between the two parents,

results in this kind of bubble shape that appears in the De Bruijn graph.

This bubble here.

And the nodes and edges along one edge, one side of this bubble correspond to

the camers that are only present in the maternal copy,

because they overlap the base that comes from the mother.

And then along the other branch, are the camers that are only present

in the paternal copy because they overlap the base that came from the father.

And so this is not a mistake.

This is not like a sequencing error.

This is a real base that just happens to be present in two different copies.

And so one way to deal with this is to collapse the graph,

get rid of the bubbles so that there's just a straight line walk.

But then keep a note that says, oh, by the way.

At this position I noticed that there's a different base in the mother and

in the father.

And then one last way that the graph can be messy is because of repeats,

because of repetitive DNA.

And as we've seen, what repeats do is they'll create ambiguities in the graph.

And if we leave these ambiguities in the graph and

then try to reconstruct the genome sequences, then we'll probably make

mistakes, different mistakes depending on what method we use.

For example, shortest common superstring was vulnerable to

overcollapsing repeats and then the De Bruijn graph Euler and

Walk technique was vulnerable to shuffling around.

Bits of the genome in between the repetitive bits.

So either way, we got the wrong answer, so how do we deal with this?

Basically we deal with it by chopping the assembly into pieces.

What do I mean?

Even if the graph as a whole has ambiguities in how the genome should be

reconstructed, There will be pieces of the graph for which there's no ambiguity.

For this piece of the graph, it's clear what the reconstruction is.

So maybe we can't put the whole puzzle together, but

we can still put portions of the puzzle together.

So for example,

here is an overlapped graph, where the overall reconstruction is not clear.

But this part here is not ambiguous.

We know how to get through that part of the graph.

We know how to reconstruct that part of the genome.

And again, there's only one way to walk through this part of the overlap graph.

And this part in the middle, we know what sequence is there but

we don't know how many times to walk through it.

And so these partial reconstructions that we can put together unambiguously

are called contigs, which is short for contiguous sequences.

So, this is a principle that real world assemblers use when the overall

graph is ambiguous, which it almost always is because of repetitive DNA.

We cannot reconstruct the original genome unambiguously,

that's what the third law says.

So instead what we do is to identify portions of the graph that

can be reconstructed unambiguously and assemble those.

And so at the end of the day instead of reporting a single assembled genome

sequence an assembler will report a set of contigs.

That's sort of the best we can do.

Nearly every assembly that's been produced, with the exception of

pretty simple genomes, like bacterial genomes, for example.

Every assembly is fragmented, it's in pieces.

Even the human reference genome,

which is the most studied genome on the planet, still has holes in it.

So for example this figure is showing a result from a recent effort to

fix problems with the human reference genome.

And everywhere where you see a red diamond, so

these are the chromosomes here, labeled with what chromosome they are.

So here's chromosome 16.

And everywhere where you see a red diamond,

like this red diamond here or this red diamond here,

that's a place where a gap in the assembly was fixed as part of this study.

So, that's a place where the space between two fragments was filled in so

you could merge the two fragments.

But, there are many, many, even after efforts like this,

there are many gaps that remain in the human genome assembly.

That's just something we have to live with because of repetitive DNA.