我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 Johns Hopkins University 的课程

DNA 测序算法

344 个评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In the previous lecture we talked about overlaps and we said that overlaps

are the glue that's going to allow us to assemble the genome back together.

So in this lecture we'll learn about a way of representing all the overlaps

in one big structure so that we can more effectively assemble the genome.

So here's one overlap shown here, and here's another overlap.

And these two overlaps both involve the same read.

So this read here is the same in this overlap.

This is the same sequence here in this overlap.

So these two overlaps have a relationship with each other.

To represent all these relationships at once, we're going to build something

called a graph, or more specifically, a directed graph.

So if you haven't seen something like this before, it's pretty straightforward.

So here's a very simple example of a directed graph, and

there are two kinds of things in this graph.

There are nodes, which in this picture are drawn as ovals.

And there's an edge, which is drawn as an arrow here.

And the nodes and edges usually have some kind of meaning.

For example, here the nodes have labels that show us that they represent

characters from the play Hamlet.

And specifically, this node on the left is Hamlet and

this node on the right is Polonius.

And the edge also has some meaning.

For example, let's say that the meaning of the edge here,

the edge that points from Hamlet to Polonius, is Hamlet kills Polonius.

The nodes represent objects or concepts of some kind, and

then the edges represent relationships among them.

And this is called a directed graph when the edges have a direction, so

as indicated by the arrow here.

The relationship between Hamlet and Polonius is one way.

We're trying to say that Hamlet killed Polonius, not the other way around.

So this is a directed edge, and

a graph with directed edges is called a directed graph.

And this particular graph is quite small, clearly, so

let's take a look at a larger example of a directed graph.

So this graph is a graph of everybody in the play Hamlet who gets killed.

So everybody who gets killed in the play is a node here in this graph.

And one edge here,

this edge in the middle, is the edge that we saw before on the previous slide.

But then I put in edges for everybody who kills somebody in the entire play.

So for example, Hamlet kills Laertes, Laertes kills Hamlet, blah, blah, blah.

So that's what this whole graph is telling us.

So the concept is exactly the same as the graph on the previous slide.

So we can represent all of the overlapped relationships

in a data set of sequencing reads in the following way.

We can make a directed graph where the nodes of the graph

correspond to the reads in our data set.

And then for a pair of nodes, or for a pair of reads that have an overlap,

where the suffix of one matches a prefix of the other, we'll draw a directed edge.

So the edge will go from the node from the read that has the suffix

to the node that has the prefix that are involved in the overlap.

So that's shown in the example at the bottom of this slide here.

Now let's see an example where we have a complete overlap graph for

an entire data set.

So here we're going to start with a genome, which you can see written in blue

at the top here, and we're going to extract every substring of length 6.

We're going to create essentially a synthetic set of sequencing reads,

a fake set of sequencing reads, just so we can draw this diagram.

So we're going to extract every substring of length 6, or

every 6-mer, from the genome.

And those 6-mers will become the nodes of this graph.

And then we're going to draw an edge for every pair of 6-mers,

every pair of nodes, that have an overlap, and it's going to be a directed edge.

So we have to define here what we mean by an overlap.

Some overlaps are less convincing than others.

So for example, an overlap or length 1 suffix of one read overlaps

a length 1 prefix of another read, that could be just a coincidence, right?

So we probably don't want to include that as an edge in our graph.

We don't want to consider that an overlap.

So it makes sense to have some kind of threshold, and to say that as long as

the overlap is more convincing than this threshold, I'm going to count it

as an overlap and I'm going to draw the corresponding directed edge in this graph.

In this example, we have a threshold that says that we're

going to require that the suffix-prefix match, first of all, be an exact match.

And second of all, it has to have length at least 4.

So this is just for simplicity.

We could just as well have allowed the overlaps to have mismatches and

gaps in them, but here for simplicity we don't.

And we want to make them have length at least 4 to reduce the chance that we get

a not very convincing overlap between two of these reads.

So given that threshold, we can write out the entire overlap graph,

which is what's shown at the bottom of this slide here.

And here the nodes here correspond to reads, and

they're each labeled accordingly, and the directed edges correspond to the overlaps.

And here the edges are also labeled.

Each edge is labeled with the length of the corresponding suffix-prefix match.

So for example, this edge is labeled with a 5 because a length 5 suffix,

ACGTA, matches a length 5 prefix, ACGTA, here.

So you can start to see how this graph could be very

useful as we try to figure out what the sequence of the original genome is.

The sequence of the original genome is actually going to

correspond to a particular way of walking along this graph.

So what do I mean by that?

Well, in this particular example, if you look at this graph for

long enough, you'll realize that one particular path through this graph,

which is the one indicated here in red,

is special because it corresponds to the sequence of the original genome.

What do I mean?

Well, if you start in the first node here, let's say we start in this node.

One thing you'll notice is that this 6-mer, this read, GTACGT,

is the first 6-mer in the original genome, GTACGT, okay?

So let's say we're going to start here.

Then the next thing we might do is we might follow

this edge here which takes us to this node.

And again, we can see, because of how the edge is labeled, that there's a length

5 suffix of this string that matches a length 5 prefix of this string.

So that's TACGT.

And then this node has one more character, A.

And we can see, again, if we look at the original genome,

that the next character after that first 6-mer is, in fact, an A.

And so we can proceed in this same way.

From this node, we can go to this node here.

Again, there's a length 5 overlap between the two, and

there's one more character at the end of this node, which is a C.

And that corresponds to the next character in the original genome, etc.

So if we keep doing this walk from this node to this node and

then from this node to this node and then from this node to this node.

We walked along every 6-mer from the original genome sequence,

and from those we can infer the sequence of the original genome.

This example was very idealized.

In reality, the data that we get and

the resulting overlap graph are going to be a lot messier.

It's going to have sequencing errors.

That polyploidy issue that we discussed before is going to be a problem.

So it's not quite this simple, but

you can start to see how the overlap graph is a very useful structure for us to build

on on our way to reconstructing the original genome sequence.