我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Edit distance, assembly, overlaps

This week we finish our discussion of read alignment by learning about algorithms that solve both the edit distance problem and related biosequence analysis problems, like global and local alignment.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We've seen dynamic programming algorithms for finding the edit distance between two

strings and also for finding approximate occurrences of one string inside another.

In this lecture, we'll look at a couple of variations on the theme

of dynamic programming for edit distance and we're going to use it to solve two

different problems called global alignment and local alignment.

These are two very commonly solved problems in lots of

biosequence analysis tools and they're both relevant for

our target problems of read alignment and genome assembly.

So let's start with global alignment.

Edit distance penalized all the different kinds of edits that we might find

the same amount, right?

Whether it was a mismatch, or an insertion, or a deletion,

edit distance would penalize that at the unit of 1.

Each one of those things incurred a penalty of 1.

So there is no difference, for example, between the penalty for insertion versus

a substitution, and there is no difference between the penalty for a substitution

that turned an A into a C versus a substitution that turned an A into a T.

So it doesn't always make sense though,

to penalize all these differences the exact same amount.

So what if it turned out, for example,

in practice that gaps were a lot less frequent than substitutions?

Well, then it might make sense to penalize the gaps more than the substitutions.

Or what if it turned out that certain substitutions,

certain base to base substitutions were more likely than others?

Then we might want to penalize them less than those others.

So in fact both of these things are the case.

Some DNA substitutions are more likely than others.

So for example, we can divide all possible DNA substitutions

into two categories called transitions and transversions.

And to understand what these are, we first have to understand that the bases A and

G, adenine and guanine, both belong to a category called purines,

and then the bases C and T both belong to a category called pyrimidines.

And for substitutions that change purine to another purine or

that change a pyrimidine to another pyrimidine,

those kinds of substitutions are called transitions.

And then all other kinds of substitutions are called transversions, and so

if you just enumerate the possibilities you'll see that there are twice as

many kinds of transversion, as there are kinds of transitions.

So you might think that transversions are going to be something like twice as

frequent as transitions.

But it turns out that, in reality,

if you look at the substitutions that differentiate, say, the genomes of two

unrelated humans actually transitions are about twice as frequent as transversions.

So it's the other way around from what you would expect.

So if transitions are so much more frequent than transversions,

it seems like in our penalty scheme we might want

to penalize the transversions more than we penalize the transitions.

Another example is small gaps, so small insertions and deletions in the genome.

So it turns out that if you take the genomes of two unrelated humans, and

you just count up how many differences there are that are substitutions,

versus how many differences there are that are small insertions or

deletions, Insertions or deletions of just a few bases.

You find that the substitution rate is something like 1 in a 1000 bases,

whereas the indel rate, indel meaning insertions and deletions.

The indel rate is something like 1 in 3000 bases or so.

So indels are less frequent than substitutions.

So we might want to penalize indels more than substitutions.

So, we want to move beyond edit distance where every kind of difference

receives the same penalty.

And instead, we want to assign different penalties to different kinds of mutations,

different kinds of events.

So here, for example, is what's called a penalty matrix.

So this matrix has an element for

every kind of penalty that might be incurred in an approximate match.

Some of the elements correspond to substitutions like these yellow and

green elements.

And some of the elements correspond to insertions and

deletions like these pink elements.

But a key point is that we can set the elements of this matrix in whatever way is

appropriate for our biological context.

So, if we're working with DNA and we're comparing the DNA of two different humans,

then we might use a substitution matrix like this where transitions receive

a lower penalty than transversions.

And then transitions and transversions receive a lower penalty than gaps.

So remarkably, we barely have to change our edit distance algorithm at all in

order to accommodate this new penalty matrix.

When it comes to filling in our dynamic programming matrix,

all we really have to change about the algorithm is we have to change the value

that gets added on when we calculate the three contributions, the diagonal,

vertical, and horizontal contributions.

In the case of edit distance, for an insertion or

a deletion, for a vertical or a horizontal move through the matrix,

we just added 1 onto the edit distance of the shorter prefixes.

Now we're going to add a value that we look up in our penalty matrix.

Also, before when we were considering a diagonal move,

we would use our delta function, basically to figure out whether the characters from

x and y were equal, in which case the delta function evaluates the 0, or

whether they were not equal in which case the delta function evaluates the 1.

Now we'll just again,

do a lookup in our penalty matrix to determine what the penalty should be.

Apart from this, the algorithm is pretty much identical to the algorithm for

edit distance.

Both in terms of how we fill the matrix and

also in terms of how we find the trace back.

So that's global alignment.

So here's a global alignment matrix filled in.

Here's a trace back.

It's all very much analogous to what we did with edit distance.

So global alignment is powerful in the sense that it gives the user the ability

to set each of these different penalties according to the biological

problem at hand.

So local alignment, on the other hand,

is a somewhat different kind of problem that's being solved.

Although, surprisingly, it will turn out that the solution

is quite similar still to the solution for edit distance.

So in the local alignment problem, we're not trying to find the distance between

two strings, and we're not trying to find occurrences of one string inside another.

Here, the problem we're trying to solve is,

given these two strings X and Y, we'd like to identify the substring of X and

the substring of Y that are most similar to each other.

And we'll define more clearly later what we mean by most similar to each other, but

first, let me show a quick example to motivate this local alignment problem.

So here are two strings.

They're in English so that the similarities are more obvious to us.

And the question is, can you find a pair of substrings, one from X and one from Y,

that are the most similar?

So looking at this example it seems like these substrings in the middle here

are the most similar.

So they're at hemming one distance from each other but

they have nine matching positions with each other.

So this is approximately the kind of problem we're solving

with local alignment.

Now, in a sense this seems pretty difficult compared to edit distance

because we have to somehow consider all possible pairs of substrings from X and Y.

And if you just consider how many possible pairs there are,

it's going to be related to the product of the squares of the lengths of X and Y.

So that's a lot of pairs that we might have to be sifting through, but

it's going to turn out that actually the mount of work we do is no different from

a global alignment problem of the same size.

So, here's the recurrence for local alignment here.

The differences between this recurrence and

the recurrence we use for global alignment are highlighted here in red.

And instead of using a penalty matrix, we're going to use something similar which

we'll call scoring matrix which is shown in lower right.

And an important difference between the scoring matrix and the penalty matrix is

that in the scoring matrix we're going to give a positive bonus for a match.

And then we're going to give a negative penalty to all the different kinds

of differences.

So in this case,

substitutions are going to have a penalty of -4 and gaps have a penalty of -6.

So now, if we use this recurrence and this scoring matrix to

fill in a dynamic programming matrix we'll get a matrix that looks like this, and

you'll notice that many of the elements are 0.

So, intuitively, this is because the goal of local alignment is to find parts of X

and Y that match closely enough that they sort of pop out from the background of 0s,

from the background of dissimilarities.

And so you'll see, for example here,

the values that are non-0 are mostly over here sort of in this region here.

And if we want to find that local alignment,

the one that gives the best similarity between two substrings,

first we're going to look for the element in the matrix that has the maximal value.

And in this case that's this 12 right here.

That 12 is the maximal value.

And that corresponds to the optimal local alignment.

And then if we want to know which substrings are involved in that alignment,

and exactly what the shape of that alignment is,

we can use our usual trace back procedure.

We'll start in the cell that's labeled 12.

We'll do our usual trace back thing.

The only way that we'll modify our typical trace back procedure,

is that we're going to stop as soon as we reach an element whose value is 0.

So once we've done that,

we've now, first of all we've figured out which are the substrings from X and

Y, that are the most similar, they're highlighted here in green.

So here's the substring from Y, and here's the substring from X, and

we also know the shape of the alignment between those two substrings shown here.

So this is a very useful tool.

Both global alignment and local alignment are very useful tools.

And they're used in many different kinds of biosequence analysis tools including in

realignment tools and assembly tools.