我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 Johns Hopkins University 的课程

DNA 测序算法

323 个评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Preprocessing, indexing and approximate matching

In this module, we learn useful and flexible new algorithms for solving the exact and approximate matching problems. We'll start by learning Boyer-Moore, a fast and very widely used algorithm for exact matching

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

Up until now we've discussed a few different ways of solving the exact

matching problem.

Naive exact matching, Boyer-Moore, and then index assisted offline methods.

But exact matching isn't quite what we want.

The read will not necessarily match the genome exactly, and it's point of origin.

There are a couple of reasons why we expect there to be

differences between the read and the reference genome.

So, one of the reason we might expect differences between the read and

the reference is because of sequencing errors, like we discussed before.

Sometimes the sequencer will make mistakes.

It will miscall a base in the sequencing read.

And when that happens, that base might no longer match the reference genome.

Another reason we might see differences is because the reference genome is not

the same as the genome that we're studying.

Before we said that if you have two unrelated humans,

then their genome sequence are something like 99.8 or 99.9% similar.

But not identical, which means that every once in a while,

there will be a difference between a base in the sequencing read, and

a base in the reference genome.

So that's another reason why we expect there to be differences.

So algorithms for exact matching are not going to be sufficient.

We need algorithms that can do approximate matching.

Allowing for differences between the pattern and the text.

So, what kinds of differences might we encounter?

Well, here's an example.

Here's an alignment of P to T.

In this alignment we see one difference.

This particular kind of difference is called a mismatch or a substitution.

Here is another kind of difference.

This time the difference is that there's an extra character in P relative to T.

The dash that you can see at the top in red is called gap, and

so this G, that's highlighted in red,

is an insertion in the pattern P with respect to the text T.

Equivalently, we could think of this as a deletion in a text T,

with respect to the pattern P.

Here's another kind of gap we can find, like this.

This is just the mirror image of what we saw before.

So this is an insertion in the text T with respect to the pattern P.

Or the other way around, a deletion in the pattern P, with respect to the text T.

We want to be able to talk about the distance between two strings.

In other words, we want to be able to describe how different they are,

how many differences there are.

But we have to define exactly what we mean by distance.

So the first kind of distance we'll define is called hamming distance.

So if you have two strings, X and Y, that are of the same length,

we can define the hamming distance between X and Y as the minimal number of

substitutions we need to make to turn one of the strings into the other.

So in this example we would need, let's see,

one, two, three substitutions.

To turn X into Y or vice versa.

So we would say that there's a hamming distance of three between these

two strings.

Now here's a slightly different definition of distance.

This is called edit distance, or sometimes it's called Levenshtein distance.

The edit distance between two strings equals the minimal number of edits

required to turn one string into the other.

Where a single edit could be a substitution, or

it could be an insertion or a deletion.

So let me point out two important differences between edit distance and

Hamming distance.

So first of all, we're not limited just to substitutions in this case.

So with edit distance we have the option of using insertions and

deletions as well as substitutions.

The second difference is that X and

Y are no longer required to be the same length for edit distance.

This makes sense because the only way we can make one string longer or

shorter is by using insertions or deletions.

Here's an example.

What's the edit distance between this X and Y?

Well, if we look, we can see a substitution here.

And then if we look a little further along, we might notice that this string of

As here has one more A than this string of As here.

So that means there are two edits.

One substitution, and then one insertion in X, with respect to Y.

So we would day that the edit distance is 2.

Here's another example.

So what's the edit distance between this X and Y?

This example is a little harder.

It turns out that the edit distance, again, is 2.

But when you looked at the original X and Y,

it probably looked like the edit distance was going to be greater.

We'll come back to this point in a later lecture when we talk about algorithms for

finding edit distance.

But for the rest of this lecture, and for the rest of the lectures that relate

to approximate matching, we'll stick to hamming distance and to substitutions.

We'll worry about insertions and deletions later when we talk about edit distance.

So, here's our naive exact matching algorithm that we saw before.

Let's say that we wanted to adapt our naive exact matching algorithm

to allow mismatches.

In other words, we want to look for occurrences of the pattern P

within the text T that are within some hamming distance.

So exact matches are within a hamming distance of zero.

But we also want to allow for

matches that are within some hamming difference greater than zero.

So can we modify this algorithm to do that?

Yes, it's actually pretty simple.

The main thing that we have to do is prevent

the inner loop from immediately giving up as soon as we encounter a single mismatch.

Instead, we'd like the inner loop to keep track of how many mismatches have been

seen so far, and then only when we exceed the maximum number of mismatches allowed,

do we break out of the inner loop and give up on that alignment.

So here's an implementation of this idea,

and we'll see this implementation again in an upcoming practical session.

So we can easily adapt the naive exact matching algorithm

to the approximate matching setting.

But if you think about,

I think you'll agree that it's quite a bit harder to see how we can adapt, for

example, Boyer-Moore, to the approximate matching setting.

So what we'll study next though,

is a method that does allow us to do this in general.

It's a method that surprisingly will allow us to use any of the exact matching

methods that we've seen so far as a tool to solve the approximate matching problem.