我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 Johns Hopkins University 的课程

DNA 测序算法

303 个评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Preprocessing, indexing and approximate matching

In this module, we learn useful and flexible new algorithms for solving the exact and approximate matching problems. We'll start by learning Boyer-Moore, a fast and very widely used algorithm for exact matching

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We're looking for a method that allows us to take the exact matching algorithms

that we've studied already, but apply them to approximate matching problems.

So let's say we start with our pattern P, and the first thing we do is divide it

up into two pieces, two halves, which are labelled u and v here.

So u and v are two non-overlapping substrings that cover P,

basically just halves of P.

And I'll refer to these sometimes as partitions, so u and

v are two partitions of the string P.

The idea is this, if P occurs in a text T with one difference, with one edit for

example, then either u or v must appear exactly, without any edits.

In other words, the edit can change the sequence of u with respect to t so

that it doesn't match anymore.

Or the edit can change the sequence of v so that it doesn't match anymore.

But it can't change both, so one of them will still have to match exactly.

Which means that we can start to solve this approximate matching problem by using

an exact matching algorithm, any exact matching algorithm, to search for

occurrences of u and v within the text T.

So that's the one edit case.

What about the general case where we might want to allow more than one edit?

So let's say we want to allow, say k edits.

We can use something like the same principle but

instead of having two partitions of the pattern P, we have k + 1.

So the new principle is if P occurs in T with up to k edits,

then at least one of these k + 1 partitions of

P must appear with zero edits, with no edits.

In other words, it'll occur exactly within the string T.

So, this is a general principle now.

So if we want to find occurrences of P with up to say,

three edits, we can set k equal to 3.

We can divide P up into four partitions.

And then we can use some exact matching algorithm, maybe Boyer-Moore,

maybe an index-assisted algorithm, to look for

the occurrences of those four different partitions.

If we want k to be 10, then we can divide P up into 11 partitions, etc.

So why does this principle hold?

Well basically it's because if you have k edits and

you go about distributing them across these k + 1 partitions,

every time you put down an edit, you're changing the sequence of the partition so

that it doesn't match anymore.

But if you only have k edits to use, and there are k + 1 partitions,

then you can't change all k + 1 of them, you can only change up to k of them.

So here's a specific set of examples where I'm using these blue boxes to

represent the partitions and these red Xs to represent the edits.

So in the very first case at the top, I have five partitions and

I'm putting down four edits.

And the way I've put them down, three of the edits occur in one partition and

then one of the edits occurs in another partition, but

then three partitions are unaffected.

There aren't any edits in those partitions.

In the second example, I've put down four edits again, but

now they're spread across three different partitions.

So now two partitions don't have any edits in them.

The third case on the bottom is, in some sense, the worst case.

It's the case where the edits are distributed as evenly as

possible across the buckets.

So in this case the four edits are each in a different partition.

But even so, because we have five partitions but

only four edits, there's still one partition, the second from the right,

that's not affected by any of the edits.

So the principle that we're using here is a little bit like

the pigeonhole principle, which you may have heard of before.

So, the principle goes something like this, if you have ten pigeons and

you're going to put them into nine pigeonholes then at least one pigeonhole

is going to have more than one pigeon in it.

So, in this example the upper left hand pigeonhole has two pigeons in it.

Actually, our situation is a little different.

It's as though we have eight pigeons and nine pigeonholes.

In which case, we know that when we put the eight pigeons into the nine holes,

at least one of them's going to be empty.

In other words, at least one partition is not going to have any edits in it.

So, here's an illustration that shows the whole process.

Let's say we've divided P into five partitions,

perhaps because we're looking occurrences of P that have up to four edits.

And next thing that we'll do is we'll use an exact matching algorithm,

any exact matching algorithm,

to find exact occurrences of each of these five partitions within T.

So let's say the only match that we find is a match of partition P4, right here.

We find partition P4 occurring within T.

So that's our hint that there might be an approximate match

of P to T in the neighborhood of this partition match here.

So the next thing we have to do, is we have to have a verification step.

Just like we talked about before when we were using the index, and

we were querying the index with a k-mer extracted from the query.

We had to verify that wherever the index told us was an index hit

that the entire pattern P occurred in the vicinity of that hit.

Just like that here.

When we have a partition that matches exactly within T

we have to use a verification step to determine whether the entire string

P occurs within the neighborhood of that partition hit.

So that's what we're doing here.

So that's how we would apply this principle.

And, just to summarize, this principle, which we'll call the pigeonhole

principle since it resembles that example with the pigeons and

the holes, this is the bridge that allows us to use exact matching algorithms.

All the ones we've learned so far, you know,

the naive exact matching, Boyer-Moore, index assisted algorithms.

We can use those algorithms to find

approximate matches by dividing the pattern P up into partitions, looking for

each of those partitions using an exact matching algorithm.

And then, wherever we find an occurrence of one of those partitions,

we can do a verification step where we look in the neighborhood to see if we see

a full occurrence of P with up to the maximum number of differences allowed.