我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 Johns Hopkins University 的课程

DNA 测序算法

380 个评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Preprocessing, indexing and approximate matching

In this module, we learn useful and flexible new algorithms for solving the exact and approximate matching problems. We'll start by learning Boyer-Moore, a fast and very widely used algorithm for exact matching

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In this practical, we'll be implementing approximate matching

using the Boyer-Moore algorithm that we developed in a previous practical.

So to begin with, I've pasted into this notebook all the code from Boyer-Moore.

So we have all of the pre-processing code here and

then the Boyer-Moore function it's self and we're going to use this.

So let's start by creating a new function.

[SOUND].

And this will take as arguments our pattern p that we're searching for,

the text t that we're searching in, and

a maximum number of mismatches that we're willing to allow.

And so we're going to do this using the pigeonhole principle.

So we're going to divide the string p into n plus one segments.

And then at least one of those segments must match perfectly against t.

So let's start by finding the length of each of these segments.

[SOUND].

Just like that, and then we'll create a set that will

fill out with all the indices where we found matches.

Now say for each segment in P.

So for I up to N plus 1, we have to calculate the bounds of P for

the segment that we're searching for.

And the end will just be i plus 1 times the segment length.

But we want to make sure that we don't run past the end of p.

So I'm going to take the minimum of this, and the length of p.

>> So you segment length could either be a little bit longer or

a little bit shorter than a perfectly even length of the different partitions.

So you're just making up for that basically by re-sizing the last partition

so that it makes up for it, right?

Right, because ping might not be,

it's length might not be a perfect multiple of n plus one.

So make sure we don't go past the end of the string.

>> Okay.

So now, let's do our preprocessing, okay, our object.

[SOUND] I'm just going to use the sub string of p that we just calculated,

and we have to pass in the alphabet.

>> That's going to do all our Boyer-Moore pre-processing for us.

So it's going to make the tables for the good suffix rule and

the bad character rule.

>> Yep.

And now we can just run our Boyer-Moore algorithm.

Passing in our preprocessed object and the text t.

And this will return a list of places where that sub string p has matched to out

text t.

So now we have to step through each of those positions and

make sure that the rest of p matches t with no more than n mismatches.

So I'm going to say, for

each position n in matches [SOUND].

So the first thing to test is to make sure that our location doesn't let p run off at

the beginning or the end of t.

So I'm going to say if m is less than star,

four n minus star so it's length if p is greater than length of t.

If either of these is true, this means that p runs off the beginning of t or

past the end of t, so

in this case I'll just say continue to skip the rest of this loop.

And now we need to count the number of mismatches between the rest of p and t.

So I'll first compare the part of p before start.

So from zero to start against the corresponding positions in t.

Say up to range and zero to start.

So increment mismatches, if the character doesn't match.

And if the number of mismatches is ever greater than our maximum n,

then this doesn't work so we're just going to break.

[SOUND] So this will test the part of p before the segment that we already

compared, now we just have to compare the suffix after that segment

I'm going to do the same thing as in the loop above and commit mismatches

if it doesn't match, and if we've found more than n mismatches, then break.

>> So now we've looked to the left and we've looked to the right and

we've accumulated all the mismatches.

>> Yes and now let's just double check that if our number of mismatches

is no more than n, then we can add this to our set.

All matches that we had above.

And we don't want to add position m, we want to add m minus start,

to actually get the beginning of p.

And then when this is all done we just have to return all matches,

I'll convert it to a list.

So I made it as a set, so

that if we happen to find the same starting position from different segments

in p, the set will just store that as one position, it won't store duplicates.

So you won't get more than one of the same coordinate.

>> Yeah for example, you might call this function allowing up to two edits or

up to two mismatches and then p might match exactly somewhere

in which case all three partitions of p are all going to match.

And you're going to add that same element three times.

So by making it a set, we still get back what we want.

Which is just that one result, that one offset where p occurs within t.

>> Yep.

Okay, so now we've done this.

Let's test it, see if it works.

I'm going to make up a p and a t.

So now let's see approximate match p,

t, and we'll say up to two mismatches.

Let's see.

>> What does it say again?

Booster end is not an integer.

And why would that be?

>> Maybe one of those a floating point,

maybe it's saving it like it is because that is floating point division.

>> That's possible.

Let me try just converting it to an int like that.

And see if this works.

>> There we go.

>> Okay.

So it was a problem with, simulate is being stored as a float, but

it needed integer indices for the array.

So I just converted that to an int.

So p matches against t with up to two mismatches in two places.

So you can double check that this is right.

You can see that from position zero we have a mismatch here between a and c.

And the last character is a mismatch between g and a.

You can also, if we print starting at index five and

t is AATTTG so there's just a mismatch with the c there.

So that has one mismatch.

>> Mm-hm.

>> So if we were to do this with up to one mismatch,

only one of these matches should be found.

It should be the second one at five.

We see that's correct.

We no longer get the match at zero because that would be two mismatches.

And if we allow maximum number of mismatches of zero,

we now have no matches because there's no cases where p matches t perfectly.

Cool?