我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Preprocessing, indexing and approximate matching

In this module, we learn useful and flexible new algorithms for solving the exact and approximate matching problems. We'll start by learning Boyer-Moore, a fast and very widely used algorithm for exact matching

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We discussed how to build and query a k-mer index, an index that's built by

taking all the k-mers of the text T and adding them to a data structure that maps

each k-mer to a list of all the offsets where it occurred in the text.

So this kind of data structure is called a multimap.

It's a map because it associates keys, k-mers,

in this case with values, offsets in the genome.

And it's a multimap because a k-mer may be associated with many different offsets

in the genome, right?

A k-mer could occur many places within the genome, within the text.

So what kind of data structures can we use to implement a multimap?

So we'll discuss two of them here.

The first one is based on ordering, like the index of a book, and

the second one is based on grouping, like the aisles of a grocery store.

So let's talk about the first data structure, which is based on ordering.

So here at the top of this slide, I have a key-value pair.

The key is a 3-mer from the text T.

It's the very first 3-mer.

And the value is the offset where that 3-mer occurs.

So we make a key-value pair for every 3-mer at every offset within T.

Here we go.

There they all are.

And then finally, we order them.

We put them in order alphabetically by 3-mer like this.

So now, this is our index data structure.

It's simply a list of 3-mer offset pairs ordered alphabetically by 3-mer.

So now how do we query this index?

Say we have a pattern P from which we extract a 3-mer, and

we'd like to query the index with this 3-mer.

And the way we're going to do this is with something called binary search.

So, what is binary search?

So, going back to our index analogy for

a moment, let's say we're looking up a term in this index.

We're looking up the term memory.

And, so we flip to the exact middle of the index and we look in the middle.

And we find the key term that's there, and

let's say the term directly in the middle of the index is light.

So light comes alphabetically before memory, so we know that we can completely

ignore the first half of the index, up to and including the term light.

So memory must be in the second half, so then what do we do next?

Well we do the same thing but just for the second half of the index.

So in this way we can keep going iteratively throwing away half the index

each time for each iteration.

And eventually we can hone in on exactly the term that we are looking for.

So this is called binary search.

So let's see an example using our index.

Let's say this is our pattern p from which we extract the 3-mer TGG.

And first we take TGG and

compare it to the 3-mer that's in the middle of the index.

The one that's alphabetically in the middle which in this case is GTG.

Our query is alphabetically greater than that.

TGG comes after GTG.

So we can ignore in the index, every entry up to and

including GTG, and we've effectively divided the problem in half.

And each time we divide the problem in half in this way,

we call this a bisection.

Session. So what do we do next?

We do the same thing again, but on the remaining entries of the index.

So we can compare TGG with the middle k-mer here which is TGC.

And TGG comes after so we can bisect again.

We can throw out the first half.

And finally, we can compare TGG to the middle element that we have here, but

there's only two elements left.

So let's just say we compare TGG to the first one and we find that that's a match.

So that match corresponds to an index hit.

We wanted to find offsets where TGG occurs.

We've now found the first one in the index, and it occurs at offset 7.

So, the total number of bisections that we need to perform

in order to find our key in the index is approximately

equal to the logarithm base 2 of the number of keys in the index.

Why the logarithm base two?

Well, because for each of our bisections,

we're repeatedly dividing the problem in two.

So we'll implement this idea and some related ideas in Python later.

But before we move on, let me make a final point, which is that Python actually

provides a set of functions related to binary search that are useful for us here.

So these functions are all in a Python module that's called bisect.

And a Python module if you haven't encountered it before is just a collection

of related Python functions and classes.

And this module is about binary search.

So one function in this module is called bisect_left and

this function takes two parameters.

The first parameter a is a list that's already sorted.

It's already ordered in ascending order.

So if this is a list of strings then the strings should already be in

alphabetical order.

Or if this is a list of integers then they should be in ascending order.

And the second parameter x is an item and the function returns the offset where

the item x can be inserted into the list a while maintaining the order of the list.

So it's the leftmost position where x can be inserted in a such that

a is still in order after that insertion occurs.

So, bisect left is a useful function to us.

In this example we call bisect left with the parameter a,

which is this list up here.

And then the argument 2 and so bisect left is going to return the left

most index where we can insert 2 into this list such that the list is still in order.

And that's offset one, right here, so if we wanted to insert 2 into this list and

have this list still be in order we would put it right here between this 1 and

the first 3.

We would stick it right here and then we would shift all these entries over by one.

Here we're calling bisect_left with the parameters a and 4.

Bisect_left in this case returns the offset 3.

That means if we wanted to stick 4 in this list,

we would do it between here and here.

So we'd do it between the 3 and the 6.

And then in this final example we're calling bisect_left with a and

the parameter 8.

And it's telling us that if we want to insert a,

8 into this list, then the leftmost place we can insert it such that the list is

still in sorted order is at offset four.

Now we could have also inserted it here at offset five or here at offset six.

And the list would still be in sorted order, but

bisect_left is always going to return the leftmost offset where we can insert it so

that the list is still in sorted order.

So this bisect left function is exactly what we need in order to do queries of

this index data structure that we built.

So for example if we're querying with this pattern here and

we take some 3-mer from the pattern, let's say we take the threemer GTG.

We can use the bisect left function on our index and the parameter GTG.

And it will return the offset of the first position where we could insert GTG

while maintaining sorted order of the list.

And that corresponds to this entry right here,

which is the first entry that has GTG as its key.

So if we wanted to find all the places where GTG occurs,

we could use bisect_left.

It would point as here, and then we could look up this offset and then, keep going.

Find another GTG and find its offset.

And then, keep going and find another GTG and find its offset.

So we would report that GTG occurs at offsets zero, four and six.