我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

279 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

DNA sequencing, strings and matching

This module we begin our exploration of algorithms for analyzing DNA sequencing data. We'll discuss DNA sequencing technology, its past and present, and how it works.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

Let's formulate the first real computational problem that we'll address

on our journey studying the read alignment problem.

So it's called the exact matching problem.

We'd like to find all the offsets where some pattern string, which we'll call P,

occurs within a longer string, the text, which we'll call T.

The pattern P is like the needle and

the text T is like the haystack that we're looking in.

So in this example, we have a pattern P, which is word.

And we have a text T, which is there would have been a time for such a word.

And there's one occurrence of the pattern all the way there at the end,

which is at offset 40.

So more specifically,

the leftmost character of T that is involved in that match is at offset 40.

So we say that the match is at offset 40.

And that's the only place where the pattern occurs in this text,

in this example.

So in a way, this problem is a simplified version of the read alignment problem.

It's a bit too simple what as we'll see later.

We'll this in lecture and in practical sessions.

But it's a great place for us to start.

So here's a Python version of the previous example.

We have our text T, we assign that to our variable t.

And we call a string method called find.

And what find will do is, you give it an argument, which is a string,

and it'll return the offset of the leftmost occurrence of

that argument inside the string.

So in this case, the string that we're calling the method on is our text and

the argument that we're passing is our pattern.

And Python reports that the leftmost occurrence is at offset 40.

So if we want to find all the places where the pattern occurs, not just the leftmost

place, we have to do something a little more complicated than this,

but you get the idea.

It's pretty easy to use Python to do this.

But let's not use Python's built-in functionality.

Let's implement exact matching ourselves.

So how do we do it?

And let's not try to be clever at first.

Let's just try to solve the problem in the most straightforward way possible.

So here's a suggestion.

Let's try every possible way of lining up P with T.

So, for each of these ways that we could possibly line P up with T, we're

going to see whether all the characters of P match the corresponding characters of T.

And if they do, then that's a match.

If one or more of the characters differ between P and T,

if there's at least one mismatch, then that offset is not a match.

That's a pretty simple idea.

We're just trying every possible offset and

we're checking whether P matches T at that offset.

So here's an implementation of that idea in Python.

We'll see it again in a practical session that follows this.

This function is called naive,

because this algorithm is sometimes called the naive exact matching algorithm.

And it takes two arguments, the pattern P and

the text T, and we will initialize a list occurrences to be empty.

It's an empty list at first.

We'll eventually fill it with all of the offsets where P matches T.

Now we reach this outer loop here.

Which iterates over all possible alignments,

all possible ways of lining up the characters of P opposite characters of T.

And then, in this inner loop,

we have to determine whether the alignment corresponds to a match or not.

And this requires that we compare the characters of

P to the corresponding characters of T.

So the inner loop iterates over characters of P from left to right.

And for each one it tests to see whether it's equal to

the corresponding character in the text T.

If it is not equal, then we know that P does not match T at this offset.

So in that case, we set this match variable to false.

And we break out of the inner loop.

We can break because once we know that one character mismatches,

we don't have to do anymore character comparisons.

We know that it's not a match.

So we can break out of that inner loop, and that brings us here.

And only if we make it through all the iterations

of the inner loop without ever setting the match variable to false.

In other words, only if all the character comparisons were matches,

do we go ahead and append this offset to the list of occurrences.

And then at the end of the day we return the occurrences list.

So that's it.

Two nested loops.

The outer loop iterates over alignments in left to right order.

And the inner loop iterates over character comparisons for

a given alignment also in left to right order.

So for the next few slides I'm going to ask some questions.

These aren't quiz questions, but I do recommend that once you hear the question,

you might want to pause the video and

try to come up with your own answer to the question before unpausing it.

So let's say that we have a pattern and a text.

And the pattern has a length x, and the text has length y.

So in this case, how many times, in terms of x and y,

how many times will we iterate through the outer loop?

In other words, how many different alignments will we examine?

So the answer is y minus x plus 1.

Here's a related question.

So given a pattern of length x and a text of length y, what is

the greatest total number of character comparisons we could possibly do?

What's the greatest number of times that we can possibly go through,

iterate through, the inner loop of our nested pair of loops.

The answer is y minus x plus 1, all times x.

The y minus x plus 1 again represents the number of possible alignments of

the pattern to the text.

And for each alignment, the most character comparisons we can do is x,

the number of characters in P.

So we have to take their product, x times y minus x plus 1.

This is a kind of worst case analysis.

All right, this case seems awfully bad.

But when would it actually happen?

When would this sort of situation happen?

Well the worst case scenario looks like this.

This is a scenario where every character of P matches every character of T.

So every character comparison results in a match, and

therefore we go through the inner loop the maximum number of times.

Okay, that seems like a pretty rare case, but

it's good to keep in mind that that's the worst case.

Here's another question.

What's the least number of character comparisons that are possible?

So here the answer is y minus x plus one.

Which again is the number of possible alignments of P to T.

Why is that the answer?

Well, for each alignment we're going to do at least one character comparison.

But the very first character comparison could be a mismatch.

In which case, we immediately break from the inner loop after having done

only one character comparison.

So if this happens,

then we're doing exactly one character comparison per alignment.

So the number of character comparisons equals the number of alignments.

So when would this happen, and what sort of a scenario would, for

what kind of pattern and text would this sort of thing happen?

Well this is an example of a case where that would happen.

So in this example, the first character of the pattern

is a character that doesn't occur anywhere in the text.

So the first character of the pattern is a, and yet the text consists of all b's.

So in this case, every time we reach the inner loop, the very first character

comparison is going to be a mismatch and we'll break out of the inner loop.

Okay, so here's a particular example.

Here's a P and a T.

The question is, how many character comparisons do we do in this example?

I'm looking for a number.

This is not in terms of x and y, but

an actual number of character comparisons that we would do in this example.

Well one thing you'll notice is that most of the alignments will mismatch

immediately.

Right? It's not hard to see because

the pattern begins with a w, and yet most of the characters in the text T are not w.

So for most of the alignments the very first character comparison

will be a mismatch.

There are two w's in the text, and so there are two special alignments, and

for those two alignments we'll have some matches.

And if you add it all up, and I've highlighted in red here all of the places

where we mismatch, and in green all of the places where we match.

If you add up all of those matches and mismatches,

we end up with a total of 46 character comparisons.

40 of them are mismatched, as in 6 of them are matches.

So we can also write down the minimum and the maximum possible number of character

comparisons given the length of the pattern and the length of the text.

The minimum is 41.

The maximum is 164.

And so one thing that you'll notice about the number of character comparisons

that we do is that it's much closer to the minimum than it is to the maximum.

All right, so one positive thing that we can say about this algorithm is that in

practice, or at least in this example, but it's true for

many other examples too, the number of character comparisons that we do is a lot

closer to the minimum than it is to the maximum