我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

278 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

We're now at the point where we can formulate a computational problem

that when we solve it, will in turn solve the genome assembly problem.

This first formulation won't be perfect, but

it'll be a good starting point for our discussion of the assembly problem.

Kind of in the same way that the naive exact matching problem wasn't perfect, but

it was good starting point for our discussion of read alignment.

So, the computational problem is called the shortest common superstring problem,

which will sometimes abbreviate as SCS.

So the shortest common superstring problem is this.

Given a collection of input strings, a set of input strings, we'd like to find

the shortest string that contains all of our input strings, as substrings.

So for example, here I have some strings and

I'd like to know their shortest common superstring,

the shortest string that contains each of these strings as a substring.

Well, if we didn't have the requirement that the superstring be the shortest

superstring, then this problem would be easy,

we could simply concatenate all the strings in the set S.

So this is a string that contains all the input strings as substrings but

it's not the shortest.

So, it turns out this is the shortest common superstring,

so this is a string that contains all the input strings as substrings and

there is no shorter string that does this, that contains all the input strings.

So let's say for a moment that we have an algorithm for finding the shortest common

superstring of a set of strings, and we'll discuss such an algorithm soon.

So if we took our sequencing reads and we gave them to this algorithm, and

asked the algorithm to find their shortest common superstring,

then the solution to the problem, the shortest common superstring problem,

would also be an assembly of the genome.

It would be a reconstruction of the original genome sequence.

So let's look at an example.

Here's our overlap graph from the previous lecture.

So this graph is built from a synthetic data set that

contains the reads that we made by taking all the 6-mers of the genome.

And let's instead take all these 6-mers and then throw them

into an algorithm that solves the shortest common superstring problem.

So here I have some Python code that calls a function called scs which we haven't

implemented yet but we will implement it in an upcoming practical.

And we pass in a list, a Python list of strings and

those strings are the read sequences.

And then what we get back is the shortest common superstring which is shown here.

And you can see that this shortest common superstring is in fact

equal to the genome that we derive the reads from.

Okay, so when we do that, we get the correct answer so

we should be pretty pleased with our progress so far.

We've formulated a computational problem that seems to correspond very closely to

the assembly problem that we would like to solve.

Furthermore, this computational formulation has a nice feature,

it's going to find us the shortest common superstring,

not just any common superstring, but the shortest one.

And in some sense, this is like the most, the simplest explanation, or

the most parsimonious explanation for the set of input reads that we started with.

And that seems like a good thing, that sounds like a good property to have.

Well, so now, unfortunately,

the shortest common superstring also has some substantial downsides.

So, we'll look at these and some potential solutions in some upcoming lectures, so

let's look at the first downside.

The first down side to this problem is that it is just not tractable.

There are no efficient algorithms for solving it.

So the technical term is that the shortest common superstring problem is NP-complete.

And NP-complete problems are very, very unlikely to have an efficient solution,

this doesn't mean it can't be solved.

In fact, we'll see an algorithm for solving it pretty soon.

But it's not going to be very fast,

and as the number of input strings grows, it's going to get slow very, very quickly.

So let's see an outline for the sort of algorithm that we can use

to solve the shortest common superstring problem.

So let's say we have some input strings and

assume for a moment that they're all the same length.

And what we're going to do is,

we're going to put them in some order, like this.

Maybe these are our input strings and

this is the first order we're going to try to put them in.

This happens to be alphabetical order.

And then, we're going to, for every adjacent pair of strings in

this ordered list we're going to find the longest overlap between them.

So that is, we're going to find the longest suffix of the string on the left

that matches a prefix of the string on the right.

And then we are going to glue them together accordingly.

We're going to merge them together accordingly.

So for example, these first two strings in our

ordered list overlap by two characters.

The suffix AA from the first string overlaps the prefix AA from

the second string.

And there are no longer overlaps between these strings so

we glue them together like this.

So we started with the first string, AAA.

And then we tacked on just the B from the second string.

We ignored the overlaps characters because those are already in our superstring that

we're building, and we just added on the B from the second string.

Okay, so now, let's look at the second pair.

And again, they have an overlap of length two because there's a suffix AB for

the string on the left and there's a prefix AB for the string on the right.

So this time, we're going to concatenate just the A from the second string.

So we're concatenating just one more A onto our superstring that we're building.

And we can keep going like this.

So for this third pair, the overlap is one, right?

Because the suffix A matches the prefix A.

So we're going to add on BB to the end of our superstring so far and so on.

We'll just keep doing this for every adjacent pair of strings until we've done

all those pairs, and at the end of the day, we have a candidate superstring.

It's not necessarily the shortest common superstring, but it might be.

If we pick the right ordering then it will be the shortest common superstring.

So for a given ordering, that's how we produce the candidate superstring.

And again, it's not necessarily the shortest.

And for different orderings we'll get different superstrings.

So, for example, here's another kind of order that we could put the strings in.

It's just a slightly different order than the first one.

And then through the same procedure we'll produce a superstring.

And, lo and behold, you can see that it's shorter than the first superstring by

I guess that is four characters, it's shorter than the first superstring.

So if we want to find the absolute shortest common superstring

we're going to have to try all the ordering.

So that's our algorithm for every possible way of ordering the N input strings, for

every permutations of the n input strings.

We're going to glue adjacent pairs of

strings together according to their maximal overlap.

And then, the shortest common superstring that we get over all those orderings

is the shortest common superstring overall.

So why is this so slow?

Why do I say that this problem is intractable?

Well, this particular algorithm is slow because the number of possible

orderings that we have to try is equal to the number of input strings,

let's say that's n, that's equal to n factorial,

because that's how many different permutations of n there are.

So n factorial is a quantity that grows very quickly as n grows.

As the number of input strings grows, the amount of work that we have to do,

the amount of permutations we have to try grows very, very rapidly.

That's why we call this problem an intractable problem.

So, in the coming practical session, we'll implement this idea and

then in the following lecture we'll talk about one way that we can speed things up

quite a bit, though, kind of phrase.