我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

Loading...

来自 约翰霍普金斯大学 的课程

DNA 测序算法

280 评分

我们将学会对 DNA 测序数据进行分析的计算方法——算法和数据结构。我们将学习一点DNA、基因组、以及 DNA 测序的应用的相关知识。我们将使用 Python 实现关键算法和数据结构，并分析实际的基因组和 DNA 测序数据集。

从本节课中

Algorithms for assembly

In the last module we began our discussion of the assembly problem and we saw a couple basic principles behind it. In this module, we'll learn a few ways to solve the alignment problem.

- Ben Langmead, PhDAssistant Professor

Computer Science - Jacob Pritt

Department of Computer Science

In this practical, we'll be implementing the shortest common super string using

just a brute force method to check all the possible combinations of reads.

So to do this, we're going to use the overlap

function from a previous practical, so I've already pasted that in here.

We're also going to use the permutations function which we

saw in a previous practical.

So I'm going to import itertools.

Now I'm going to define a function, scs,

that will give me the shortest common superstring of a set of strings, ss.

And so I'm going to start by just defining my shortest super string just as none.

I'm going to say for each set of strings, for

each permutation of set of strings,

in itertools.permutations(ss).

So, when I don't have a second argument in permutations,

it will give me all of the complete permutations of this list.

So this will return every possible ordering of the reads in this list.

And so what I'm going to do is loop through every one of these permutations

and try to put these reads together in order, in the order that I have,

and see what the length of that superstring is.

And we'll just keep doing that until I get the shortest one that I can find.

So I have this permutation of strings.

My superstring is going to start out by just being the first string in that list.

And now I'm just going to loop through every other string in that list and

append them onto this.

So I'm going to say for i in range up to the number of

strings minus 1, since we've already had the first string, we can skip that one.

I"m going to find the overlap length between the current string and

the next one.

Trying to get the ith string, the i plus one string.

I'm going to set the min_length to be 1.

And now, I'm going to take my superstring and append onto it,

the part of the next string that doesn't overlap that.

So that's going to be the suffix from the overlap length to the end, just like that.

And so now once I've finished this I've concatenated all these strings,

overlapping them as much as I can to get the superstring.

So now I'm going to test if this super string is smaller than the shortest one

I've seen so far.

So I'm going to say if my shortest super string

is none, or the length of my current super string is less than

the length of the shortest super string,

then I'm just going to replace the shortest superstring with my superstring.

And then when I'm done I just return the shortest superstring that I see.

>> So that loop is going to loop a large number of times.

If we add the n strings it's going to loop n factorial times.

But one of those iterations is going to give us the shortest

possible super string, so we'll return that and that's the right answer.

>> Mm-hm.

So now I can just remember this function on the set of strings.

And so I have my strings here.

This is the shortest common super string.

You can see the third string here

overlaps by four characters with this string here, so it goes first.

And then this first string overlaps the second string by about four characters.

So it's combined them like that, and

given us the shortest superstring that we can get from these reads.

>> Mm-hm.