对如何在科学的

Loading...

来自 University of California San Diego 的课程

当生物学遇上编程：生物信息学入门

478 个评分

对如何在科学的

从本节课中

Week 3

Which DNA Patterns Play the Role of Molecular Clocks? (Part 1)

- Pavel PevznerProfessor

Department of Computer Science and Engineering - Phillip CompeauVisiting Researcher

Department of Computer Science & Engineering

[MUSIC]

I am now introducing the median string problem, and

to define this problem, we need to define the notion

of distance not only between k-mers of the same size,

but also the distance between a k-mer and the longer string.

So what is this distance between two strings of different lengths.

We simply start from a smaller string (a k-mer) and compare it to the first k-mer

in the text.

Distance in this case is seven, we move further. distance

six, continue, continue, continue until the end of the string.

And then, return to the place where the distance was minimal.

So in this way, we will find a k-mer within a longer string that is most close

to our k-mer of interest, and this is the distance between a pattern and a string.

What is the distance between a k-mer and a set of strings.

In this case, it's simply the sum of distances

between a pattern and each string in the set.

For example, for a pattern AAA and a set of strings shown on the slide,

we can compute distance with every string and simply sum up all the distances.

The result

is five. And median string for the set of strings

Dna is a k-mer pattern minimizing distance between pattern

and DNA over all possible k-mers.

And our median string problem will be based on the

problem we formulated before, which is yet another equivalent motif finding problem.

If you remember this problem, then it looks like

even more complicated than the original motif finding problem,

because in this problem, we need to find

a k-mer and a set of motifs satisfying certain properties.

But in fact, the only thing we need to find

is a k-mer pattern as shown in this median string problem.

In this case, we search for a k-mer

pattern minimizing distance between this pattern and the

set of strings Dna (among all possible k-mers).

Now, there is a very simple algorithm for solving this problem.

We simply need to try all possible k-mers and see whether the

distance between the pattern and DNA is minimal

among all possible k-mer.

The running time of this algorithm is 4 to the power k times n*t*k, where

we analyze

t sequences of length n. So what we have just accomplished?

We started from motif finding problem, and the brute

force algorithm for solving this problem was incredibly slow.

We've turned it into yet another equivalent motif finding problem,

and through this problem, we were able to switch to the median string problem

with this very different run time.

In fact, it's still an exponential algorithm

but for practical applications (where k is usually less than 15)

we still can run this algorithm. So dramatic improvement on the Motif Finding

Problem that we started from. So sometimes, change of perspective helps.

And to move further,

we will notice that although the Median String

Problem is much faster than the Motif Finding Problem,

it's still slow if we search for

long motifs

and little bit later we will figure out how to deal with such motifs.

Our last topic in this segment is Greedy Motif Search.

We'll now talk about a greedy algorithm, for solving the Motif Finding Problem.

Given a set of motifs,

we have already learned how to construct the consensus string.

Now let's construct the count matrix

where in every column we simply have counts for all nucleotides.

And also. this count matrix can be transformed into the so-called

profile matrix where we have frequencies of nucleotides in every column.

Now, these frequencies of nucleotides in every column

can be viewed as a four-sided, biased dice,

representing probabilities of ending up on a given face of a dice.

And we can view it as a probability distribution.

And if it is a die, let's ask a question: "What is the probability that tossing

such a die based on the motifs will generate a given string of DNA?"

For example,

what will be the probability for generating a string shown at this light?

So given the following profile, we ask, what is

the probability of generating the consensus string, AAACCT?

And this probability will be simply computed by

multiplication of corresponding elements in the profile matrix.

If you take

another string, we will get a different probability.

And you have already noticed that the closer the string is to the consensus

string of the profile, the higher is the probability of generating this string.

And now, we define the notion of Profile-most probable k-mer in a sequence.

This is a k-mer with the

highest probability among all k-mers in the string.

And here's an example of how I can

generate Profile-most probable k-mer in a string.

We start from the first k-mer in the string, compute probability,

and record it in the last column of the matrix.

We continue until we fill all the matrix, and

finally, select the largest probability, which in this case, corresponds to the

red entry in the matrix. And with this at hand,

we are now ready for GreedyMotifSearch, which works in a very

simple way. We start from the set of i-1 motifs

selected from the first i-1 sequences

and show how to extend this set

of motifs by the motif in the i-th sequence.

What we do, we form profile of motif from the first i-1 1 sequences

and then select profile most probable k-mer in the sequence number i.

Afterwards, we iterate.

And that will the result in the greedy algorithm for

solving the problem.

And we will see later whether it works or doesn't work for our goals.