对如何在科学的

Loading...

来自 University of California San Diego 的课程

当生物学遇上编程：生物信息学入门

460 个评分

对如何在科学的

从本节课中

Week 3

Which DNA Patterns Play the Role of Molecular Clocks? (Part 1)

- Pavel PevznerProfessor

Department of Computer Science and Engineering - Phillip CompeauVisiting Researcher

Department of Computer Science & Engineering

In the last section,

we saw that the motif enumeration algorithm that we proposed had a flaw:

It assumed that the motif that we are looking for

must occur in every string in the collection of strings under consideration,

which is not necessarily true in biological applications.

As an alternative, our plan is to assign a score to a given collection of k-mers

chosen from the strings.

So, say that we have collected such a set of k-mers,

which we call Motifs, represented here as a matrix.

We indicate the most frequent nucleotide in each column of this matrix

by a capital letter, breaking ties arbitrarily.

So note that positions two and three in the motif matrix are the most conserved.

Nucleotide G occurs in these positions in every k-mer,

whereas position ten, at the end of the motif matrix, is the least conserved.

A common method of visualizing a motif matrix is with a motif logo, in which

the size of the nucleotide correlates with how often it occurs in the motif matrix.

Taking the capital letter in each column, construct a consensus string for

this particular choice of k-mers that will also provide us with a candidate motif.

Our goal then is to choose k-mers from the underlying strings, so

that they produce the most conserved motif matrix.

Then, simply take the consensus string as our desired biological motif.

We therefore are going to define the score of the motif matrix

as simply the number of lowercase letters in the matrix.

Counting the number of lowercase letters column by column,

we see that the score for this choice of motifs is equal to 30.

We can now restate our goal as choosing a collection of k-mers

from the underlying strings in order to minimize this score.

So we now have a more appropriate computational problem,

which we call the Motif Finding Problem.

A brute force algorithm for the Motif Finding Problem would simply

consider every possible choice of k-mers to form a motif matrix and

then simply take a collection of k-mers having the lowest score.

But, as with any brute force approach,

we need to ask ourselves how fast this algorithm is.

Assuming that there are, say, t strings under consideration, all of length n,

then there are (n-k+1)^t such possibilities for

how to form a motif matrix.

Then, scoring the matrix requires k*t steps.

So, assuming that k is much smaller than n, as is the case in practice,

the overall running time of this approach is approximately O(n^t * k*t).

Now, the n^t term means that this algorithm is useless for

practical values of n and t, and so we're going to need a faster algorithm.

So, to improve on our existing brute force approach,

we will rethink the motif finding problem.

Specifically, I've noted here that, when we choose the motif matrix,

it automatically gives us a consensus string.

Our new approach will be to choose the consensus string first,

then look for a motif matrix that scores best against this consensus.

How do we achieve this?

Well, until now, we have been scoring the motif matrix column-by-column.

However, we could just as easily compute the number of lower case symbols on

the motif matrix row-by-row.

I want you to note that the number of lower case symbols in a row

is equal to the number of symbols in that row that don't match the consensus,

which we know as the Hamming Distance between that row and the consensus string.

As a result, the score of the motifs matrix is just the sum

Hamming Distances between each string in the matrix and the consensus string,

which we're going to write as d(Consensus(Motifs), Motifs).

This discussion now leads us to the equivalent Motif Finding Problem,

in which we are looking for a pattern that will serve as our consensus string,

i.e., a pattern that minimizes the distance to motifs

over all choices of motifs in the underlying strings.

It may still seem that we have not made our work any easier though,

since we are now looking for a pattern in addition to a collection of motifs,

whereas before we just looked for a collection of motifs.

However, we will see in the next section that we don't need to search through every

possible choice of motifs in order to find an optimal consensus string,

which is what we're looking for.