对如何在科学的

Loading...

来自 University of California San Diego 的课程

当生物学遇上编程：生物信息学入门

450 个评分

对如何在科学的

从本节课中

Week 4

Which DNA Patterns Play the Role of Molecular Clocks? (Part 2)

- Pavel PevznerProfessor

Department of Computer Science and Engineering - Phillip CompeauVisiting Researcher

Department of Computer Science & Engineering

[MUSIC]

You may be surprised but the most popular

algorithms for finding motifs in DNA sequences, roll dice

and take chances to find such motifs. They're called randomized algorithms.

How do they work?

We have now learned how to construct Profile given

Motifs, but can we construct Motifs given Profile?

We can, because given an arbitrary Profile and a set of sequences Dna,

we can construct the set Motifs(Profile,DNA) by simply finding

Profile-most probable k-mers in each sequence from Dna.

And therefore, we can further iterate this process by finding profile of motifs,

motifs of profiles, profile of motifs, until our motifs keep improving.

So this is the key idea of the algorithm called RandomizedMotifSearch.

We simply iterate and see whether the score of motif continues

to improve. And stop when it doesn't improve anymore.

The only question about this algorithm is

how in the world it can find anything useful.

Think about this, we started from a completely random motifs.

If the motifs are completely random,

and if they're taken from random sequences,

then presumably. expected probability of every element

in the profile matrix is 1/4.

There is absolutely no information in such a motif.

How possibly it can lead us to the correct motif in the sequence?

Let's just keep our fingers crossed and see how this approach works on an example.

Here's my example.

These are very short DNA strings and I implanted

4-nucleotide long motifs (one motif on each string). They're shown in blue, so

ACGT is implanted in every sequence with a single mutation.

So how randomized motif search will proceed?

First, it will randomly select motifs shown in bold here.

And, of course, it has missed most of the motifs.

They are randomly selected.

Afterwards, it will form Motifs matrix, shown here on

the slide, and afterwards it will construct Profile(Motifs).

Using Profile(Motifs) it will calculate the probability

of every k-mer in the set of strings,

shown here, and will simply select the most probable k-mers in every string.

The results are in this matrix, and I should probably start iterating.

But if you look carefully at this matrix, I don't have to iterate.

It already found all blue inserted motifs right there.

It almost sounds like a magic

trick, because we started from something completely random

and, in a single iteration, found our implanted motifs!

Well, what happened?

So let's try to think about this.

When we form k-mers randomly, it results in a uniform expected profile,

as we discussed before. But strings in Dna are not random

because w have inserted implanted patterns

in our Dna strings. Therefore, this implanted motif

resulted in a biased profile; it's not a completely random profile.

Where does this statistical bias point to? If it points

in the direction of the implanted motif, then there is no miracle

that we start from something random (which is not really random),

and, applying iterative procedure, arrived to a hidden part.

So, let's see how it actually works.

I return back to the case that we considered before.

You saw that when we randomly selected

motifs, in the first sequence, we missed the motif (the correct, the blue one).

In the second sequence, we also missed it.

In the third sequnce, we missed it.

In the fourth sequence, we missed it.

But in the fifth one, simply by pure chance, we actually found it.

It's not a big deal.

There is always a probability that we simply by chance

select one of the implanted motifs.

But it results in a biased profile.

If you look at the profiles that

resulted from this implantation, then you see that the largest elements

in this profile actually correspond to A, C, G and T

of our implanted pattern ACGT. Therefore, even

a single correctly found k-mer in

the set of Dna strings creates a bias in the profile, and this bias

leads us to the correct sequence. Of course, it doesn't always happen.

It depends on how we have chosen initial

set of randomly selected k-mers in the sequence.

But if we try this algorithm, a thousand or

a million times, there is a good chance that

eventually one of these attempts will bring us to

the correct motif. Even if this procedure sounds ridiculous.

We now move to the next topic: "How do bacteria hibernate?"

Our goal is to apply our motifs finding algorithm to real biological problems.

To figure out how our algorithms work on

real biological data, we will start by analyzing tuberculosis.

Tuberculosis is a deadly disease that

killed quarter of population of Europe in the 19th century.

Unfortunately, humanity is not done with tuberculosis yet since

tuberculosis strains that resist to all antibiotics are now emerging.

Tuberculosis is caused by

Mycobacterium tuberculosis also called MTB.

An interesting thing about MTB

is that it can persist in for decades in humans

and one third of the world population is affected by MTB.

Of course, it usually doesn't lead to tuberculosis

nut, in some cases. bacteria switches from its latent

state to the active state and it's when the tuberculosis begins.

The question that we will be asking is how

MTB switches from the latent to the active state and how

does it survive during latency? MTB has an ability

to form dormant spores where metabolism is shut down.

In such sporulated form it can survive in

very tough conditions, for example, in conditions of food shortage or

oxygen shortage. And oxygen shortage, called hypoxia, is

associated with these latent forms of tuberculosis.

So, how can we figure out what makes

MTB switch into the latent state and return back to

the active states?

Biologists have performed a DNA array experiment and identified

roughly 25 genes that are activated in the hypoxic condition.

So when bacteria is deprived from

oxygen, these gene dramatically change their expression.

And our goal today is to discover what regulatory

motif controls these genes when they're activated or repressed.

We now have quite an arsenal of algorithms for finding motifs.

Let's try them.

But the first problem is that we have no clue of what size of motif to try.

Should we try k equal to 8 or k equal to 20? Let's try all of them

using median string problem and randomized motif search.

And you will see that, for different k, we are getting quite different results.

It is not clear which one to choose.

for median string, we actually will have

difficulties going beyond k=12 because

it is an exponential algorithm whose running time grows as four to the power k.

But for randomized motif search we can search

for very long motif and, for k=20,

we actually find this rather long motif.

You can see that there are similarities

Between the 12-mer found by the Median String.

the 12-mer found by the RandomizedMotifSearch,

and the 20-mer found by the RandomizedMotifSearch. Which one is correct?

To answer this

question, we need to introduce the notion of the motif entropy.

And entropy of motif is defined in the following way.

First, we construct the profile matrix.

Every column of the profile matrix correspond to a biased four-sided dice.

Or, in other words, to a probability distribution, (p1, p2, ...p_n).

Sum of p_i is equal

to 1, of course. And entropy is simply defined as

- p_i * log (p_i)

and summed up over all indices i. For example, we can compute the entropy of

the first column, and it will be 1.147.

We can compare it with the minimal

possible entropy, which corresponds to the extremely conserved motif

(where all symbols in the column are the same),

and with the maximal entropy

(where symbola in the column are equally likely).

So the entropy of a motif is defined as the sum of the entropies of its columns.

And now, the moment of truth.

Let's see how the algorithms that we developed

find out whether the motifs that we find are close to the biological one.

Here is the motif logo of the dormancy survival regulator site.

The motif logo is constructed in such a way that the total length of four

nucleotides in the motif logo correspond to two minus entropy.

So for completely conserved column, like nucleotide G

in this motif, we will have height two.

It is the most conserved columns of the motif.

The relative sizes of four nucleotide in

each position correspond to frequencies of these nucleotides in the column.

This is an example of

motif logo.

Now, let's see how it works for the tuberculosis example.

You can see that our randomized motif

search has actually captured a good portion of the motif.

But you can also see that we missed in some places.

We need to work harder and to figure

out what to do to develop better regulatory motif finding algorithms.

We move next to another randomized algorithm called Gibbs sampler.