Optional: How Rolling Dice Helps Us Find Regulatory Motifs (Part 1) (12:43)
支持 HTML5 视频
的 Web 浏览器
来自 美国加州大学圣地亚哥分校 的课程
Which DNA Patterns Play the Role of Molecular Clocks? (Part 2)
Optional: How Rolling Dice Helps Us Find Regulatory Motifs (Part 1) (12:43)
Optional: How Rolling Dice Helps Us Find Regulatory Motifs (Part 2) (05:37)
Optional: How Rolling Dice Helps Us Find Regulatory Motifs (Part 3) (07:46)
Department of Computer Science and Engineering
Department of Computer Science & Engineering
You may be surprised but the most popular
algorithms for finding motifs in DNA sequences, roll dice
and take chances to find such motifs. They're called randomized algorithms.
How do they work?
We have now learned how to construct Profile given
Motifs, but can we construct Motifs given Profile?
We can, because given an arbitrary Profile and a set of sequences Dna,
we can construct the set Motifs(Profile,DNA) by simply finding
Profile-most probable k-mers in each sequence from Dna.
And therefore, we can further iterate this process by finding profile of motifs,
motifs of profiles, profile of motifs, until our motifs keep improving.
So this is the key idea of the algorithm called RandomizedMotifSearch.
We simply iterate and see whether the score of motif continues
to improve. And stop when it doesn't improve anymore.
The only question about this algorithm is
how in the world it can find anything useful.
Think about this, we started from a completely random motifs.
If the motifs are completely random,
and if they're taken from random sequences,
then presumably. expected probability of every element
in the profile matrix is 1/4.
There is absolutely no information in such a motif.
How possibly it can lead us to the correct motif in the sequence?
Let's just keep our fingers crossed and see how this approach works on an example.
Here's my example.
These are very short DNA strings and I implanted
4-nucleotide long motifs (one motif on each string). They're shown in blue, so
ACGT is implanted in every sequence with a single mutation.
So how randomized motif search will proceed?
First, it will randomly select motifs shown in bold here.
And, of course, it has missed most of the motifs.
They are randomly selected.
Afterwards, it will form Motifs matrix, shown here on
the slide, and afterwards it will construct Profile(Motifs).
Using Profile(Motifs) it will calculate the probability
of every k-mer in the set of strings,
shown here, and will simply select the most probable k-mers in every string.
The results are in this matrix, and I should probably start iterating.
But if you look carefully at this matrix, I don't have to iterate.
It already found all blue inserted motifs right there.
It almost sounds like a magic
trick, because we started from something completely random
and, in a single iteration, found our implanted motifs!
Well, what happened?
So let's try to think about this.
When we form k-mers randomly, it results in a uniform expected profile,
as we discussed before. But strings in Dna are not random
because w have inserted implanted patterns
in our Dna strings. Therefore, this implanted motif
resulted in a biased profile; it's not a completely random profile.
Where does this statistical bias point to? If it points
in the direction of the implanted motif, then there is no miracle
that we start from something random (which is not really random),
and, applying iterative procedure, arrived to a hidden part.
So, let's see how it actually works.
I return back to the case that we considered before.
You saw that when we randomly selected
motifs, in the first sequence, we missed the motif (the correct, the blue one).
In the second sequence, we also missed it.
In the third sequnce, we missed it.
In the fourth sequence, we missed it.
But in the fifth one, simply by pure chance, we actually found it.
It's not a big deal.
There is always a probability that we simply by chance
select one of the implanted motifs.
But it results in a biased profile.
If you look at the profiles that
resulted from this implantation, then you see that the largest elements
in this profile actually correspond to A, C, G and T
of our implanted pattern ACGT. Therefore, even
a single correctly found k-mer in
the set of Dna strings creates a bias in the profile, and this bias
leads us to the correct sequence. Of course, it doesn't always happen.
It depends on how we have chosen initial
set of randomly selected k-mers in the sequence.
But if we try this algorithm, a thousand or
a million times, there is a good chance that
eventually one of these attempts will bring us to
the correct motif. Even if this procedure sounds ridiculous.
We now move to the next topic: "How do bacteria hibernate?"
Our goal is to apply our motifs finding algorithm to real biological problems.
To figure out how our algorithms work on
real biological data, we will start by analyzing tuberculosis.
Tuberculosis is a deadly disease that
killed quarter of population of Europe in the 19th century.
Unfortunately, humanity is not done with tuberculosis yet since
tuberculosis strains that resist to all antibiotics are now emerging.
Tuberculosis is caused by
Mycobacterium tuberculosis also called MTB.
An interesting thing about MTB
is that it can persist in for decades in humans
and one third of the world population is affected by MTB.
Of course, it usually doesn't lead to tuberculosis
nut, in some cases. bacteria switches from its latent
state to the active state and it's when the tuberculosis begins.
The question that we will be asking is how
MTB switches from the latent to the active state and how
does it survive during latency? MTB has an ability
to form dormant spores where metabolism is shut down.
In such sporulated form it can survive in
very tough conditions, for example, in conditions of food shortage or
oxygen shortage. And oxygen shortage, called hypoxia, is
associated with these latent forms of tuberculosis.
So, how can we figure out what makes
MTB switch into the latent state and return back to
the active states?
Biologists have performed a DNA array experiment and identified
roughly 25 genes that are activated in the hypoxic condition.
So when bacteria is deprived from
oxygen, these gene dramatically change their expression.
And our goal today is to discover what regulatory
motif controls these genes when they're activated or repressed.
We now have quite an arsenal of algorithms for finding motifs.
Let's try them.
But the first problem is that we have no clue of what size of motif to try.
Should we try k equal to 8 or k equal to 20? Let's try all of them
using median string problem and randomized motif search.
And you will see that, for different k, we are getting quite different results.
It is not clear which one to choose.
for median string, we actually will have
difficulties going beyond k=12 because
it is an exponential algorithm whose running time grows as four to the power k.
But for randomized motif search we can search
for very long motif and, for k=20,
we actually find this rather long motif.
You can see that there are similarities
Between the 12-mer found by the Median String.
the 12-mer found by the RandomizedMotifSearch,
and the 20-mer found by the RandomizedMotifSearch. Which one is correct?
To answer this
question, we need to introduce the notion of the motif entropy.
And entropy of motif is defined in the following way.
First, we construct the profile matrix.
Every column of the profile matrix correspond to a biased four-sided dice.
Or, in other words, to a probability distribution, (p1, p2, ...p_n).
Sum of p_i is equal
to 1, of course. And entropy is simply defined as
- p_i * log (p_i)
and summed up over all indices i. For example, we can compute the entropy of
the first column, and it will be 1.147.
We can compare it with the minimal
possible entropy, which corresponds to the extremely conserved motif
(where all symbols in the column are the same),
and with the maximal entropy
(where symbola in the column are equally likely).
So the entropy of a motif is defined as the sum of the entropies of its columns.
And now, the moment of truth.
Let's see how the algorithms that we developed
find out whether the motifs that we find are close to the biological one.
Here is the motif logo of the dormancy survival regulator site.
The motif logo is constructed in such a way that the total length of four
nucleotides in the motif logo correspond to two minus entropy.
So for completely conserved column, like nucleotide G
in this motif, we will have height two.
It is the most conserved columns of the motif.
The relative sizes of four nucleotide in
each position correspond to frequencies of these nucleotides in the column.
This is an example of
Now, let's see how it works for the tuberculosis example.
You can see that our randomized motif
search has actually captured a good portion of the motif.
But you can also see that we missed in some places.
We need to work harder and to figure
out what to do to develop better regulatory motif finding algorithms.
We move next to another randomized algorithm called Gibbs sampler.
© 2018 Coursera Inc. 保留所有权利。