对如何在科学的

Loading...

来自 University of California San Diego 的课程

当生物学遇上编程：生物信息学入门

459 个评分

对如何在科学的

从本节课中

Week 4

Which DNA Patterns Play the Role of Molecular Clocks? (Part 2)

- Pavel PevznerProfessor

Department of Computer Science and Engineering - Phillip CompeauVisiting Researcher

Department of Computer Science & Engineering

[MUSIC]

We will now talk about how this issue of zeros in

randomized algorithms can be addressed. And it's not a minor annoyance,

it's actually a serious issue because randomized algorithms do

do not like working with zeroes.

We will now learn how to deal with zeroes in the profile matrix.

And in the real world, there is always a possibility

that we do not observe one of the possible events.

In this case, we assign them the empirical probability zero.

But in fact, their real probability is not zero.

It's simply the size of our sample is too small.

And randomized algorithm hate zeroes because zeroes

affect their performance in a serious way.

That's why we need to learn how to get rid of zeroes.

Fortunately, Laplace helped us to figure out how to do this.

Laplace calculated the probability that the sun will not rise tomorrow,

given the fact that it has risen in the last 5,000 years.

And this probability is roughly one in two million, according to Laplace.

His estimate was ridiculed because his opponents

did not realize how important this calculation was.

If we repeat an experiment that can result in a

success or a failure n times and get s successes,

then what is the probability that the next repetition will be a success?

In other words, if X_1, X_2,

X_{n+1} conditionally independent Boolean

random variables (where failure corresponds to 0 and success corresponds to 1),

then the traditional way to compute the probability of X_{n+1} is simply

to compute the fraction of successes among among all trials, so it will be s/n.

However, since we have a prior knowledge,

that successes and failures exist, then they actually have

two more observations implicitly, even before we started our trials.

So, even before we started our trials we

should acknowledge that success and failure are possible and

therefore there are two more "invisible" events.

That was Laplace argument, and therefore he made a correction

based on the fact that when we see n events, we actually see n+2 events.

And when we see s successes, we actually see s+1 successes,

and the formula changes to (s+1)/(n+1).

It may sound counter-intuitive but let's see how Laplace

would get rid of zeroes in the profile matrix.

In this case, what we need to do

to update the profile matrix to get rid of zero?

According to the Laplace's rule of succession, we

should assume a possibility that every event represented in the

profile matrix can argue.

And therefore we should add 1s to every entry of the profile matrix.

And here I show how Laplace would update the

the count matrix by adding 1s to all entries.

As a result, the profile matrix also will get updated, all probabilities

will get updated, and we

will get rid of all zeroes.

Now, instead of a rolling a one-sided die, we can actually roll a seven-sided die.

And based on the result of rolling a seven-sided die, we will

choose a new instance of the motif in the sequence.

Here it is.

Afterwards, we iterate, and let's see how our motifs are changing.

Starting from this, we once again continue

with Gibbs sampler, remove one of the sequences,

form motif matrix, build count matrix, it

update it with pseudocounts by adding ones to every entry,

compute profile matrix once again, compute probability of every k-mer,

and once again roll a seven sided die. The seven-sided

die leads to changing once again one of the choices of our motif and we

start a new iteration.

Again, we randomly remove one of the sequences, construct count matrix,

apply Laplace's rule of succession, create profile,

calculate probability again, roll a seven-sided die,

and derive (once again, starting from completely random motif),

in just three iterations,

we almost captured the correct implanted motifs.

Which means that the bias that implanted motif introduced was correctly

captured by Gibbs Sampler, and led us to the correct motif.

We are not still there, because, in the second

sequence, we have not found the correct motif yet.

But surely, at the next iteration, with high probability, we'll find it.

We're almost

done with covering randomized algorithms, but the

challenge of predicting the degenerative signals remains.

Remember how randomized motifs search did not find

the perfect binding sites for this motif?

What can we do?

Well, there is one possibility. Maybe, maybe we can develop even better

algorithm for finding motifs, but if you

think about this, imagine that you implant not

not a motif motif of length 15 with 4 mutations,

but motif of length 15 with 5 mutations.

Finding this motif is almost impossible, not because we don't have

efficient algorithms, but because random motifs

start competing with real motifs. Simply statistically,

it may not be possible to find this motif if we

implant them in long DNA strings. For example into strings of length 1,000.

Well, what can be done?

Should we give up in this case? Biologist did not give up and the

way to solve this problem today is to resort to the technique

which is called ChIP-sequencing. Essentially what the cheap ChiP-sequencing

does, it reduces the length of DNA strings to look for regulatory motifs.

As I mentioned, when you insert elusive patterns in the

strings of length 1,000, it may not be possible even to find it even with the best

available algorithms. But if biologists reduced the area

where to look for motif, let's say from 1,000 to 100, then suddenly

statistics works in our favor, and we can find such motifs.

And this is the end of the story about randomized algorithms.