Optional: From Implanted Patterns to Regulatory Motifs (Part 1) (10:09)
支持 HTML5 视频
的 Web 浏览器
来自 University of California San Diego 的课程
University of California San Diego
Which DNA Patterns Play the Role of Molecular Clocks? (Part 1)
Optional: From Implanted Patterns to Regulatory Motifs (Part 1) (10:09)
Optional: From Implanted Patterns to Regulatory Motifs (Part 2) (05:06)
Optional: From Implanted Patterns to Regulatory Motifs (Part 3) (07:22)
Department of Computer Science and Engineering
Department of Computer Science & Engineering
Today we will talk about algorithms for finding regulatory motifs in DNA.
And we will start from game when we will be implanting patterns in random strings.
Let's generate ten random sequences.
And then let's take a 15-nucleotide-long pattern, and insert it at
random positions in these ten sequences. Now, turn your head,
I will hide these patterns, and tell me where these patterns are.
What algorithm should you run to figure out where I have inserted this pattern?
And you already probably recognized that if we slightly modified our
frequent words problem, we will find the pattern I have inserted.
Indeed, you can concatenate all the sequences and
find the most frequent word in the resulting concatenate.
The implanted pattern will appear ten times in this concatenate.
It is surprisingly frequent, which will most most likely mean
that it was the pattern that I have inserted.
Now let me change the experiment slightly. Now instead of the
I implant without any changes,
I will insert the same 15 nucleotide long patter with
four random mutations at random positions.
And, in this is case, this pattern forms a so-called (k,d) motif
which is a k-mer that appears in every sequence
at most d mutations.
Can you find such a pattern if I hide from you where I inserted it?
Do you think that the Frequent Words Problem will be able to help us?
Well, that all may be entertaining
but what biologists think about this game?
I think biologists are falling asleep, because it's absolutely
unclear what this problem has to do with real biological problems.
And I will try to show you that this problem is
actually about a biological problem of finding regulatory elements in DNA.
My next question will be, do we have a clock gene?
Think about our daily schedule
and think how our life depends on day and night.
In every cell, we probably have to express different
proteins depending on what time of the day is it.
But who are these molecular time keepers who tell the
cells in our body what is the time right now?
How do these molecular time keepers change gene expression to
produce proteins needed at night as opposed to proteins needed in the morning?
And anoother questions that is relevant to this, can we find genes responsible
for sleep disorders When our Circadian Clock has problems.
Now, we will focus on plants rather than human because
for plants, keeping time is a matter of life and death.
Just think about photosynthesis, flowering, or frost resistance.
If you don't know what is the time, you will be dead if you are a plant.
And, the question that arises is how probably a thousand plus
genes in plants follow the circadian clock.
Who controls them?
Who are the molecular managers who tells these
genes to increase or decrease gene expression?
Here comes the surprise.
Just three genes in plants are molecular managers that are
responsible for orchestrating this circadian behavior.
And they're called CCA1, LCY, and TOC1. These genes are regulatory proteins,
also know as transcription factors. And these genes,
to control other genes,
(to exert control over circadian clock) they bind to short regions
(maybe 10 nucleotide, 15 nucleotides) in the upstream regions of the genes.
For example, if one of these regulatory
proteins wants to control a particular circadian gene,
it probably has a region within thousand nucleotides
from the start of the gene where it binds.
For example, here's CCA1, one of the three regulators.
and to exert control over these genes, it has to bind in the upstream reason
of these genes. But how does CCA1 know where to bind?
Probably there are some hidden messages that tell CCA1: "Bind here!"
Our goal today is to find this particular hidden
message (that's shown on the slide) where CCA1 binds to.
Of course, we don't know where these hidden messages are actually hiding.
In the upstream regions, and today we will learn about algorithms
that are aimed at finding such hidden messages.
We will start from formulating the problem.
So, when we talked about implanting a 15-mer with four
random mutations, we figured out that the Frequent Words Algorithm wouldn't be
able to find it, because there are no frequent words
that appearing without mutations in the resulting strings after implantations.
That's why we need to solve a different problem that I call the
Implanted Motif Problem. In this problem, you are given a set of
stings Dna, and integers k and d. And you need to find all (k,d) motifs in Dna.
How should we solve this problem?
Should we possibly explore all possible 4 to the power k k-mers and
see which of them represent (k,d)-motifs. That will take time.
That's why, let's try something else.
Let's try to see whether a pairwise comparison
(comparing sequences to each other) would help.
Think about this.
Each sequence is random.
The only non-randomness in these sequences is
are these implanted motifs, and that's why they probably exhibit
larger similarity than other regions of random sequences.
Thus, my idea, let's find the most.
let's say in the first and second sequence,
and it will give us an idea on how the implanted pattern looks like.
Would it work?
Let's figure out whether the pairwise comparison between
strings will help us to find the implanted motifs.
Well, unfortunately, it won't work because when we implant a pattern, it has
four differences from the original pattern, but we don't have
access to the pattern, so there is nothing to compare with.
The only thing that we have are implanted instances.
But every two implanted instances actually
may be four plus four mutations apart.
How we can find them? Since pairwise comparison won't work,
maybe the only option we have is just to
explore all four to the power k possible k-mers.
Should we explore all of them?
Not necessarily, because if a k-mer is so far away from all k-mers
in the strings that we analyze, there is no reason to explore it.
It cannot possibly be an implanted pattern.
That's why motif enumeration algorithm would
look like this.
We will start from each k-mer, from a sequence.
Let's call it a.
For each such k-mer, let generate all possible mutations with up to
d nucleotides mutated, and for each such mutated k-mer
a', let's check whether it appears as a (k,d)-mer in the string.
That will work,
(it will be slow but definitely will work if k is small)
but the question we should ask, would it solve the real biological problem?
Unfortunately it won't, it won't work because our
model for implanting patterns is not very biological.
It doesn't properly reflect biological reality.
I would say a little bit
messy biological reality, because when biologist generate,
let's say, a set of genes, that are
controlled by circadian clock, they cannot guarantee that
all these genes have a particular pattern implanted.
Some genes don't have any patterns implanted in their upstream region.
As a result, we need to find out a way to search for motifs, even if
some sequence don't have any motifs implanted.
And that makes our life, little bit more difficult and we need to develop scoring
for every set of motifs, even in the case when some of the implanted pattern
don't look very similar to the canonical motif.
And we'll move to the next topic, which is motif finding problem
that is more
adequate for finding regulatory motifs.
© 2018 Coursera Inc. 保留所有权利。