Optional: Where in the Genome Does DNA Replication Begin? (Part 1)
支持 HTML5 视频
的 Web 浏览器
来自 美国加州大学圣地亚哥分校 的课程
Where in the Genome Does Replication Begin? (Part 1)
Optional: Where in the Genome Does DNA Replication Begin? (Part 1)
Optional: Where in the Genome Does DNA Replication Begin? (Part 2)
Department of Computer Science and Engineering
Department of Computer Science & Engineering
Hello and welcome to the first lecture of the Bioinformatics Algorithms class!
When I started teaching bioinformatics two decades ago, one of the biggest challenges
was to connect with a very diverse audience. I had to talk to computer scientists and biologists,
to physicists and biochemists, to mathematicians and statisticians.
The challenge was how to keep computer scientists focused while explaining sometimes convoluted
biological concepts. And likewise, how to keep biologists awake
while explaining sometimes complex algorithmic ideas.
I hope that in this class we have an equally diverse audience, and that's why I decided
to start from a very simple problem. Here is the first problem in this class.
Given a string called Text1, copy it to a string called Text2.
And I see that some of you are already thinking about quitting this this class because it
is so trivial. Please wait. You might think that it is trivial,
but billions of cells (replicating your own genome right now) don't think so.
In fact, they are trying to accomplish the incredibly complex task of replicating genome.
Just think about this: you need to unwind DNA (two strands of DNA) and separate them.
You need to make sure that there are enough free-floating nucleotides so replication can
complete because if you start replication and you have to stop in the middle because
there is no energy left, you are dead. So, in this course, we will learn about this
and many other challenges related to replication. In fact, replication of the human genome is
so complex that biologists still do not fully understand how its done.
That's why we will focus on replication of bacterial genomes that are 1000 times smaller
(their genomes are 1000 times smaller) and that's why replicating bacterial genomes is
a little bit easier. I am not saying that it is trivial.
And in fact, right now, there are billions of bacterial cells trying to replicate bacterial
genomes inside you because the number of bacterial cells inside you is 10 times larger than the
number of your own human cells. Now, since bacteria usually have circular
chromosomes, I have to slightly change my problem.
My new problem is: Given a CIRCULAR genome called Text1, copy
it to a CIRCULAR Text2. It is equally trivial, but it makes some slight
changes in the setup that we need to address.
A circular genome does not have a start and an end, and that's why the roughly 5 million
nucleotide-long E. coli genome has a choice on where to start replication.
In this class, we will try to figure out where it starts because all E. coli cells on Earth
start replication in the same region, called the "origin of replication".
So, our first problem is "Where in the genome does it all begin?" and the only information
that we have is the genome itself. Let's go ahead and first try to answer the
question: What are the hidden messages in the replication origin that help the E. coli
genome figure out where to start.
And here is the first, or should I say the 3rd problem in this course: finding the origin
of replication. I'll give you a genome, you will have to give me the location of origin
of replication in this genome. Does the problem make sense?
Yes [from the audience] You may be right or wrong depending on who
you are. "Are you a computer scientist or a biologist?"
Computer scientist [from the audience] Then you are wrong. If you were a biologist
then you could think about an experiment. For example, you can start cutting short pieces
from DNA, and when you cut a piece and the cell suddenly lost ability to replicate, it
means that probably you cut out the origin of replication.
But if you are a computer scientist, you should just refuse solving this problem; it is an
ill-defined computational problem. So, how can we turn this ill-defined computational
problem into something that makes sense? Lets try to ask a question: "how does the
cell know to begin replication in short oriC region?"
There must be some hidden messages in the genome that tell the cell "Start replication
right here!" What are these hidden messages?
Let's first try to formulate the problem right. The right formulation will be the Hidden Message
Problem. In this case, Input is a string Text representing
the replication origin, and Output is the hidden message in Text.
Is this problem clear? If you feel this problem clear, please raise
your hands. 1, 2, 3, 4, 5, 6, 7, 8...
This problem is absolutely unclear.
It is equally poorly defined as the first problem I showed you. Indeed, Son is not happy
again because what is a hidden message? Have I described it? No.
So, for a computer scientist there is not even a starting point to address this problem.
But let's go two centuries back and let's see how somebody in a somewhat similar situation
tried to solve a similar problem. We go to "The Gold Bug" short story by Edgar
Allan Poe; in this story, the hero, whose name was Legrand, tried to decode the message
left by pirates. And his hope was that when he decoded the
message, he would be able to find pirate treasure, so he was quite motivated.
The only thing he saw is this text. How would you tell what is written, what is
coded in this text? What Legrand noticed is that a combination
of three symbols ";48" appears surprisingly frequently in this text.
Does it suggest an idea on what is written in the message?
Here is the hint: The message is in English" And here is another hint: "THE" is the most
frequent word in English And that's why he was able to substitute ";48"
for "THE" and this message started to make sense. And if you have time, you can complete
the decoding. Now, if our original Hidden Message Problem
was clearly ill-defined, let's now try to redefine it so that it would make sense even
for a computer scientist. How we would redefine it? Well, for various
biological signals, certain words appear surprisingly frequently in the text. For example, on this
slide, AATTT appears surprisingly frequently in a short text -- its unlikely that it would
happen by chance.
Then we formulate our first real Rosalind problem. The Frequent Words Problem is: "I'll
give you a string and an integer k, and you need to find all most frequent k-mers in Text,
which means all most frequent strings of length k in the Text.
Is the problem clear now? Let's ask Son. He feels better, but he is still not clear
because he does not know what exactly a "most frequent word" means.
I think he is a little bit picky today because it's easy to define what a most frequent word
means. A k-mer Pattern (a string of length k) is
a most frequent word in Text if no other k-mer is more frequent than Pattern.
And now even Son is satisfied and I want to thank Son for great help with development
of this lecture. Now, Son is satisfied but we should ask biologists
whether this problem actually makes sense. Remember, bioinformatics is about connecting
computational and experimental ideas. And if we check with biologists, it also looks
that they are quite satisfied with this problem. Indeed, replication is performed by DNA polymerase,
but to initiate replication, DNA polymerase needs a protein called DnaA.
DnaA binds to a short region (typically just a 9 nucleotide-long segment) within the replication
origin that is known as a DnaA box.
A DnaA box is actually the hidden message that we are looking for, and the DnaA box
actually tells the DnaA protein: bind right here! And the DnaA protein wants to see multiple
DnaA boxes to bind efficiently. So, it looks like we brought computational
people and biologists on the same page, and we are now ready to try to solve this problem.
Since it's the first problem at Rosalind, we expect a wide variety of different algorithms
proposed by you. The simplest algorithm probably is this one.
Start from the 1st k-mer in the Text, slide it through the Text and see how many identical
k-mers are present in the Text. How much time it takes? It takes time roughly
equal to |Text| ∙ k. But you need to do it for every k-mer in Text, and there are
roughly Text of them. So, the overall running time of this very
naïve algorithm is |Text2|∙ k. If |Text| is one million, you may have problems,
but in our case |Text| is typically small. The average length of the replication origin
is just 500, maybe 700 hundred nucleotides.
In this course, we will soon learn much more sophisticated algorithms: for example, some
of you may be amazed that the same task can be done incredibly fast. And we will cover these algorithms in the class.
© 2018 Coursera Inc. 保留所有权利。