0:00
Hello.
I haven't seen you for a long time, since we were working on the change problem.
But today we'll work on a completely different topic called String Algorithms.
String Algorithms are everywhere.
Every time you spell check your documents, or Google something,
you execute sophisticated String Algorithms.
But today, we walk about very different application of String Algorithms.
Sam Berns gave a fantastic pep talk when he was 16.
He was talking about his life and a year later, he died.
0:54
But for many years biologists have no clue of what causes progeria.
But in 2003, they figure out that progeria is caused by single mutation,
on chromosome one.
To understand what pattern matching has to do with progeria,
we need to learn something about genome sequencing.
1:16
When my children were young,
that's how I was explaining them how genome sequencing work.
I was using an example of the newspaper problem.
Take many copies of the identical New York Times newspaper,
then set them on a pile of dynamite.
1:35
Don't do it at home, and then wait until explosion is over, and
collect the remaining pieces.
Of course many pieces will burn in the explosion, but some pieces will remain.
And so your goal is to reconstruct the content of the New York Times.
A natural way to solve the newspaper problem,
is to consider it as an overlapping puzzle.
Look at different pieces and try to mix them together like this.
2:06
And then slowly but surely, hopefully you'll be able to assemble the whole gem.
And that's roughly how the human genome was assembled in 2000.
And here, Bill Clinton is congratulating Craig Venter,
one of the leaders of the Human Genome Project,
on completion of this $3 billion mega science project.
We don't need to know much about genomes for the rest of these talks.
The only thing we probably need to know is a genome is simply
a long strand in A, C, G, T alphabet.
2:43
I will try to explain how the newspaper problem translates into genome sequencing.
They start from millions of identical copies of a genome.
Then they break the genome at random positions using molecular scissors.
These molecular scissors don't look quite like this one shown in this picture.
Then we generate short Substrings of the genome called reads,
using modern sequencing machines.
Of course, during this generation some reads are lost.
And the only thing left is to the assemble the genome from millions, or
maybe even billions of tiny pieces.
The largest jigsaw puzzle humans ever attempted to solve.
Today, I won't be able to tell you about algorithms for genome assembly.
But if you are interested in learning about this algorithm,
you can attend or Bionformatics specialization at Coursera.
Or read the book, Bioinformatics Algorithms.
Assembling human genome was a challenging $3 billion project and
afterwards the era of genome sequencing began.
But in the first ten years after sequencing human genome,
biologists were able to sequence only about ten
other mammalian genome because it was still difficult.
4:23
Why do biologists sequence thousands of species?
There are many applications.
For example,
the next big science sequencing project after the human genome was mouse genome.
Because we can learn lot about Human biology and diseases from mouse gene.
An important application in agriculture, for example by sequencing rice genome,
biologists are able to develop new high yield crops of different plants like rice.
And there are many hundreds and hundreds of other applications.
4:55
But recently, in addition to sequencing of many species,
there is also much effort on sequencing millions of personal genomes.
The things that make us different are mutations.
However, there are surprisingly few mutations
that distinguish my genome from your genome.
Roughly one mutation per thousand nucleotides.
However, this mutation make big difference.
They account for height or
they account for more than 7,000 known genetic diseases.
5:32
Five years ago, the era of personalized genomics started and
Nicholas Volker is a foster child of personalized genomics.
He was so sick that he went through thousands of surgery, but
doctors still were not able to figure out what is wrong with this kid.
However, after he is genome sequence can reveal a mutation
in a gene linked to defect in the immune system.
Doctors applied immunotherapy and Nicholas Volker is now a healthy child.
However, sequence in personal genomes,
from scratch, still remains difficult even today.
What biologists do today however, they do so
called reference base human genome sequences.
Let's start from Craig Venter genome assembled in 2000,
call it reference genome.
And then let's start sequencing my genome by generating all reads from my genome.
Here's some of the three perfectly match to genome, but some of them don't.
And based on these reads that do not match,
we will be able to figure out what is my genome.
For example, we can find a mutation of T into C and
deletion of T in my genome as compared to.
It brings us to a number of computational problems.
The easiest one is the exact pattern matching.
Given a String pattern and a String text, we want to find all positions and
texts where pattern appears as a Substring.
7:13
But our goal is to find mutations, and
therefore we want also to solve Approximate Pattern Matching Problem.
Where input is a string Pattern, a string Text, and an integer d.
And we want to find all positions in Text where Pattern
appears as a Substring with at most d mismatches.
7:51
To answer the question where do billions of reads generated for
me, match the reference genome from.
And this leads us to Multiple Pattern Matching Problem given a set of strings,
Patterns and a string Text, find all positioning texts where
a string from Patterns appears as a substring.