0:10
So the Human Genome Project was first proposed in the late 1980s
by scientists at the US Department of Energy.
Many people don't realize it was not the National Institutes of Health,
the NIH, that proposed it, but rather the DOE.
And the reason the DOE was interested in genomics was they were studying
effects of radiation on DNA.
But never mind that.
When the project was first proposed, it was considered to be extremely ambitious,
and many scientists were actually against it.
The idea was that it would be biology's Manhattan Project, by far
the largest project that biology had ever taken on.
But as scientists started to discuss it in the late 80s,
it quickly gained momentum, and soon it was approved, and
the NIH joined in, and then many other countries joined as well.
So the project officially started in 1989 as a joint effort of the NIH and
the DOE in the United States, plus many other countries.
Outside of the US, the Sanger Centre in England was the largest sequencing center.
1:05
So the goal of the project was very simple.
The human genome is 3 billion base pairs long.
In the 1980s, sequencing was still a very new technology.
The automated sequencing that was available at the time was very slow and
expensive.
It was around $10 a base to sequence in the 1980s.
So that was really expensive: at that price, the whole genome would cost $30 billion.
But the scientists who were proposing it said, well,
we know things are getting faster and more efficient, so
we're going to assume that prices will drop by a factor of ten.
And we'll probably be able to sequence the genome for $1 a base.
So they came up with an estimate of $3 billion,
which is the number that has been widely reported as what the project cost, and is
probably a bit of an overestimate because cost went down quite a bit more than that.
But anyway, that was the goal, sequence all 3 billion base pairs
in the human genome for $1 a base, and finish it in 15 years, by 2005.
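The arithmetic behind those two budget figures can be written out as a small sketch. The function name `project_cost` is just illustrative, and the genome size is the round 3 billion figure quoted above.

```python
# Back-of-the-envelope budget for sequencing the human genome,
# using the round numbers quoted in the lecture.
GENOME_SIZE_BP = 3_000_000_000  # about 3 billion base pairs


def project_cost(cost_per_base_usd, genome_size_bp=GENOME_SIZE_BP):
    """Total cost in dollars at a flat price per base."""
    return cost_per_base_usd * genome_size_bp


cost_1980s = project_cost(10)   # price per base in the 1980s
cost_target = project_cost(1)   # after the assumed tenfold price drop

print(f"At $10 a base: ${cost_1980s:,}")
print(f"At $1 a base:  ${cost_target:,}")
```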
Now, one reason people were opposed to this was that we knew already at the time,
and we certainly know very well now,
that only about 1.5% of your DNA encodes proteins.
So people said, well, most of the DNA is just junk,
stuff that isn't really biologically important or useful.
We now know that that's not really true, but at the time it was widely believed
that most of the sequencing would be dedicated to learning
sequences that didn't really have any biological function or consequence.
So some scientists were opposed to the project because they thought it
would be a waste of time and a waste of money that would be better spent
trying to target the genes in the genome.
Nonetheless, the project took off and quickly gained momentum.
2:28
So, you might have heard about the Human Genome Project as a race.
Well, in the early 1990s it wasn't really a race.
I'm going to get to that in just a few minutes.
So the way the project started was that scientists around the world worked
on what were called maps.
So the idea of the project and the plan for the overall project from the beginning
was that we would take large chunks of DNA, and these were about 150,000 base
pairs long, that were called bacterial artificial chromosomes or BACs.
So we would take these chunks,
grow them up in E. coli bacteria to make as many copies as we wanted,
sequence them, and then stitch those pieces together.
And that was because at the time, the best we could do in terms
of sequencing technology was to sequence
little tiny fragments from a slightly larger chunk.
And these bacterial artificial chromosomes,
or BACs, were about the largest chunks people thought they could handle.
So the real problem was whether you could assemble those little tiny fragments, or
reads, together, and we knew we could do it for 150,000 base pair chunks.
The problem, though, was that while
it was easy to create these BACs,
you had to figure out where they went in the genome before you sequenced them.
So the idea was to develop libraries, as they were called,
with hundreds of thousands or millions of BACs in them.
Then select those BACs, figure out where they went in the genome, and create
a tiling path, basically aligning the pieces of BAC DNA across the genome.
And then finally, when those maps were done, we would sequence the BACs.
And the idea was that as mapping went on, the funders would fund that effort,
sequencing, meanwhile, would get more efficient,
and when we finally got around to sequencing, it would all be $1 a base.
So that was the idea, and that was moving along steadily throughout
the early 1990s, as was technology development.
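To make the tiling path idea concrete, here is a minimal sketch. It assumes the mapping step has already placed each BAC at known start and end coordinates, and then greedily picks a small set of overlapping BACs that covers a region. The BAC names, the coordinates, and the `tiling_path` function are all hypothetical; this is not the procedure the mapping labs actually used.

```python
def tiling_path(bacs, region_end):
    """Greedily choose BACs that cover positions [0, region_end).

    bacs: list of (name, start, end) tuples with half-open coordinates.
    At each step, among the BACs that start at or before the covered
    position, take the one that reaches furthest to the right.
    """
    ordered = sorted(bacs, key=lambda bac: bac[1])
    path = []
    covered = 0
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in BAC coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Hypothetical library of five mapped BACs, each about 150,000 bases long.
bac_library = [
    ("BAC_A", 0, 150_000),
    ("BAC_B", 40_000, 190_000),
    ("BAC_C", 120_000, 270_000),
    ("BAC_D", 200_000, 350_000),
    ("BAC_E", 260_000, 410_000),
]

path = tiling_path(bac_library, region_end=400_000)
print([name for name, _, _ in path])
```

Only the BACs on the chosen path would then need to be sequenced, which is why the maps had to come first.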
But then, something happened that kind of changed the game rather dramatically.
In 1995, a small non-profit research institute called TIGR, The Institute for
Genomic Research, sequenced the first complete bacterial genome ever to be done,
the genome of Haemophilus influenzae, which is an infectious bacterium.
This genome is about 1.8 million bases and has 1,742 genes.
And this project was led by Craig Venter, who was the founder of TIGR, and
Hamilton Smith, a professor at Hopkins, who is also a Nobel laureate.
So what was different?
Why would this change things?
This is a tiny genome; bacterial genomes are far smaller than the human genome,
about 1,000 times smaller.
What was different was that this was done through
whole genome shotgun sequencing,
where you didn't create these maps, but instead
you took the whole genome and fragmented it into many, many tiny pieces,
tens of thousands of tiny pieces.
Then you just randomly sequenced those pieces, and by oversampling,
that is by sequencing every part of the genome many times over, you could
then use a computer program called an assembler to put it back together.
And people had never done this for
something on the order of a whole genome, even a whole bacterial genome before.
So this was dramatic, and it
certainly changed the field of microbial genomics at the time.
And everybody in the microbial world was very excited and
started proposing to sequence microbial genomes this way.
Meanwhile though, the Human Genome Project continued as planned,
mapping these 150,000 base pair chunks.
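The shotgun idea can be sketched as a toy example: cut a sequence into overlapping reads, shuffle them, and let a simple greedy assembler merge the pair with the longest overlap until one piece remains. Everything here is an illustrative assumption: the 40-base "genome" is made up, the reads are error-free and evenly spaced rather than random, and the functions `overlap` and `greedy_assemble` are not the assembler TIGR used. Real assemblers also have to cope with sequencing errors and repeats, which this sketch ignores.

```python
import random


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the two pieces with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave several contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up 40-base sequence with no repeated 4-base words.
genome = "ATGGCGTACCTAGATTCAGGAACGTTGACTCCGATAAGCT"
READ_LEN = 10

# Oversample: a 10-base read every 3 bases, so most bases are read several times.
reads = [genome[s:s + READ_LEN] for s in range(0, len(genome) - READ_LEN + 1, 3)]
random.Random(0).shuffle(reads)  # the assembler is not told the original order

coverage = len(reads) * READ_LEN / len(genome)
contigs = greedy_assemble(reads, min_overlap=4)

print(f"{len(reads)} reads, {coverage:.2f}x coverage")
print("reassembled correctly:", contigs == [genome])
```

The pairwise comparison of every read against every other read is fine for eleven reads, but it is one reason people doubted the approach would scale to millions of reads.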
So then things changed again a few years later in 1998,
and this is where the race really began.
So a new sequencing machine was developed by a company called Applied Biosystems.
And this machine was not dramatically more efficient than
the previous machines, but it was significantly faster, and it was easier.
It used capillaries to do sequencing,
that is, tiny little plastic straws that the DNA would flow through.
And
it let you automate the sequencing in a way that wasn't really possible before.
So with funding from Applied Biosystems, Craig Venter,
Ham Smith, and others left TIGR to form a for-profit company called Celera Genomics.
And this company's goal, its entire purpose for
being created was to sequence the human genome.
Not only were they planning to sequence the human genome, but what they proposed
at the time was that they would do it through whole genome shotgun sequencing.
That is, they would take the entire human genome, 1,000 times larger than
a bacterial genome, they would break that up into lots of little pieces,
millions and millions of little pieces, sequence those, and
somehow assemble them back together to create the whole genome.
Now this method didn't have to do the mapping.
The mapping was still going on, the BAC mapping was still going on in the publicly
funded Human Genome Project, but Celera wasn't going to do that.
They were going to skip all that, and go straight to sequencing.
7:16
So this was really a race.
Even so,
some people were skeptical about Celera's ability to assemble an entire
animal genome using this whole genome shotgun sequencing technique.
No one had really done it for anything larger than
a large bacterial genome, which is millions of base pairs, not billions.
However, soon after the formation of the company, Celera sequenced and
published the complete genome of the fruit fly, Drosophila melanogaster.
Now, the Drosophila genome is about 130 million base pairs long,
so still much smaller than human, about 20 times smaller.
But much larger, about 20 times larger than any bacterial genome or
than any genome that had been sequenced and
assembled through the whole genome shotgun technique up to that time.
So that was a success. It was published in 2000, and it proved that this whole genome
shotgun technique could scale up by a factor of 20.
And there was really no technical reason why it wouldn't scale up by another factor
of 20, and in fact, that's what eventually happened.
So this really proved that Celera meant business, and
it really spurred the public effort to accelerate its work even further.
So the race really heated up in 1999 and 2000.
In 1999, Craig Venter announced that Celera would finish their work by 2001.
Actually, originally he really announced 2003 because the public effort said they
were going to finish in 2005.
The public effort quickly responded by saying they would also finish in 2003.
Then in 1999 Venter announced that Celera would finish, in fact, in 2001.
And soon thereafter, within a matter of weeks, NIH and the Sanger Centre announced
that the public Human Genome Project would finish a draft genome by 2001 as well.
So everybody was racing.
Now, by the way,
these are what the scientific leaders of the projects were doing.
The actual people doing the work,
and I was one of those people, were mostly just panicking,
because we didn't really have any plan to finish that quickly, but
we figured we would have to give it a shot.
So in 2000, as the work really did seem to be getting close to completion for
a draft genome, NIH, the Sanger Centre, and
Celera Genomics talked about publishing jointly.
So there was a considerable effort to make this into one
final project that everybody would say we all did together.
However, in late 2000, those talks fell apart and two papers were planned.
9:28
So that's what happened.
In June of 2000 Bill Clinton and Tony Blair, the leaders of the US and
the UK at the time, jointly announced the completion of the human genome.
And you can see in the slide that Craig Venter is shaking hands with
Francis Collins,
who was the head of the National Human Genome Research Institute at the time.
So it was announced in the year 2000 that both groups were done and
that it was a tie.
And that's kind of how it played out.
Now at this point, the paper wasn't done and in fact,
at the time of this announcement, the genomes weren't done either.
But we knew we had about six months to get them done,
those of us who were actually in the trenches doing the work.
And so everybody put all their effort into
finishing up this draft genome as quickly as possible.
So now, whose genome did we sequence, by the way?
So, when you talk about sequencing the genome, remember that each of us on the planet, and
there are billions of us, has a different genome.
Now, our genomes are all very, very similar, probably only differing by about
one position in a thousand, but they're all different.
The Human Genome Project sequenced one genome which was a mosaic of about a dozen
volunteers who contributed DNA, all anonymously to the Human Genome Project.
All of them were Northern European in origin, so
they all had a similar genetic background.
And the assembly of this original genome represents that one sort of mosaic of
a small collection of individuals of Northern European descent.
Now since then,
we've gone on to sequence the genomes of other people from other populations.
But at the time, that was what we did.
It wasn't one person's genome, but a few people's genomes.
10:53
So what did the genome tell us?
Why do we do this?
So, one of the major goals
of the Human Genome Project was to identify all the genes.
Identify all their sequences, eventually figure out what they all do, and
use that to develop better treatments and improve human health.
So what I'm showing you here is one
of the first papers ever to attempt to estimate the number of human genes.
This is a paper that appeared back in 1964 in the journal Nature, so
40 years before the Human Genome Project completed.
This was actually
very soon after the genetic code was worked out.
And what this scientist named Vogel did was look at
the first two genes whose sequence had been determined, which were human hemoglobin
subunits.
And these are fairly small genes,
they're about 146 amino acids long, and he knew how much they weighed.
We had pretty good measurements of how much those amino acids weighed.
We also had pretty good measurements of how much DNA weighed.
So you could say, well, 146 amino acids, at 3 nucleotides for each amino acid,
so you'd know how much the DNA encoding that gene weighs.
And you could also measure roughly the weight of the genome in a cell.
So he basically took the weight of one gene, divided it into the weight
of the genome, and assumed that basically the genome
was just gene after gene, end to end, encoded there.
Now remember, this was 1964, no one knew otherwise.
We now know that only about 1 to 2% of the genome encodes genes, but
that wasn't known at the time.
So, using this estimate and not knowing anything about exons and introns and
about all the intergenic DNA and so-called junk DNA,
you would come up with a number of around 6.7 million genes, which is wildly off.
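Here is a rough reconstruction of that style of estimate, counting in base pairs rather than in weight. It assumes the round 3 billion base pair genome size quoted earlier in the lecture; Vogel worked from measured DNA weights, so his exact inputs differed, and the variable names are only illustrative.

```python
# Naive 1964-style gene count: pretend the genome is nothing but
# hemoglobin-sized genes laid end to end.
AMINO_ACIDS_PER_GENE = 146        # length of a hemoglobin subunit
NUCLEOTIDES_PER_AMINO_ACID = 3    # one codon per amino acid
GENOME_SIZE_BP = 3_000_000_000    # assumption: modern round figure

gene_length_bp = AMINO_ACIDS_PER_GENE * NUCLEOTIDES_PER_AMINO_ACID
naive_gene_count = GENOME_SIZE_BP / gene_length_bp

print(f"DNA per gene: {gene_length_bp} bases")
print(f"Naive gene count: about {naive_gene_count / 1e6:.1f} million")
```

With these inputs the result lands in the same several-million range as the published figure, which is the point: the method itself, not the measurements, is what pushes the count so high.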
So the whole Human Genome Project presumably would give us a much better
fix on this number.
There are many, many other things, of course, that we can learn from the genome,
but this is one of the simple messages we should be able to get out of
the genome: how many genes do we have, and what are they?
12:48
So the two papers appeared in February 2001.
They actually appeared simultaneously, with lots of hoopla.
The public effort published its genome in Nature, and
here's a picture of the cover from the time.
And they estimated 30 to 40,000 genes.
Now, one interesting thing about that, and that was the official estimate in
the paper, is that it seems like a very imprecise number.
And 10 or 15 years earlier, no one would have imagined that the genome
would be finished and we still wouldn't know precisely how many genes there were.
But it turns out that it's much harder to figure out exactly what the genes are from
the DNA sequence than we had realized.
So they gave a very rough estimate of 30 to 40,000, and that number
was a lot smaller than estimates that had been discussed even as recently as a year
earlier when people were still proposing 100,000 genes in the human genome.
So that was a surprise that there were so few.
The other paper was published in Science,
this is the paper led by the group at Celera Genomics,
which included scientists from around the US and Europe as well.
And that number was much more precise-seeming, 26,588 genes.
But there were approximately 12,000 additional likely genes
that were also described in that paper.
So
if you add those two numbers together, it's in the high 30,000 range.
So that was consistent with what was in the Nature paper.
And let me just mention that I was on this paper,
buried in this long author list as well.
14:31
So that gene count number is an interesting story in itself.
So if you look at how it's evolved over the past 40 years, or now 50 years,
since that early 1964 paper, which estimated millions of genes,
starting around 1990, people
were bandying about estimates generally in the 100,000 range.
But this chart shows a number of publications that kept moving that number
around starting at 100,000 going down to,
there were estimates published in the 60,000 range, the 50,000 range, and so on.
And it gradually decreased until today we believe the number is around 22 to 23,000.
But we're still not sure of the precise number, even today,
we don't have a precise number of human genes.
And an important caveat here is, this gene count the way it was originally proposed
and the way I'm describing it now, refers to the number of protein-coding genes.
That is, a piece of the human genome,
a piece of DNA that gets transcribed into RNA, which then gets translated into a protein.
Everyone agrees we call that a gene.
However, for at least the last 20 years,
we've known that there's some number of genes in your genome
where the DNA gets transcribed into the RNA, and the RNA itself has a function.
We would call that an RNA gene, it never gets translated into a protein.
Starting in the late 2000s,
through a new technique called RNA-seq, we've learned that there are many thousands,
really probably tens of thousands, of RNA genes in the genome.
So this gene count that we've been looking at for the past several decades
only looks at one side of the coin, we're only looking at protein-coding genes.
We know the number's quite a bit higher if you look at RNA genes.
And that number is even less precise today.
Over the coming decade, we probably will get a better handle on it.
But I wouldn't promise we'd have a precise answer to this question
even ten years from now.
16:11
So let's just review.
The Human Genome Project started in the late 80s.
It officially began in 1989 and
1990 with the goal of sequencing 3 billion base pairs.
Did they achieve that?
Yes, they did.
The goal was to do it for $1 a base.
How did they do on that?
Well, at the time the genome was published, it cost about $1 for a read.
A read was a single sequence coming off a sequencer, it was about 700 bases long.
So not only did they achieve that goal,
they were 700 times cheaper than they thought it would be.
So that was really a dramatic success.
In terms of time, they wanted to get it done by 2005.
It was done in 2001, or at least the papers on a draft genome were published in 2001.
So you could say they dramatically exceeded all of their goals.
And now, the final note is that the cost in 2001,
$1 per read, $1 per 700 bases,
seemed quite dramatically better than we'd expected.
And it did continue to slowly drop for a few years thereafter.
But then, due to dramatic changes in technology, what's called next generation
sequencing, the cost has dropped far, far more than that.
So today, it costs about $1 for 3 million bases,
which is 4,000 fold cheaper than it cost even when the human genome was finished.
And that's what's led to this tremendous explosion in all
sorts of genome sequencing experiments that we're experiencing today.
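The cost comparisons in this review can be checked with a few lines of arithmetic. The figures are the round numbers from the lecture; the variable names are only illustrative, and exact fractions are used so the ratios carry no rounding error.

```python
from fractions import Fraction

# Dollars per base at three points in time, from the figures in the lecture.
cost_goal = Fraction(1, 1)            # original target: $1 per base
cost_2001 = Fraction(1, 700)          # $1 per read of about 700 bases
cost_today = Fraction(1, 3_000_000)   # $1 per 3 million bases

improvement_vs_goal = cost_goal / cost_2001
improvement_since_2001 = cost_2001 / cost_today

print(f"2001 versus the goal: {float(improvement_vs_goal):.0f} times cheaper")
print(f"Today versus 2001: about {float(improvement_since_2001):,.0f} times cheaper")
```

The second ratio comes out a little above 4,000, consistent with the rounded figure quoted above.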