0:00
Throughout most of this course, we've talked about statistical inference for
genomics, but there's also this idea of statistical prediction or
machine learning for genomics.
So recall that the central dogma of inference is basically that we have a
population, and we use probability to draw a sample from that population.
Then, once we get that sample,
we try to say something about the global population.
So this is sort of a population level analysis.
By contrast, you can think of the central dogma of prediction as: you take
a sample from a population again and you build that into a training set,
where you have two different classes of observations that you're trying to
tell apart, and then you use that data to build a prediction function.
And so once you have that prediction function, if you get a new sample and
you don't know what color it is, that function assigns it to one of the two
colors based on some of its properties.
And so prediction is a somewhat different problem from inference, and
we haven't covered too much about it, but I wanted to cover just a little bit
about some of the key issues that often come up related to prediction in genomics.
0:55
So the first thing to keep in mind, is that inference and
prediction can give you very different answers, totally sensibly.
So here's an example. Suppose we want to test for
differences between the values from two different distributions,
and we collect a whole bunch of data.
If you do inference, and you ask are these two populations different?
In this case, they're definitely different from each other,
the distributions are very different from each other.
But they're not necessarily very predictive.
So imagine that I wanted to predict which of the two distributions the data
point came from.
If it came from sort of out here, you might be able to predict, oh,
it's maybe a little bit more likely to come from the light gray sample than
the dark gray sample.
But if it came from here, it's not very predictive at all.
It could just about equally well have come from either the dark gray sample or
the light gray sample.
1:41
On the other hand, this is another case where inference would definitely tell you
that there's a difference, just like it would in the previous example, but
here it's much more predictive.
Basically if you have any data point out here,
it's going to be easily assigned to one of the two distributions.
But a data point here in the middle might not necessarily be assigned, but
that's just a very small fraction of the cases.
So the first thing to keep in mind, is that in the case of inference we're
looking for differences that may or may not be predictive.
So if you do, say, a differential expression analysis,
you might identify lots of differences.
Many of those might not necessarily be good for prediction.
So the other thing to keep in mind is the quantities of interest.
So suppose that you're doing genomic tests and you have some disease that you
want to test for, then the quantities that you care about are the case where the test
says you have the disease and you actually do, that's a true positive.
Or the test says that you have the disease but
you don't have the disease, that's a false positive.
2:34
Or the case where the test says you do not have the disease and you actually do,
that's a false negative.
And then the case where the test says you do not have the disease and
you actually don't, that's a true negative.
So usually, people in genomics talk a lot about false positives when they're
talking about inference.
And they also talk about true positives.
But in prediction you need to sort of carefully balance how these
different potential categories work.
So here are some really simple definitions of the key quantities.
You might hear about the sensitivity of a test.
That's the probability that you get a positive test given that
you actually do have the disease.
Specificity is the probability of a negative test given that you don't
have the disease.
And then the positive predictive value is: if I do have a positive test,
how likely is it that I actually have the disease?
Same with the negative predictive value.
And then the accuracy is just the probability that you get the correct outcome.
That's sort of the sum of the true positives here and
the true negatives here divided by the total number of cases.
And so, here you're going to again define all of these sort of things in terms of
the true positives, false positives, false negatives, and true negatives.
So, for example, sensitivity is the TP / (TP+FN).
So these definitions that I'm showing you here, in terms of these quantities,
correspond to the probability definitions that you saw on the previous screen, so
that probability of a positive test given that you have the disease.
Here, TP + FN counts
all the cases where you have the disease.
And then you're looking at the fraction of the time where you actually identify them.
So that's TP / (TP+FN).
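As a quick sketch, all of these quantities can be computed directly from the four cells of the table. The function and variable names here are just for illustration, not from the course:

```python
# Sketch: computing the key screening quantities from the four cells
# of a 2x2 test-result-vs-disease-status table.

def screening_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, NPV, and accuracy."""
    return {
        "sensitivity": tp / (tp + fn),  # P(positive test | disease)
        "specificity": tn / (tn + fp),  # P(negative test | no disease)
        "ppv": tp / (tp + fp),          # P(disease | positive test)
        "npv": tn / (tn + fn),          # P(no disease | negative test)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Toy balanced example: every quantity comes out to 0.9
print(screening_metrics(tp=90, fp=10, fn=10, tn=90))
```

Note that sensitivity and specificity only condition on disease status, while the predictive values condition on the test result; that difference is exactly what drives the screening example below.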
4:24
Okay, so let's use an example.
This just illustrates how any kind of screening can be tricky, but
particularly genomic screening.
So, assume that there is a disease, and
only about 0.1% of the population has that disease.
And so we have a test that's 99% sensitive.
That is, if you have the disease, 99% of the time
it will say you have the disease.
And it's 99% specific.
So, that means that if you don't have the disease, then 99% of the time
it'll say that you don't have the disease.
So, that seems like a pretty good test.
So, the question is, what's the probability
of a person having the disease given the test result is positive?
In other words, what's the positive predictive value of this test?
So we're going to consider two cases, a general population where the rate of this
disease is 0.1% and then a higher at-risk sub-population.
So in the general population this is what it might boil down to.
So remember, it's a very accurate test.
So if you have the disease, 99% of the time it'll tell you that you have it.
And then if you don't have the disease,
99% of the time it'll tell you that you don't have it.
But these numbers are a little bit sort of unbalanced because almost no one has
the disease.
It's a highly rare disease, so if you actually go calculate the sensitivity and
the specificity, they're both very high, just like we expected.
But the positive predictive value is only 9%.
Why is that?
It's because you're testing a huge number of people who don't have the disease, so
even though only a tiny fraction of those tests come back falsely positive,
that's still a large number of people because you tested so many.
So, it turns out that the positive predictive value, that is,
the probability you actually have the disease when we tell you you have
the disease, is only 9%, which might not be that great for lots of reasons.
For one, we might give you all sorts of treatments you don't necessarily
want to get.
For two, you might be nervous or scared because we told you you have the disease,
even though it's actually kind of unlikely that you have the disease.
Even though the test is, in this case what a lot of people consider to be a really,
really sensitive and specific test.
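The 9% figure follows directly from Bayes' rule applied to the lecture's numbers; here's a minimal sketch (the `ppv` helper is just illustrative):

```python
# Positive predictive value by Bayes' rule:
# P(disease | +) = sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def ppv(sensitivity, specificity, prevalence):
    true_pos_rate = sensitivity * prevalence
    false_pos_rate = (1 - specificity) * (1 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)

# 99% sensitive, 99% specific test; 0.1% of the population has the disease
print(round(ppv(0.99, 0.99, 0.001), 2))  # -> 0.09, i.e. about 9%
```

The false positives dominate because 99.9% of the people tested are disease-free, so even a 1% false positive rate among them swamps the true positives.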
Now except for sort of rare disorders and some very specific variations, it's
very rare that you would get these numbers to be this high in a genomics experiment.
Typically the sensitivity and
the specificity are relatively low compared to what we're showing here.
6:42
And so this affects even what people consider to be widely used and
quite strong screening tests.
So, for example, when you're looking at mammogram screening,
particularly in young women or same thing for prostate cancer.
If you're sort of doing PSA screening in younger men, it turns out that when you do
this sort of screening, even though the test might be pretty good,
you're just testing so many people, most of whom who don't have the disease.
You'll get lots of false positives,
which will lead to some potentially serious consequences.
In particular the consequences tend to relate to how much money people
spend on downstream therapy, and how much difficulty they go through for
downstream therapy.
So one way to address this, particularly this is useful for genomics but
also other areas,
is to basically go to a population where there's a higher risk of the disease.
So in this case now, we have again a 99% sensitive and
specific test, but now we've gone to a situation where we're at
a higher risk of that disease in the population overall.
So you can see, now there's 10,000 potential people who have the disease,
compared to 100,000 who don't necessarily have the disease, so
the frequency of the disease in the population is higher.
And so if you do those calculations again, now you have a 99% sensitive and
specific test,
but you don't get overwhelmed by the fact that there are so
many more non-diseased people than diseased people, and
your positive predictive value stays pretty high.
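Using the counts from this scenario (10,000 people with the disease versus 100,000 without), a quick check shows how much the positive predictive value improves. This is a sketch of the arithmetic, not code from the course:

```python
# Same 99% sensitive / 99% specific test, now in a higher-risk group:
# 10,000 people with the disease and 100,000 without.
diseased, healthy = 10_000, 100_000

tp = 0.99 * diseased   # true positives among the diseased
fp = 0.01 * healthy    # false positives among the healthy
ppv_high_risk = tp / (tp + fp)

print(round(ppv_high_risk, 2))  # -> 0.91: PPV stays high at this prevalence
```

Compared to the roughly 9% PPV in the general population, targeting a higher-prevalence group raises the PPV to about 91% with the exact same test.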
So this is one example of the ways that it can be a little bit tricky to do screening
or to use genomics measurements for prediction.
8:28
And then this is sort of the idea that's underlying precision medicine.
Some of precision medicine is focused on sort of rare diseases and
Mendelian disorders, where it's a little bit more targeted, and
you tend to get much higher sensitivity and specificity.
But particularly for precision medicine for common complex diseases,
these are the issues that will come up.
And so far this has been a major challenge for
genomics; it's an open area where a lot of people are working.