An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

And in genomic studies since you usually don't fit one regression model,

you fit many regression models,

you also don't calculate just one p-value, you calculate many p-values.

And so you quickly run into the problem of multiple testing.

If you use the standard cutoff of 0.05 when calling a p-value significant,

then, since p-values are uniformly distributed when there is no real effect,

about 1 out of every 20 times you'll

still call a result statistically significant, even though there is nothing there.
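
To make the 1-in-20 claim concrete, here is a minimal simulation sketch. It is not from the lecture: it assumes a simple two-group z-test with known variance (a hypothetical setup chosen so that it runs with only the Python standard library), and the function name `null_p_value` is made up for illustration. Both groups are drawn from the same distribution, so every significant call is a false positive.

```python
import math
import random

def null_p_value(rng, n_per_group=10):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    # Assumption: the variance is known to be 1 in each group.
    z = diff / math.sqrt(2 / n_per_group)
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)
p_values = [null_p_value(rng) for _ in range(20_000)]
fraction_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction with p < 0.05 under the null: {fraction_significant:.3f}")
```

The printed fraction should land close to 0.05, which is what uniformly distributed null p-values imply.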

This was best illustrated by an xkcd cartoon.

So imagine that scientists are trying to investigate whether jelly

beans cause acne.

So the first thing that they might do is they run an experiment and

they ask: do jellybeans cause acne?

Well, when they first do the experiment they just compare jellybeans and

acne in a global population and find that the p-value's greater than 0.05,

so they can't call it statistically significant.

But then what they could do is they could try all the different colors,

they could try purple jellybeans, brown jellybeans, pink jellybeans.

Eventually, if they try 20 different colors, they expect, on average,

about one of those to have a p-value that's less than 0.05 just by chance.

And so, surprise, surprise, they find that green jellybeans are linked to acne.
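
The arithmetic behind the jellybean example can be sketched as follows. This is an illustration only, and it assumes the 20 tests are independent and that no color has any real effect; the function names are made up for this example.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests that fall below the cutoff."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha):
    """Chance that at least one of n independent null tests is below the cutoff."""
    return 1 - (1 - alpha) ** n_tests

jellybean_expected = expected_false_positives(20, 0.05)
jellybean_any = prob_at_least_one(20, 0.05)
print(f"expected false positives: {jellybean_expected:.1f}")
print(f"chance of at least one:   {jellybean_any:.2f}")
```

Under these assumptions the expected count is one, and the chance of seeing at least one "significant" color is roughly 64%.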

So this is one of the reasons why you might imagine seeing a lot of

strange headlines in the news where you find statistically significant results

that actually probably aren't real.

This could be due to the multiple testing problem.

So imagine you measure 10,000 genes, and you calculate 10,000 p-values.

If you called every gene

significant if it had a p-value less than 0.05, then even if no gene were truly different,

you would expect to get about 10,000 * 0.05 = 500 false positives.

So that seems like a lot, and

particularly if you only found say a couple of hundred genes to be significant,

then you would expect almost all of those to be false positives.

So using this p-value threshold of 0.05 is not necessarily

going to be useful when doing a genomics experiment.
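
A quick way to see the 500 figure is to simulate it. The sketch below is hypothetical: it assumes all 10,000 genes are null, so their p-values are drawn uniformly between 0 and 1, and it simply counts how many fall under 0.05.

```python
import random

rng = random.Random(1)
n_genes = 10_000

# Assumption: no gene has a real effect, so every p-value is uniform on [0, 1].
null_p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < 0.05 for p in null_p_values)
expected = n_genes * 0.05

print(f"expected false positives:  {expected:.0f}")
print(f"simulated false positives: {false_positives}")
```

The simulated count varies with the seed but stays near 500, all of it noise.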

So instead, we define a couple of different error rates that are more

commonly used in genomics.

The first is the family wise error rate.

So this is the probability of finding even one false positive among your

statistically significant results, no matter how many tests you do.

And so this is a very stringent error rate; it's basically requiring that

you not have even one false positive among all of these different tests.

Obviously that's going to be much more stringent than just calling p-values less

than 0.05 significant.

Another is the false discovery rate, and so this tells you something about

the noise among the discoveries that you're willing to tolerate.

So this is the expected value of the number of false discoveries divided by

the total number of discoveries, and so

you can think about it like this: imagine that you set the false discovery rate to 5%.

That means you expect about 5% of the total number of discoveries that you've

made to be false positives.

So this helps you quantify sort of the noise level among the discoveries that

you've actually made, rather than noise level quantified versus the total number

of things that you're testing.

So, here's the difference in interpretation.

Suppose that I tell you that 50 out of 10,000 genes

are statistically significant at the 0.05 level.

Depending on what 0.05 level I've used, you get a different interpretation.

So if I just say that, out of the 10,000 genes, I

called all genes statistically significant that had a p-value less than 0.05,

then I expect 0.05*10,000 = 500 false positives, as we just talked about.

And since I only found 50 genes significant,

that doesn't seem like it's very good; they're probably mostly false positives.

On the other hand, if I found these fifty at a 0.05 false discovery rate,

then I expect the number of false positives to be about 0.05 times the total number of discoveries I've made.

Because, remember, the false discovery rate quantifies the fraction of discoveries

that you've made that are likely to be false.

So I get an estimated 0.05 * 50 = 2.5 false positives among all of the genes

that I've called significant, so that's maybe a more tolerable rate.

If I use the family wise error rate, and I say that at the 0.05 family wise error rate

I found 50 genes that were significant, then that means that

I'm controlling the probability of even one false positive to be less than 0.05.

And so basically I expect almost all of these 50 genes to be

true positives.
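
The three readings of "50 out of 10,000 genes significant at the 0.05 level" can be put side by side. This is only a sketch of the arithmetic in the lecture; the function and key names are hypothetical.

```python
def interpret(n_tests, n_significant, level):
    """What the same 'level' implies under each of the three error rates."""
    return {
        # Uncorrected p-values: the level applies to every test performed.
        "uncorrected_expected_false_positives": level * n_tests,
        # False discovery rate: the level applies to the discoveries made.
        "fdr_expected_false_positives": level * n_significant,
        # Family wise error rate: the level bounds P(at least one false positive).
        "fwer_prob_any_false_positive_at_most": level,
    }

summary = interpret(n_tests=10_000, n_significant=50, level=0.05)
for name, value in summary.items():
    print(f"{name}: {value:g}")
```

With these inputs the uncorrected reading gives 500 expected false positives, the false discovery rate reading gives 2.5, and the family wise reading gives at most a 5% chance of any false positive at all.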

And so the way that you quantify these things in practice is with

the Bonferroni Correction for the family wise error rate, and

the Benjamini-Hochberg Correction for the false discovery rate.

So for the family wise error rate, basically the way that you do it is:

suppose that you were using a threshold of 0.05 for your p-values;

you instead use the threshold of 0.05 divided by the total number of tests you did.

So if it's 10,000 tests, you do 0.05 divided by 10,000 and

then you call all p-values less than that threshold significant.

That will control the family wise error rate.
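
The Bonferroni rule just described fits in a few lines. This is a minimal sketch with made-up function names and made-up example p-values, not code from the course.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff that controls the family wise error rate at alpha."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that falls below alpha divided by the number of tests."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p < cutoff for p in p_values]

# The lecture's example: 10,000 tests at the 0.05 level.
cutoff_10k = bonferroni_cutoff(0.05, 10_000)
print(f"cutoff for 10,000 tests: {cutoff_10k:g}")

# Hypothetical p-values, made up purely for illustration.
example_flags = bonferroni_significant([0.001, 0.02, 0.2, 0.6], alpha=0.05)
print(example_flags)
```

With four tests the cutoff is 0.05 / 4 = 0.0125, so only the first of the four hypothetical p-values is flagged.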

To control the false discovery rate,

suppose I want to control the false discovery rate at level alpha,

and suppose alpha is equal to 0.05.

Then what I do is I order the p-values from smallest to largest, so

P(1) is the smallest p-value, and P(m) is the largest p-value.

And then I say the p-value P(i) is significant if it is less than

alpha times i/m, so here that's 0.05*i/m.

So this is a cutoff that grows linearly with the rank i of the p-value.
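
Here is a minimal sketch of the Benjamini-Hochberg rule. One detail goes beyond the spoken description: the sketch follows the standard step-up form of the procedure, which finds the largest rank i whose p-value is at or below alpha*i/m and then calls every p-value up to that rank significant. The function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a significance flag per p-value, in the original input order."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: p_values[idx])

    # Find the largest rank i with P(i) <= alpha * i / m.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank

    # Everything at or below that rank is called significant.
    significant = [False] * m
    for idx in order[:largest_rank]:
        significant[idx] = True
    return significant

# Hypothetical p-values, made up purely for illustration.
flags_a = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.05)
flags_b = benjamini_hochberg([0.01, 0.03, 0.035, 0.5], alpha=0.05)
print(flags_a)
print(flags_b)
```

In the second example the p-value 0.03 misses its own cutoff (0.025) but is still called significant, because the third-ranked p-value 0.035 clears its cutoff (0.0375). That is the step-up behavior.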

So I can show you a quick example of that with just ten p-values, so

you can get an idea of what these cutoffs look like.

So just so it makes it a little bit easier to see,

I'm going to be using a p-value cutoff of 0.2, rather than the usual cutoff of 0.05.

And so here you can see the cutoff at 0.2, and so, if I use a cutoff of 0.2,

I'm going to call just these four p-values significant.

So I've got ten p-values here, ordered from smallest to largest,

and these four are called significant at an uncorrected cutoff level.

If I do a Bonferroni cutoff, then I'm going to take 0.2 and

I'm going to divide it by the total number of tests I did, which is ten.

So I now have a cutoff of 0.02 and

then I only call these two down here statistically significant.

Finally, if I want to do the Benjamini-Hochberg correction,

I can use the false discovery rate correction by taking the number 0.2 and

then multiplying by i, divided by m.

And so that turns into this line right here, which grows from

something down here at the small end, which is 0.2 times one, divided by ten,

so that's 0.02, which corresponds to the Bonferroni cutoff.

It grows to 0.2 here at the largest p-value,

because that's 0.2 times m divided by m.

And so here, you can see I call three things significant.

So this is the typical pattern that you see.

If you use an uncorrected p value, you call the most things significant.

If you use a false discovery rate correction,

you call the second most things significant.

And if you use a family wise error rate, that's the most conservative.
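
The ten-p-value comparison can be reproduced in code. The actual values on the lecture slide are not in the transcript, so the p-values below are hypothetical, chosen only so that the three cutoffs give the counts described above (four, two, and three).

```python
# Hypothetical p-values (the slide's actual values are not in the transcript).
p_values = [0.005, 0.015, 0.05, 0.15, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
alpha = 0.2  # the lecture uses 0.2 instead of 0.05 to make the plot easier to read
m = len(p_values)

# Uncorrected: compare every p-value to alpha itself.
n_uncorrected = sum(p < alpha for p in p_values)

# Bonferroni: compare every p-value to alpha / m = 0.02.
n_bonferroni = sum(p < alpha / m for p in p_values)

# Benjamini-Hochberg: largest rank i whose sorted p-value is <= alpha * i / m.
ranked = sorted(p_values)
n_bh = max(
    (i for i, p in enumerate(ranked, start=1) if p <= alpha * i / m),
    default=0,
)

print(f"uncorrected:        {n_uncorrected}")
print(f"Benjamini-Hochberg: {n_bh}")
print(f"Bonferroni:         {n_bonferroni}")
```

The ordering matches the lecture: the uncorrected cutoff calls the most p-values significant, the false discovery rate correction the second most, and the family wise correction the fewest.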

But again, remember for each of these different cutoffs I get a totally

different interpretation of the number of false positives, so

the error rates are actually different.

So the type I error rate is different from the family wise error rate,

which is different from the false discovery rate.

And since they measure different things,

you get slightly different numbers of things that you call significant.

So typically just using the raw uncorrected p-values is only done in

genomics experiments if you've only measured one thing,

and that's very rare in genomics experiments.

Family wise error rate is typically used when you expect to find only a small

number of discoveries and you really have to be sure that they're

very unlikely to be false positives.

So that's typically used in genome-wide association studies in genetics,

which are also called whole-genome association studies.

False discovery rate is typically used for quantitative measurements, things like DNA

methylation or ChIP sequencing, or if you measure gene expression variation.

And so that's typically used in the case where you expect to see

a relatively large number of discoveries, and

you want to quantify the percentage of them that are false positives.

All of these methods rely on the p-values being correct.

In other words, that you fit an appropriate model.

So if you haven't fit the right model, say the linear model isn't a good fit, or

you haven't accounted for batch effects when they exist, or

you haven't accounted for any of the other effects, then the p-values won't be right,

and then none of these methods will actually work well.

And so, as a great first read, and a great first introduction to these types of error

rates, I've linked here to a paper on genome-wide significance

that is a really good introduction, and a gentle tutorial,

to this idea of multiple testing in high dimensional settings.