An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

116 ratings

From this lesson

Module 4

This week we will cover many of the general pipelines people use to analyze specific data types such as RNA-seq, GWAS, ChIP-seq, and DNA methylation studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Once you fit a statistical model and you've identified those genes or

those features that are statistically significantly associated

with the phenotype you care about after correcting for multiple testing,

you might want to identify whether there's some biological pattern to those genes or

to those features that you've identified that are differentially expressed.

So again I'm going to go back to this example where we're trying to predict

the response to lenalidomide in myelodysplastic syndrome.

So again we find 47 genes that

are differentially expressed at a false discovery rate of 10%.

And so you can see, for example, that there appears to be some

genes that have something in common here near the top of this list of

differentially expressed genes but is there a way that we can quantify that?

So one way that you can do that is you can take the statistic for

every gene that you calculated, and you can order them from largest to smallest.

Alternatively, you can order them from the smallest p-value to the largest p-value.

And so over here are the most statistically significant associations and

over here are the least statistically significant associations.

Then you can take some gene set that you care about and

label all the genes that are in that gene set.

In this case, I've made them red.

So what you can do is then you can calculate a running statistic

that goes up every time you have a gene in the gene set and

goes down every time you have a gene out of the gene set.

And so what you can see is, if all of the genes that are in the gene set cluster

near the most statistically significant values, then you'll see many more

values that go up than values that go down, and you'll get a high peak here.

And so the statistic here is actually the maximum deviation from zero.

That's the gene set enrichment statistic.

This is related to something called the Kolmogorov-Smirnov statistic if

you know a little bit more about advanced statistics.
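
As a rough sketch of the running statistic described above, the Python below orders genes from largest to smallest statistic, steps up for genes in the set and down for genes outside it, and reports the maximum deviation from zero. The function name `enrichment_score` and the step sizes (1/number in set, 1/number out of set, so the profile ends at zero) are assumptions for illustration, not necessarily the exact weighting used in the lecture.

```python
import numpy as np

def enrichment_score(stats, in_set):
    """Unweighted running-sum enrichment statistic (a KS-like sketch)."""
    stats = np.asarray(stats, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    order = np.argsort(-stats)            # largest statistic first
    hits = in_set[order]                  # True where the ranked gene is in the set
    n_hit = int(hits.sum())
    n_miss = hits.size - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene in the set and one outside it")
    # Assumed step sizes: up by 1/n_hit for set genes, down by 1/n_miss otherwise
    steps = np.where(hits, 1.0 / n_hit, -1.0 / n_miss)
    running = np.cumsum(steps)
    # Maximum deviation from zero, keeping its sign
    peak = running[np.argmax(np.abs(running))]
    return peak, running

# Toy example: the two set genes have the two largest statistics
es, running = enrichment_score([5, 4, 3, 2, 1, 0],
                               [True, True, False, False, False, False])
print(es, running)
```

When the set genes cluster at the top of the ranking, the profile climbs early and the peak is large; when they are scattered, the profile wiggles around zero.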

And so the idea here is that we want to identify whether this enrichment

is statistically significant, that is, whether it's more than we would expect to see by chance.

So one way that people do that is they again permute the sample labels.

We've permuted the responders and the non-responders.

And now we get the new set of labels.

And so, once we get the new set of labels, we can recalculate the statistics and

reorder them.

And so now we see that the genes that belong to the gene set are a little bit

more scattered throughout this profile and so

you see that the profile goes down and then up and then down and then up.

It wiggles a little bit more but it doesn't deviate from zero as far and so

there appears to be less of an enrichment of those values.

So you can recalculate for several permutations the value of this gene set

statistic, and then you can again calculate a p-value for each gene set category

based on how often the permuted values are at least as extreme as the observed value.
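
The permutation step can be sketched roughly as follows, on simulated data. Everything here is an assumption for illustration: the helper names, the Welch-style per-gene statistic, the number of permutations, and the add-one p-value formula are not taken from the lecture.

```python
import numpy as np

def max_deviation(stats, in_set):
    """Absolute maximum deviation from zero of the running sum."""
    hits = in_set[np.argsort(-stats)]
    steps = np.where(hits, 1.0 / hits.sum(), -1.0 / (~hits).sum())
    return np.max(np.abs(np.cumsum(steps)))

def gene_stats(expr, labels):
    """Per-gene Welch-style t-like statistic (assumed choice of statistic)."""
    a, b = expr[:, labels], expr[:, ~labels]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                 b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def permutation_pvalue(expr, labels, in_set, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = max_deviation(gene_stats(expr, labels), in_set)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(labels)   # permute responder labels
        null[i] = max_deviation(gene_stats(expr, shuffled), in_set)
    # Fraction of permuted values at least as extreme as the observed one
    # (add-one correction so the p-value is never exactly zero)
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p

# Simulated data: 200 genes, 10 responders and 10 non-responders
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 20))
labels = np.array([True] * 10 + [False] * 10)
in_set = np.zeros(200, dtype=bool)
in_set[:15] = True
expr[:15, :10] += 2.0                        # set genes shifted in responders
observed, p = permutation_pvalue(expr, labels, in_set)
print(observed, p)
```

Permuting the sample labels, rather than the gene labels, keeps the correlation between genes intact in the null distribution.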

And so you can calculate a p-value for each of the gene sets and

then again do a false discovery rate correction and identify gene sets that are associated

with those statistically significant results.
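
One common choice for the correction across gene sets is the Benjamini-Hochberg procedure; the lecture does not say which procedure is used here, so treat this as an assumption. A minimal sketch, with made-up gene set names and p-values:

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working back from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Hypothetical gene sets and permutation p-values, for illustration only
set_names = ["set_A", "set_B", "set_C", "set_D"]
set_pvalues = [0.001, 0.02, 0.03, 0.5]
adjusted = benjamini_hochberg(set_pvalues)
for name, q in zip(set_names, adjusted):
    print(name, q)
```

Gene sets whose adjusted value falls below the chosen false discovery rate would then be reported as enriched.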

So what are the gene sets you can look at?

The Gene Ontology Consortium has a large ontology of gene sets that are based on

their function and based on their spatial location within the cell and so forth.

You can also look at molecular signatures that have been curated.

For example, there's the set of molecular signatures that you can get from the

MSigDB database.

Or you can look at things like interactions between proteins and

then see is there an enrichment for a particular set of interactions among

the genes that you found to be differentially expressed.

Really, any previously defined set of genes that has some

function that you care about can be used for a gene set enrichment analysis.

So one thing to keep in mind is this can be very hard to interpret especially if

the categories are broad or vague.

So for example, if you get a category that comes out as transcriptional regulation,

that's a very broad category, there's lots of different subcategories of that.

And so if that's enriched, it's not clear how much added value it's giving you.

It's better if you can find specific, concrete categories that are enriched.

Here, if you're not very careful, you can end up telling stories that the data don't support, so

again you have to correct for the multiple testing problem and

you have to be very aware of your own implicit biases.

This incurs a second multiple testing problem, like I said, on top of

the multiple testing problem involved in identifying differentially

expressed genes.

Now you're testing multiple gene sets, and so you have to account for

that as well.

This idea can actually be simplified.

The statistic I showed you here, this gene set enrichment

statistic, can be simplified into basically a very simple t-statistic

comparing the genes that are in the set to the genes that are out of the set and so

you can read about that here in this paper.
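
A minimal sketch of that simplification, assuming a Welch-style two-sample t-statistic on the per-gene statistics (the exact form in the paper may differ); the function name `set_t_statistic` and the toy values are hypothetical:

```python
import numpy as np

def set_t_statistic(stats, in_set):
    """Two-sample t-like statistic: genes in the set vs. genes outside it."""
    stats = np.asarray(stats, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    inside, outside = stats[in_set], stats[~in_set]
    # Welch-style standard error (assumed; unequal variances allowed)
    se = np.sqrt(inside.var(ddof=1) / inside.size +
                 outside.var(ddof=1) / outside.size)
    return (inside.mean() - outside.mean()) / se

# Toy per-gene statistics: the first three genes belong to the set
gene_statistics = [3.0, 2.5, 3.5, 0.1, -0.2, 0.3, 0.0, -0.4]
membership = [True, True, True, False, False, False, False, False]
t = set_t_statistic(gene_statistics, membership)
print(t)
```

A large positive value suggests the genes in the set tend to have larger statistics than the genes outside it.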