An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

131 ratings

From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Permutation is one of the most widely used tools for

assessing statistical significance in genomic studies, so

I thought I'd explain the principle behind permutation and

then we'll talk about how you calculate permutation p-values in a future lecture.

So here we're looking at the erythroid differentiation signature

that predicts response to lenalidomide in myelodysplastic syndrome, and again you

have to be a little impressed that I got some of those words close to right.

So in this study, you again have some genes that you measured, so

gene expression that you measured, in many of these rows, and

you have patients that you've collected, so there's a number of patients.

And some of them are responders, and some of them are nonresponders.

And so the idea is that you're going to be comparing the responders to

the nonresponders for every gene.

And you might calculate a statistic.

For example, you might calculate the t-statistic: the difference in

mean expression level between the responders and the nonresponders, divided by, or

standardized by, some measurement of their variability.
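
The statistic just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lecture does not say which variability estimate is used, so the unequal-variance (Welch) standard error here is my choice, and the names `row_t_statistics`, `expr`, and `is_responder` and the toy data are hypothetical.

```python
import numpy as np

def row_t_statistics(expr, is_responder):
    """t-statistic for every gene (row): difference in group means
    divided by a Welch-style standard error (an assumed choice)."""
    expr = np.asarray(expr, dtype=float)
    is_responder = np.asarray(is_responder, dtype=bool)
    resp = expr[:, is_responder]      # columns for responders
    non = expr[:, ~is_responder]      # columns for nonresponders
    mean_diff = resp.mean(axis=1) - non.mean(axis=1)
    se = np.sqrt(resp.var(axis=1, ddof=1) / resp.shape[1]
                 + non.var(axis=1, ddof=1) / non.shape[1])
    return mean_diff / se

# Toy data (simulated, not from the study): 5 genes, 12 patients.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 12))
is_responder = np.array([True] * 6 + [False] * 6)
expr[0, is_responder] += 3.0          # gene 0 is higher in responders
t_stats = row_t_statistics(expr, is_responder)
```

Each entry of `t_stats` is one gene's statistic; the gene that was shifted upward in responders should stand out with a large positive value.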

So now that we have this statistic,

we want to know how extreme it would be if there were no relationship at all.

And so one way to break the relationship between the response and

the gene expression levels is to permute the labels.

So one thing that we could do is we could randomly scramble the labels.

So that's what we've done.

When we moved from the left hand side to the right hand side here,

we just completely scrambled the labels totally randomly.

And when you do that, there should now no longer be a relationship between

the response variable and

the gene expression measurements, because we've assigned the labels at random.

And it turns out that it's better to permute the labels

than to permute each gene's expression levels separately, because it leaves

the relationships among the genes' expression levels intact.

And that's good, because you might need to model

those relationships later on in the modeling process, which we'll talk about later.
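
To make the contrast concrete, here is a small sketch comparing the two ways of scrambling. The data are simulated and every name (`labels`, `expr`, `scrambled_expr`, and so on) is hypothetical; the first two genes are built to be correlated so there is a gene-to-gene relationship to preserve.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.array(["responder"] * 4 + ["nonresponder"] * 4)

# Genes 0 and 1 share a common signal, so they are correlated across patients.
shared = rng.normal(size=8)
expr = np.vstack([shared + 0.1 * rng.normal(size=8),
                  shared + 0.1 * rng.normal(size=8),
                  rng.normal(size=8)])
gene_corr_before = np.corrcoef(expr)

# Option 1: scramble only the labels. The expression matrix is not touched,
# so the gene-to-gene correlations are exactly what they were.
permuted_labels = rng.permutation(labels)
gene_corr_after_label_perm = np.corrcoef(expr)

# Option 2: scramble each gene's values independently. Every row keeps its
# own values, but the patients no longer line up across genes.
scrambled_expr = np.vstack([rng.permutation(row) for row in expr])
gene_corr_after_scramble = np.corrcoef(scrambled_expr)
```

Comparing `gene_corr_before[0, 1]` with `gene_corr_after_scramble[0, 1]` shows what is at stake: the first option leaves that correlation unchanged by construction, while the second will generally alter it.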

And so the idea here is that you do this permutation and

then you recalculate a statistic for each gene.

So if you calculated the original statistic, say, for gene one, and

it was equal to 2, that would be where the original statistic is.

Then you permute the labels many times, and each time you recalculate the statistic.

You'd expect the permuted statistics to be centered near 0, because there should be, on average,

no difference between the two groups once you've permuted the labels.

And you can see how extreme this statistic is with respect to those permuted

statistics.

And if it's really extreme you might think, oh, well then it's not likely that

this statistic comes from this distribution, and if it's not very extreme

you think, oh, well it might be coming from that distribution.
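
The comparison described above can be sketched for a single gene as follows. This is a rough illustration rather than the procedure from the study: the data are simulated, the number of permutations is arbitrary, and the names `t_statistic`, `permuted_statistics`, and `frac_as_extreme` are hypothetical. The formal permutation p-value calculation is left for the later lecture; the last line only reports how often a permuted statistic is at least as extreme as the observed one.

```python
import numpy as np

def t_statistic(x, is_responder):
    """Difference in means over a Welch-style standard error (assumed choice)."""
    a, b = x[is_responder], x[~is_responder]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

def permuted_statistics(x, is_responder, n_perm=1000, seed=0):
    """Recompute the statistic after randomly scrambling the labels n_perm times."""
    rng = np.random.default_rng(seed)
    return np.array([t_statistic(x, rng.permutation(is_responder))
                     for _ in range(n_perm)])

# Simulated gene: 10 responders, 10 nonresponders, responders shifted up.
rng = np.random.default_rng(42)
is_responder = np.array([True] * 10 + [False] * 10)
x = rng.normal(size=20)
x[is_responder] += 2.0

observed = t_statistic(x, is_responder)
null_stats = permuted_statistics(x, is_responder)

# Share of permuted statistics at least as extreme (in absolute value)
# as the observed statistic.
frac_as_extreme = np.mean(np.abs(null_stats) >= abs(observed))
```

A histogram of `null_stats` with a vertical line at `observed` gives the kind of picture described above: permuted statistics piled up around 0 and the original statistic out in the tail.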

So this permutation idea is used all the time in genomics.

It's used not just for simple comparisons but also for network comparisons and

for enrichment comparisons, all over the place.

And it assumes that if you switch the labels the data come from the exact same

distribution.

So by permuting the labels we're sort of making the assumption that the labels

don't matter.

That is, each gene's expression levels are completely independent of the labels.

And it's not necessarily just a comparison of means.

So the statistic we permuted,

the t-statistic, measures a distance between the two means.

But by permuting the labels,

we're assuming that the whole distribution is exactly the same in the two groups.

So with this permutation approach, a significant t-statistic can reflect

any difference between the groups, including a difference in the variance or

in any of the other moments of the generating distribution.
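
One way to write this distinction in symbols is shown below. The notation is my own and does not appear in the lecture: $F_R$ and $F_N$ stand for the distributions of a gene's expression in responders and nonresponders, and $\mu_R$ and $\mu_N$ for their means.

```latex
H_0^{\text{perm}}:\; F_R = F_N
\qquad\text{versus}\qquad
H_0^{\text{mean}}:\; \mu_R = \mu_N
```

Equal distributions imply equal means, equal variances, and equal higher moments, but equal means alone do not imply equal distributions. So rejecting the permutation null says the two distributions differ in some way, not necessarily that the means differ.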

So permutation is actually quite a complicated topic.

We've covered it just very briefly here,

we'll cover it a little bit more in the assessments.

But you can learn a little bit more about it in this advanced statistics for

the life sciences course.