An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

176 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

One of the most common things that you do when doing a genomic data analysis is

you want to compare the fit of two models.

So, let me explain why that's most commonly come up.

So, imagine that you're fitting a regression model.

So, Y is some genomic measurement that you care about, and

you're fitting a relationship here that's a linear relationship of a phenotype and

say a batch variable that you're going to be adjusting for.

So, when you're fitting a regression model for genomic data, you're often not fitting

one regression model, you're often fitting many regression models.

And remember this colored blocks version of that regression model framework.

So here we have a genomic data set where going down the rows here are the genomic

features we measured and going across columns are the samples we have.

And so here in one particular row are the measurements say for one gene.

And we model that as a function of the coefficients for

that gene, multiplied by some primary variable, say the phenotype that you care

about plus some other set of coefficient times some adjustment variables.

Here this might be the batch effects, plus independent random variation.

So what we're trying to do here is we're trying to identify,

is there an association between these primary variables and the genomic data?

And so we might imagine fitting a couple of different models.

The first model would be that one,

where you've included all the different variables.

The null model in this case might be the case where there's no relationship between

the primary variables and the data, and so this term goes away.

In that case you end up with a simplified model that looks like this.

You're modeling the gene expression or

the other genomic data as a function of the batch variable alone.

So the phenotype is no longer included in the model.

And here we get different estimates, so I've indicated those with stars

because we're fitting a different model in this case.

So this model looks like this with the colored blocks.

You can see for a particular gene which is a row in this matrix here.

You get a particular row of coefficients times some adjustment variables

plus the random variation for that row.

But you no longer have the phenotype data because that's

not included in the null model.

So now we want to compare these two model fits and

see if one does better than the other.

So as a simple example, I'm going to look at a case where we're looking at various

different samples that are assigned to three different categories for

cancer status.

Biopsy, cancer, and normal.

And so for each of these different categories you get a gene expression

measurement for each gene.

So for one of those genes, I'm plotting the measurements across samples, so

each dot is a sample here.

So you can see, you can, the null model might be, say for this case,

that they're all exactly equal in expression.

And so the mean level is the same for all of them.

And so, in that case, you would fit a mean model that's just the average across

all of the different samples.

The alternative model then might include the cancer status in the model.

The phenotype in the model.

And so now you get a mean level for each of the different cancer statuses.

So the question that we might ask is, does this model,

the model where we include cancer status, fit much better than this model,

the one where we just fit an overall average?

So the question is how do we come up with the statistic to quantify that?

So one way that you can do that is with the residuals.

So here, the residual is defined as the actual data point, but

subtracting the model fit.

So in this case, model one is, we're, actually the alternative model we're

going to fit, is the one that includes both the phenotype and the batch variable.

So we're going to fit that model and we're going to calculate the residuals.

Basically subtract the model fit from the observed values.

For model two, we can do the same thing.

So we can take the observed data and

we can subtract off the model fit where we only included the batch variable.

So now we have two measurements of how close are the model fits to the real data.

You can actually look at that in the data example that I've shown.

So here I'm actually showing you the residuals for

the model fit where you just fit an overall mean, and

the residuals where you fit, case where you fit a mean for each different level.

So, here you can see that the residual sum of squares or

the distance from the actual values to the model fit value squared,

when you sum it up over all of the different values is equal to 22.05.

And here when you actually fit the model that includes a term,

an average in each of the different cancer statuses,

you actually get a lower value of this residual sum of squares.

Now, you should always expect that.

Whenever you include more variables in the model,

the residual sum of squares will go down.

It just doesn't go down enough to make us suggest that the model fit

is a better fit.

So this is the statistic that people use to quantify that.

So this is the commonly used F statistic.

So here are these RSS terms are the residual sum of squares where it's either

under the null model if there's a 0 or an alternative model if there's a 1.

So here we're taking the difference between the sum of square fits for

the null model and the alternative model.

And what does that mean?

So we know that the residual sum of squares under the null model,

the model where we don't include the phenotype variable,

will always be bigger because we've included fewer variables in that model.

So this term on the top is always positive and it quantifies basically how much

the residual sum of squares shrinks by including that variable in the model.

Then we've standardized that to the units of the alternative model fit.

So basically we're saying the alternative model fit

includes all the variables we might fit.

How many units of residual sum of squares

did we change by going from this null model fit to the alternative model fit?

And then there's this term over here in the front and

this basically quantifies the difference in the number of variables.

And this standardizes it so that when we change the number of variables we include

in the null and alternative models it standardizes this difference here.

So again here we have n the total number of samples that we have.

p1 is the total number of parameters in the model, the alternative model.

And p0 is the total number of parameters in the null model.

And so you can see here, we're dividing this difference here,

RSS0 minus RSS1 by the difference in the number of parameters in those two models.

And similarly, we're doing something down here.

We're standardizing this RSS1 by the total number of observations

minus the total number of parameters in that model.

So this standardization allows us to come up with a standard form for

the distribution of this statistic, just like standardizing to standard deviation

units for the t statistic, gave us that similar form.

So we can use this statistic to quantify, does the model fit for

the null model, fit substantially worse than the null model fit for

the alternative, or the model fit for the alternative model?

So again this can be moderated just like the t-statistic can.

So sometimes that residual sum of squares of alternative

model is actually like very small.

And actually that happens quite a bit in genomic data and so

you can again moderate the statistics so that you don't get the blow up of

the statistic for very small variability there.

Again, the Linear Models for Microarray Data paper has a very good

explanation of how these linear models are fit and

what a net statistic is as does the edge package vignette in Bioconductor.

So the edge package is one package that's very useful for

fitting these null and alternative models and comparing them,

especially if you don't care about the moderated part of the statistic.

This is just like the direct comparison that can be done with the edge package,

and the vignette very carefully explains how these two models are fit.