An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

123 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

Often the outcome isn't a continuous variable, and so

fitting a standard linear regression model isn't the right way to go.

I'm going to talk about two specific examples here.

But the first I'm going to do is, as always, I'm going to set up my

I'm plotting parameters to be like how I like them to be,

and then I'm going to load some libraries that we're going to need.

Specifically here, we're going to be looking at the snpStats and

the EC2 packages, but the broom package will be helpful for cleaning things up.

And so, the first thing I'm going to do is,

I'm going to load this snpdata that we have from the snpStats package.

And again, I'm going to subset to a smaller set of samples so

that I can manage it a little bit easier for this example.

And so I'm going to take that subset and I'm going to reduce the data set size.

Okay, so the first thing that I was going to do,

is I was going to compute the principle components,

just like I did previously, this takes a couple of immediate steps.

So first I do the right transformation of the data to get the xx transpose component

and then I calculate then eigenvalues or eigenvectors of that data set.

And then once I've done that, then I can calculate the PCs.

So now I've got the PCs matrix.

So this is what I'm going to be using for adjustment.

We learned about that in a previous lecture.

So now I can subset out, I removed this snpdata so

it's actually stored in RF format that's a little bit easier to process on.

But you might want to, for the purposes of this, just take the data set out so

that we can deal with it on a scale that's a little bit more manageable for us.

And so the first thing I'm going to do is, I'm going to look at that snpdata, so

I have data on the following snps, for the following people, and

then I can get the status.

So the case control status for the people is in the subject support data set.

So that's the case control status.

So I have the status for the thousand people there.

And so then I can take the first snp and I can extract the data for that snp.

So now I have the first snp and the status for that person.

And I want to fit a generalized linear model relating the two things.

So the first thing I need to be aware of is that the way that the snpdata is coded,

shows that there's some 0 values.

0 values are actually meant to be missing.

So I'm going to recode those values as missing values, so

R will know what to do with them.

And then I can fit a GLM model, using the glm command,

just like I used the l command linear model.

So I'm going to relate the status, remember that's either a 0 or a 1, or

a case control to the snp1 variable.

So, basically relating the status to the snp on an additive scale.

And then to make it via logistic regression model,

I need to tell it to fit the family equals binomial.

So now that model has been fit, and I can tidy it up.

Now I get the estimate.

This is basically the per allele change in the log odds ratio

for each additional allele, so the change in the odds ratio for the risk.

And then what I can do is, I can actually fit different models.

So that's the additive model.

If I want to fit the dominant model, I can just change how I code that variable,

so I basically require it to have, only if there's either one or

two copies of the minor allele then you have risk.

And so that's going to be with the cases where snp1 is equal to two or three and

then I can fit the dominant model by fitting the next glm, where instead of

using the snp variables, the covariant, I fit this dominant dummy variable.

And so then I get a slightly different estimate for

that value because I'm fitting a different model and so

it's estimating something a little bit different.

The other thing I can do is, I can just directly adjust for things like

the principal components, which is what people often need to do because it's

a typical confounder in the population structure in these sorts of models.

In this case I've just added to my model, my term for the principal components.

And so then I can look at that model, and it's got estimates for all the principal

components which I'm not as interested in, and then the estimate for

the snp variable now adjusted for the principal component values.

If I want to do many of these, you can actually do that quickly in R as well.

So you can do that by using this snp.rhs.test function,

which will on the left-hand side should take the case control status or

whatever your outcome is.

On the right-hand side is adjustment variables,

this is an unadjusted model you just tell it to regress with the 1 on that side.

And then you tell it where the snpata is and it fits many glms.

So if we look at what are the names of that object?

It's an s4 class so it has a slightly different way of accessing it.

We can look and

see that one of the things that you get out of that is the chi squared statistics.

So this fits sort of a version of the generalized linear model, but

calculates the chi squared statistic.

And so we can plot those chi square statistics here versus what they

would expect.

So this is again a QQ plot, so this is the quantiles of the expected chi square

distribution versus the observed chi square distribution.

You see it's not the same.

You would expect there to be very few signals in the snpdata sets, so

you'd expect those actually to not really fall very far from the expected quantiles.

So we can fix that by looking at the adjustment.

So here's we're going to basically fit the same model,

only now we're going to adjust for the pcs.

So here we've got the status tilt of the pcs.

So this is the adjustment variables, with the same data set, so we fit that model.

And then we can look at the chi

squared values for this one.

And let's see here.

You can see that they fit much more closely along the quintiles that you would

expect once you adjust for basically the principle components or

the population structure.

So that's how you can fit generalized linear models for

logistic regression models.

And then for Poisson regression models, we can use a different data set.

We're going to here use an expression data set.

The bottomly expression data set, and then what we can do is,

we're going to filter out those cases where there aren't very many counts.

And so once we filtered out those cases, we can just directly fit a Poisson model,

say gene by gene.

And the way we would do that, is with the same glm command that used for

our logistic progression model.

But now instead of telling it family equals binomial,

we're going to say family="poisson", okay?

So now this will will fit a Poisson regression model.

So now this tells us something about on the log scale, this is going to tell us,

sort of the,

e raised to this power tells us something about the multiplicative difference.

And this is something about, basically the additive log on the log scale,

the additive difference between the two strains.

And so you can actually fit a negative binomial model, because this model

is again a little bit more constrained in the mean variance relationship.

And so you can fit a single negative binomial

model with the glm.nb function, which will then relate

The expression data for the first gene using a negative binomial model,

which is a little bit more flexible, it models the variance.

And so you see that you get relatively similar estimates for the estimates, but

the standard errors are actually a little bit bigger in the negative binomial model

because it's accounting for that additional variability.

So, you can actually do many negative binomial regressions at once,

using the DeSeq2 package.

And so one way that we do that is, we have to build a DESeqDataSet, and

so you can do that by passing this function, rather long named function,

DESeqDataSetFromMatrix function, that's a mouthful.

You can pass it the expression data and the phenotype data.

And then what the model fit you want it to be.

In this case, we just want to fit an association with strain.

And then you can fit all the negative binomial models at once by

doing DESeq to that data set.

So, it's going to estimate those.

And it's going to estimate, a smooth relationship, between between the mean and

the variance, which is going to then account for when it's doing the analysis.

And then if you want to get the results out,

you can actually just use the results function.

And then we can look at the statistics that are resulting from all of those fits.

So these are the statistics now.

This stat is for the variable that you tested,

in this case the strain was the model matrix, and so it's going to be

this is the coefficient on strain from all of those negative binomial regressions.

So that's how you can do GLM model fitting, both for Poisson regression and

logistic regression in R.