An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

In genomics it's often the case that you don't

want to fit just one regression model.

You want to fit many regression models, so

I'm going to show you a little bit about how to do that.

So I'm going to, again, set everything up just like I like it in terms of plotting,

and then I'm going to load the data set and the packages that we need.

So in this case we need these packages, in particular the limma

and edge packages, which are two packages that we'll be talking about a lot

during this lecture.

So, now I'm going to load the Bottomly data,

which is a data set that compares two different mouse strains.

So we're going to talk a little bit about how you fit those regression models, but

the first thing that I'm going to do is I'm going to basically remove the lowly

expressed genes, just like I have in the past.

I'm going to take the log transform and remove

those that aren't above a particular threshold, so I'm left with 1,000 genes.

So I want to fit 1,000 regression models.
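The filtering step described here can be sketched roughly as follows. This is a minimal illustration on simulated counts rather than the Bottomly data. The object names (`counts`, `edata`, `edata_filt`) and the threshold of 8 are assumptions made for the example, not values taken from the lecture.

```r
# Simulated stand-in for a count matrix: genes in rows, samples in columns.
set.seed(1)
n_genes <- 2000
n_samples <- 21
gene_means <- rexp(n_genes, rate = 1 / 500)
counts <- matrix(rpois(n_genes * n_samples, lambda = rep(gene_means, n_samples)),
                 nrow = n_genes)

# Log transform (adding 1 so that zero counts are defined).
edata <- log2(counts + 1)

# Keep only genes whose average log expression exceeds a threshold.
# The value 8 is an illustrative choice, not the one used in the lecture.
keep <- rowMeans(edata) > 8
edata_filt <- edata[keep, ]

dim(edata_filt)
```

Each remaining row of `edata_filt` is one gene, so the number of rows is the number of regression models that will be fit.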

So the first thing that you can do is you could fit this by

basically using the standard lm.fit function in base R.

So the thing I need to do is I need to build a model matrix.

It's going to be the same one for every case.

So this is a common example in genomics where you have

a common design matrix that's the same for every single gene,

basically because you're going to fit the exact same model for every single gene.

And so then you can fit all of those models with the lm.fit function and

you can just pass it the model that you just created, and

the expression data set that you have.

So the result is a set of coefficients and residuals and

effects from all of those regression models.
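A minimal sketch of fitting every gene against one shared design matrix with base R's `lm.fit`. The data here are simulated, and the names `edata`, `strain`, and `mod`, as well as the strain labels "A" and "B", are hypothetical stand-ins for the expression matrix and phenotype data used in the lecture.

```r
# Simulated expression matrix: 1000 genes (rows) by 20 samples (columns).
set.seed(2)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)

# One design matrix, shared by every gene.
mod <- model.matrix(~ strain)

# lm.fit expects samples in rows, so the expression matrix is transposed.
# Each column of the response is treated as a separate regression.
fit <- lm.fit(mod, t(edata))

names(fit)
dim(fit$coefficients)  # one column of coefficients per gene
```

With this layout, `fit$coefficients[, 1]` holds the intercept and strain estimates for the first gene.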

And so if I look at the coefficients from the first model fit,

you can see that the coefficients matrix is actually very big,

it's got the coefficients from many different model fits, but

if you look at the first one, and then you compare that to

fitting the linear model to that gene individually with lm,

then you can see that you get

exactly the same coefficient estimates.

Just like you'd expect.

So this fits one model at a time.

This fits many models at a time.

So the nice thing about lm.fit is it's much, much faster.

If you wanted to fit every single model with lm,

it would take a long time to do that model fitting.

And it turns out that you get basically the same thing from either one.

So lm.fit is much faster when you're dealing with multiple regressions.
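The comparison between one call to `lm.fit` and a loop of `lm` calls can be sketched like this. The data are simulated and all object names are hypothetical. The timings depend on the machine, so no particular numbers are claimed here.

```r
# Simulated expression matrix: genes in rows, samples in columns.
set.seed(3)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)
mod <- model.matrix(~ strain)

# Many models at once.
time_all <- system.time(fit <- lm.fit(mod, t(edata)))

# One model at a time, looping over genes with lm.
time_loop <- system.time(
  coef_loop <- sapply(seq_len(n_genes),
                      function(i) coef(lm(edata[i, ] ~ strain)))
)

# The two sets of coefficient estimates should agree up to numerical tolerance.
same <- all.equal(unname(coef_loop), unname(fit$coefficients))
print(same)

print(time_all)
print(time_loop)
```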

So then, the next thing that I can do is look at the distributions.

Since I've fit many regression models I can look at the distribution

of coefficients, to see if I see anything interesting or funny about them.

So first I can look at the distribution of the coefficients for the intercept.

So you can see that they tend to be positive here, and

that's because we're modeling count data, and

then the second thing that you can do is you can look at the strain coefficients.

And so you can see those tend to be centered around zero.

And so the further from zero they are,

the more likely there is to be an association with strain.
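Looking at the distributions of the fitted coefficients might look something like the sketch below. It uses simulated data with no true strain effect, so the strain coefficients should scatter around zero. The object names and plotting options are illustrative choices, not the ones from the lecture.

```r
# Simulated expression matrix and a two-group design.
set.seed(4)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)
mod <- model.matrix(~ strain)
fit <- lm.fit(mod, t(edata))

# Row 1 of the coefficient matrix is the intercept, row 2 the strain effect.
par(mfrow = c(1, 2))
hist(fit$coefficients[1, ], breaks = 100, col = 2,
     xlab = "Intercept", main = "")
hist(fit$coefficients[2, ], breaks = 100, col = 2,
     xlab = "Strain", main = "")
abline(v = 0, lwd = 3, col = 1)
```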

The other thing that you can do, is you can go through using the same data set.

You can plot, from the fit, the residuals from the first model fit.

Or the residuals from the second model fit, and so forth.

So you can start to drill down into the different components of

the dataset just like you could with any individual model fit.
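Pulling out the residuals for individual genes can be sketched as follows, again on simulated data with hypothetical object names. In the `lm.fit` output the residuals are stored with samples in rows, so each gene is a column.

```r
# Simulated expression matrix and a two-group design.
set.seed(5)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)
mod <- model.matrix(~ strain)
fit <- lm.fit(mod, t(edata))

# Residuals from the first and second gene-level model fits.
par(mfrow = c(1, 2))
plot(fit$residuals[, 1], col = as.numeric(strain) + 1,
     ylab = "Residual", main = "Gene 1")
plot(fit$residuals[, 2], col = as.numeric(strain) + 1,
     ylab = "Residual", main = "Gene 2")
```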

You can also fit adjusted models using lm.fit, and so

here I'm going to just create a model matrix

that's an adjusted version of that, where I've adjusted for the lane number.

Then I can fit an adjusted model with lm.fit,

to the adjusted model matrix and the same expression data.

Now it fits another set of regression models, and so

we can look at the coefficients again

for the first model fit, and you can see that there's an intercept term and

the strain term, but now we've got all these adjustment variables as well.

It's a very fast way to fit multiple models in R.
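An adjusted fit of this kind can be sketched as below. The `lane` factor with four levels is a made-up adjustment variable standing in for the lane number, and the data are simulated.

```r
# Simulated expression matrix, a strain factor, and a hypothetical lane factor.
set.seed(6)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
lane <- factor(rep(1:4, times = n_samples / 4))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)

# Adjusted design matrix: strain plus indicator columns for lane.
mod_adj <- model.matrix(~ strain + lane)

# Same expression data, new design matrix.
fit_adj <- lm.fit(mod_adj, t(edata))

# Coefficients for the first gene: intercept, strain, then the lane terms.
fit_adj$coefficients[, 1]
```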

If you want to use an approach that actually deals with

moderating the statistics, you can do that with limma.

It also has a fast model fitting approach.

So here what we're going to do is we're going to use the limma package to fit

multiple regression models using the lmFit command.

So this lmFit is a little bit different.

It's not lm.fit, it's lmFit, and it gives you a slightly different result as well.

You still get the coefficients and the residuals, but

here you get some other information that you're going to use,

potentially later on, when computing shrunken versions

of the test statistics.

And so here we can see the coefficients for the first model. Now when you do lmFit

from the limma package, the coefficients for the first model are in the first row.

Remember that the model fit coefficients from the lm.fit function,

they were in the first column.

So this is the first column, versus the first row.

So that's a little bit different.

But otherwise the coefficients are basically the exact same thing as you would get

from lm.fit; the shrinkage only comes in later, when the statistics are moderated.
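A rough sketch of the same adjusted fit with `lmFit` from limma, set next to the base R `lm.fit` result. It assumes the limma package is installed (the block is skipped otherwise), uses simulated data, and all object names are hypothetical. Note that `lmFit` takes the expression matrix with genes in rows, so no transpose is needed.

```r
# Simulated expression matrix, strain factor, and a hypothetical lane factor.
set.seed(7)
n_genes <- 1000
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
lane <- factor(rep(1:4, times = n_samples / 4))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes)
mod_adj <- model.matrix(~ strain + lane)

# Base R fit for comparison: coefficients for gene 1 are in the first column.
fit_adj <- lm.fit(mod_adj, t(edata))

if (requireNamespace("limma", quietly = TRUE)) {
  # limma fit: coefficients for gene 1 are in the first row.
  fit_limma <- limma::lmFit(edata, mod_adj)
  print(names(fit_limma))
  print(fit_limma$coefficients[1, ])
  print(fit_adj$coefficients[, 1])
}
```

The extra components in the `lmFit` result are what a later moderation step, for example limma's `eBayes`, would work from.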

So, the other thing that you can do is you can do this with the edge package.

And this package can be useful if you're working with people who don't necessarily

have as good a working knowledge of model matrices and

linear algebra, because you could just pass it the data set.

And then you tell it what the group variable is, the variable you care about.

And then you tell it what variables to adjust for,

in this case the lane variable.

So then it creates this edge study object,

which you can then fit the models to directly.

And then, if you look at that,

it'll tell you what models it fit, what the coefficients are, and

everything like that.

You can then extract similarly the beta coefficients for

one particular set of models.

Here we're going to extract them from an S4 object, which is

a little bit different, so you have to use an at sign instead of a dollar sign here.

But this will give you the coefficients for the first model fit.

And then similarly, you can get those same coefficients

from the limma model fit, and you can

see that the one difference is that they're not necessarily in the same order.

So you can see, for example, this strain estimate here, is the last estimate here.

And then you get the adjustment covariates first, so

it tends to put the adjustment covariates first, before the other covariates.

That's because it's going to drop that group variable off

when doing model comparison later.
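The edge workflow described above might be sketched like this. It assumes the Bioconductor edge package is installed (the block is skipped otherwise) and that `build_study` and `fit_models` accept the arguments shown. The data are simulated, and `strain`, `lane`, and the gene and sample labels are hypothetical.

```r
# Simulated expression matrix with gene and sample names.
set.seed(8)
n_genes <- 200
n_samples <- 20
strain <- factor(rep(c("A", "B"), each = n_samples / 2))
lane <- factor(rep(1:4, times = n_samples / 4))
edata <- matrix(rnorm(n_genes * n_samples, mean = 10), nrow = n_genes,
                dimnames = list(paste0("gene", seq_len(n_genes)),
                                paste0("sample", seq_len(n_samples))))

if (requireNamespace("edge", quietly = TRUE)) {
  library(edge)

  # Pass the data, the variable of interest, and the adjustment variable;
  # no model matrix has to be built by hand.
  edge_study <- build_study(data = edata, grp = strain, adj.var = lane)

  # Fit the models for every gene.
  fit_edge <- fit_models(edge_study)
  summary(fit_edge)

  # S4 object: slots are accessed with @ rather than $.
  print(fit_edge@beta.coef[1, ])
}
```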

So those are three different ways that you can quickly fit

many regressions in R.