An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

123 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

Linear regression and

its variants are probably the most widely used modeling tool in genomics.

So I'm going to do a quick lecture on how do we do this in R.

So again, I'm going to set up my parameters like I like them to be for

my plots when doing this analysis.

And then I'm going to load the packages for this lecture, and

in particular the broom package is the one that

we're going to be taking advantage of to make things pretty.

So here I'm going to load the body map data set.

So the body map data set can be loaded from the recount URL that I gave there and

then I'm going to define the data set

variables to be a little bit easier to see here.

I'm going to extract the expression data, the venus type data and the feature data.

So the first thing that I'm going to do is I'm going to convert the edata into

a matrix.

And so I'm going to do that so

that it makes it easier to deal with the values on a numeric scale.

And the first thing

that you can do to fit a linear model is just to use the lm command.

And I can fit that gene by gene by taking the first gene of the expression data.

So remember the genes are in the rows, and

then I can relate that using the tilde operator to any variables that I want.

In this case I'm going to say, what's the relationship between expression and age?

And so now that fits a linear model object, which you can type lm1 and

you can see that it's been fit here in that object.

If you do tidy(lm1), this gives you some information.

So what we have here is now a tidied version of that data set, and so

it gives you the estimates, the standard errors, the statistics, and the p-values.

So this estimate is at age zero, what's the average count.

So it's about 2,000 and then for

each additional year we go down a count of minus 0.23 and

then this tells us the standard area of the statistic in the p-value.

So we could actually overlay that on a plot of the data so we can plot,

age versus the first gene, so you can see that there.

And then I might want to add the linear model fit to this data set.

And so one way that you can do that is to just do abline(lm1), and

then tell it what color you want it to be and how wide you want the line to be.

So now there's the fitted values, so that the line be fit through that data set.

So then the tricky thing is that's pretty easy to do when you have

a quantitative covariate that you're modeling, but

if you look at the gender covariate that's a factor variable.

And so sometimes you have about eight males and

I'm going to make it eight females and so if I want to look at the data set

here I can do a box plot of that data versus the gender of each sample.

And so then I can overlay the points on top of it.

And so I thing you might want to do is a linear model to relate

this categorical variable to the out come of expression.

And so here you can see that there's a relationship between gender and

this genes expression.

It's not made particularly very strong.

Given these aren't that far apart and there's quite a bit of variability,

but how do we quantify that.

So one way to do that is by defining dummy variables, and so

dummy variables are basically variables that equal one if something is true.

So in this case if the gender is equal to male then you have a dummy variable

that's true when you're equal to male and false otherwise.

So if you multiply that by one and R it'll give you ones and zeros.

You can see there are some NA values here for the gender variable.

You could similarly do that.

You could just define the opposite way.

You can define a dummy variable that's only true if

the gender is equal to female.

So then R actually does this for you directly.

So you can fit a linear model relating the first genes

expression to gender, and it automatically turns that into a dummy variable.

So if we tidy that up, we can see that

we get a coefficient here that's the change in counts if you're a male.

So the average change in count is minus 105 for a male compared to a female for

that gene.

So what's going on under the hood here, R is actually if you can

look at the model matrix function, is actually what's happening under the hood.

It's taking that variable gender which was just a factor variable, and

it's turning it into ones and zeros.

One when you're male and zero when you're a female.

And then it's using that quantitative variable to fit the actual

regression model just like it was done before.

You can actually do this with even more categories.

So if you look at the tissue type variable,

it has a large number of categories here, and so

you can define multiple dummy variables, one for each tissue type.

So if you say, is the tissue type equal to adipose, then only for

the first sample is that true.

You could also say, is the tissue type equal to adrenal.

So that's only true for the second sample and so forth.

So you could define a whole series of dummy variables, and

R will actually do this for you.

So if I relate the first genes expression to this factor variable that has multiple

levels, R is going to actually fit coefficients for every one of these.

So the baseline intercept term is the one that didn't get if in the model.

So you have to be a little bit clever here, you have to figure out which of

these tissue types doesn't actually appear in the list of coefficients.

So it turns out its usually the first one, so adipose is missing, so

this is sort of the average count in adipose.

And then you can interpret this estimate as being the difference in

average count between adipose and adrenal.

So if you add the two together that's the average count for the adrenal samples.

So you can fit this more complicated example.

The other thing you can do is adjust for variables and

so we talked about adjusting for variables.

And the way that you can do that is basically by expanding the model formula.

So here going to add,

we're going to add a model that has both an age term and a gender term.

And so then, if I tidy up that data set,

I can see that there's estimates here for age and for gender.

And so for example, this estimate for age is the difference in counts for

one extra year of age when you hold gender fixed.

So either within the males or within the females.

So then you can actually fit interaction models if you want to be able to

get a little bit more complicated, because this fits the same relationship for

the males and the females.

And so to do that in R, you basically use the multiplying command in the formula.

So you go, if you want to fit interaction term,

pdata age times pdata gender, and

if you tidy that model up, you can see now there's a term for age and a term for

gender and a term for age interacting with gender.

And so what does this mean it means it fits one line if you're a male then you

have the age, the age effect is basically minus 13 minus 18.

But if you're a female this term goes to zero because the gender male term is zero

So this is the line, the slope of the line,

minus 13 with respect to age for females.

So you get two different lines with two different slopes by fitting

an interaction term.

These get a little trick so

you've got to be a little careful when interpreting them.

The next thing that you can do is you can basically overlay these points again,

on top of the graph.

So here I'm going to fit a linear model relating the sixth genes data to the age,

and I can plot that gene and I can see that there's an outlier there.

So if I add the line, you can see that it's

flat, but you can actually do this when you

actually subtract out values that have outliers in different places.

So in this case the outlier doesn't necessarily affect the line fit,

it doesn't look like it's pulling it one direction or another.

But we can check out another case where it might.

And so to do that I'm going to instead of plotting that variable versus age I'm just

going to change it so that I plot it versus the index of where it appears.

And so when I do that if I plot the data on that scale so if I plot the index

versus this value, I can see that the outliers right over here at the end.

So now if I fit a relationship between index and the six genes expression, I

can see that, so I've got this plot with the outlier at the end.

I can see that the, in the overlay the model fit,

I can see that it's being pulled up a little bit by this outlier.

So you can see that, instead of having a flat line,

there's a line that's sort of being pulled up by the outlier over there.

So, if I remove that outlier, suppose I just take that one outlier out and

I refit the model, so now here I'm doing the same thing,

I'm fitting the same model, but see I'm subtracting the 19th data point.

From both the predictor and from the outcome, and so

then if I make that same line,

but now for the model field without the outlier, you can see that it's flat.

So you can see that basically that outlier is what's driving this whole curve.

So you gotta be very careful when you see outliers on these data sets.

I'm adding a legend here to show which one is the curve with the outlier and

which one's without the outlier.

But in general you can see that adding outliers,

especially near the ends of curves can really change the values that you get.

If I look at the residuals from these two, so if I make a plot side by side like in

part(mfrowc(1,2)), I can make a histograms of

the residuals, for each of the two models.

The one without the outlier and one with the outlier.

You can see that one with the outlier has a huge outline residuals.

Well, that's not the way you can diagnose if you have these residuals.

So the other thing you can do is, you can, if you want to see these again,

these distributions, because their counts aren't necessarily centered at zero,

they don't necessarily have the distribution that we'd like.

And so one thing that we can do is you can actually do the transformation

that we've been doing.

We can do log two transform data, and

you can fit a linear aggression model to the transform data.

And then if you look at the residuals, compared to the residuals

from one of these other examples, they actually look a little bit better.

Because we've kind of made this scale more appropriate by doing this transformation.

And so the other thing that you can do is you can basically fit these models

with as many coefficients as you want and R will fail gracefully but

sometimes that's not necessarily a good thing.

So here's an example where I can take that same

transform data and I fit a model with tissue type, and a model with age.

And then if I tidy up that model, you can see many of the coefficients

are not defined and that's the standard errors and statistics are not defined and

that's because you have too many covariance here.

So the dimension of this model that we're fitting.

So if you look at the dimension of the model matrix that you're fitting.

It's 16 by 18.

So you're basically fitting too many data points.

There are only 16 data points,

but you have 18 parameters that you're trying to fit.

And so R will fill gracefully, it will actually come up with estimates,

though they're not very reliable given that we basically soaked up all

the data just to estimate this linear regression model.

So you have to be very careful when you're including covariants to make

sure you haven't included covariants with too many different levels.

So then the last thing that you can do here is if you fit a linear

model relating a variable to age, you can then plot the residuals.

And you can see those residuals colored by different variables.

So, here, I'm going to do this.

Plot on the scale where you can see it.

One thing that a diagnostic that people often use when doing these things is

a fit, a linear regression model, then they plot the residuals and

here they're going to color the residuals by the tissue type.

And so you can see for example this tissue type has a residual that is very high.

So this can be often a very useful diagnostic is to plot the residuals

colored by different variables.

So you can identify variables that seem to be associated with the residuals, and

they might be a variable that you want to actually include in your model after all.