An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

123 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

One of the most common causes of false positives in genomics are batch effects or

other confounders.

So I'm going to show you a little example of how you remove those in r.

So I'm going to set up my plotting parameters like I'd like and

then I'm going to load the libraries that we need.

In this case, it's the SVA package and the snip stats package in the main

drivers of most of the analysis that we're going to be doing here.

So I load those packages, and then I'm going to load this bladder data set.

In this case, I have a expression set, a bladder expression set.

I'm going to extract the phenotype information and

the expression information from it.

And so now I have an expression object that has 57 samples in it,

and I have some information on the phenotypes for that.

And so in this case I have information on the outcome and I also have information

on a batch variable and then whether it's a cancer sample or not.

And so the first thing that you might want to do if you want to adjust for

you want to deal with the batch effect is if it's labeled like this, if you actually

have the batch variable you might be able to directly adjust for it.

So the way that you could do that is you could build a model matrix for

the cancer variable.

And then you could actually just add in another one, the batch variable,

say where you want the data to come from, that's from this phenotype data.

So now I've got this model matrix, which I can then fit with lm.fit.

We learned in this last lecture and so now

I have these coefficients that have been adjusted for batch effects, basically.

I've adjusted for them directly with the lm.fit function.

So I can now look at the coefficients for

the variable that I care about across all of the different samples.

And so that works out pretty well the one thing that you have to be a little bit

careful of is if there's a tight correlation between the batch variable and

the outcome variable that you care about.

So ideally you'll have balanced within each batch you'll have

observations from multiple different groups.

In this case it's a little bit tricky, because for

example say the fourth batch only has biopsy samples.

So fortunately they're not perfectly confounded in the sense that

in the fifth batch you actually had some biopsy and some cancer samples.

And similarly you have some comparison between cancer and

normal within the second batch so you have some information to gain there.

But if these are perfectly correlated, in other words if all the cancer samples

are run in batch 1 and all the normal samples are run in batch 2.

You can't use this,

any of the adjustment methods I'm going to be talking about here.

So you have to be very careful about that at the experimental design stage.

Another way that you can adjust for batch effects if you want to,

if you actually know them if you have them in advance is to use Combat.

Which is basically is the similar approach to direct linear adjustment but

it basically does some of the shrinkage we talked about.

So Ttat can have some advantages when you have especially small sample sizes.

And so, here I have to build a model matrix for the combat function.

And so, the model matrix in this case,

I'm going to tell it has nothing but an intercept term.

And then I have the model that I'm actually going to test.

Which is the model that has the cancer term in it.

So then what I can do is when I'm running combat,

I can basically run combat to get a cleaned expression data set.

So I do combat and then I tell it what data to look for

in this case it's the expression data and I tell it what the batch variable is and

then I tell it is there anything else that it should be adjusting for.

And then, it basically fits

a set of models and you can have it either plots or output, or not.

And so, what will end up happening, though, is that it's going to fit

this model, and it's going basically regress out the badge effect.

And so, then you can go through, and you can do lm.fit on the cleaned data.

So you basically get the combat fit on this data and

then you can fit the lm.fit model with our cancer model

matrix now with the data set that was cleaned with combat.

So now I have a combat fit coefficients which I can look at as well.

So I can look at the coefficients from this.

So I can see here that they seem to be a little bit smaller than the coefficients

that I got when I fit the direct linear adjustment, and

in fact, that tends to be the case.

So here I'm going to make a plot It has a few parameters so I'm not going to

type the whole thing here but I'm going to plop the coefficients from the original

linear model fit versus the combat model fit.

So if I do that I can actually add the 45 degree line.

Actually, here's what I'm going to do I'm going to make it a one by one plot and

then I'm going to make that plot and add the 45 degree line.

And so you can see for

example that the combat values tend to be a little bit smaller, so the y axis is

a little bit smaller than the x axis here because we've done that shrinkage.

So the other thing that can happen is imagine you either don't know that batch

variable, or imagine that batch variable doesn't encompass all the different

potential technical artifacts, then you might want to infer the batch variables,

and you can do that with the SVA package SVA function.

So here I'm going to build that model matrix with the cancered outcome that I

care about.

And then I'm going to build the model that I'm going to compare that to.

So, in this case, I'm going to compare it to a model with no covariants in it so

it's the null model.

And then what I can do is actually create some covariants by

running the SVA function on this data set.

So I pasted the expression data set, the model for including the variable I care

about, the model not including the variable I care about.

And then I tell it how many surrogate variables, or

how many surrogate batch effects it should estimate.

So then it's going to go through and

estimate those batch effects through an iterative procedure.

So now what I can do is I can actually see that the SVA object that's returned has

this SV component, which is basically new co-variants that have been created.

That are the potential batch effects, and so for example, I can plot,

or I can correlate those batch effects with the actual observed

batch variable, and I can see that our estimated surrogate variable is,

this second one is super highly correlated with the batch file.

The first one isn't necessarily as correlated with the batch variable

as the first one.

So then I can make a plot of that second surrogate variable

versus the batch variable.

And I can see that there's like a relationship between the inferred

surrogate variable, the inferred batch effect, and the observed batch effect.

And so then I can actually plot these

data points on top of that to see if that's true.

So what we've done here with the SVA is not necessarily actually cleaning

the data set.

We've just identified new covariates that we now need to include in our model fit.

So we can combine them with our original model that had the cancer term in it.

And so we end up with a model fit that includes both

our estimated surrogate variables as well as our cancer status variables.

And so then I can use lm.fit or any of the other methods we talked about in

the previous lecture to get the model fit after you adjust for those variables.

So now what I can do is I can see what the model fits look like

comparing SVA to combat for example.

Here, I'm plotting SVA versus combat.

So on the x axis is going to be the SVA adjusted values,

on the y axis is the combat adjusted values.

So you can see that they're actually pretty highly correlated with each other.

And then you can do the same thing, for

the SVA versus the just the linear model adjusted without the combat adjustment.

And you can see here, that these aren't necessarily shrunken and so

they are maybe a little bit more correlated without the shrinking.

So that's basically how you can run SVA on

the data set to infer the batch effects if you don't know what they are.