An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Maybe the biggest confounder in most studies is what's called batch effects.

So I'm going to talk a little bit about batch effects and confounders, and

how do you adjust for them, and how do you deal with them.

So what are the sources of batch effects?

So batch effects is actually quite a broad term when statisticians talk about it.

When it's used by biologists and genomic scientists, they often talk about

the batch that the samples were processed in, so the time at which a group of

samples went through together, or the slide that they went through together on.

But it turns out that batch effects is a surrogate for

lots of different confounders.

So there could be external factors like the environment that might affect

the level of genomic measurements, there could be genetic or

epigenetic factors that contribute to the overall expression of a gene,

or the overall epigenetic profile.

And then there could be technological factors.

So for example, if you have a diligent scientist versus a crazy scientist working

away and doing an experiment, you might get different results.

So all of these things are confounders that you might need to adjust for

when doing your experiment.

So here's a quick example of that.

So I've taken three studies, you can find the results of these studies in these

different papers that I've linked to here.

And so I colored by different things.

So, in this case, I've done a clustering, and I colored by the environment.

And so you can see that the orange environment

sort of clusters together into these two clusters.

Here, I colored by processing year and

here you can see that the purple processing year clusters together.

And then here, in this study,

I've actually clustered the data just by a particular allele,

and so here, you can see that this allele, the orange case, they cluster together.

So these are expression measurements that cluster together by different variables.

Any one of these could be a confounder or a batch effect that you have to adjust for

if it's not the variable of interest in your study.
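
A rough way to run the same kind of check numerically is to ask whether a candidate variable lines up with the main axis of variation in the data. The sketch below swaps the clustering shown on the slides for a principal component computed by SVD. The data are simulated, and the `year` labels and effect sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes, n = 200, 12

# Hypothetical processing-year labels: first six samples vs last six.
year = np.repeat([0.0, 1.0], n // 2)

# Simulated expression matrix: genes in rows, samples in columns.
Y = rng.normal(size=(n_genes, n))
# Assume 80 of the genes shift with processing year.
Y[:80] += np.outer(rng.normal(scale=2.0, size=80), year)

# Center each gene, then take the first right singular vector (PC1 scores).
Yc = Y - Y.mean(axis=1, keepdims=True)
pc1 = np.linalg.svd(Yc, full_matrices=False)[2][0]

# A large absolute correlation flags the variable as a possible batch effect.
corr = abs(np.corrcoef(pc1, year)[0, 1])
print(round(corr, 2))
```

A high correlation here only says the variable tracks the dominant pattern in the data. Whether it counts as a confounder still depends on whether it is the variable of interest.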

So this doesn't just affect the continuous measurements like gene expression

measurements.

This is actually data from the 1000 Genomes Project,

so this is a particular genomic location between this base pair and this base pair.

And so you can see the samples are ordered by date here.

And this is after normalizing the samples to have the same read coverage for

all the different samples on average, so the global distributions are the same.

You can see that there's still a set of samples here that seem to have

a much higher level of coverage than the set of samples

that were processed on a different date here for this region.

And so this sort of batch effect appears in lots of different

types of genomic data.

So when can you deal with batch effects and

when can you not deal with batch effects?

Well, if each biological group is run entirely on its own batch,

then it's impossible to tell the difference between the group biology and

the batch variable when you're doing your statistical analysis.

Now if on the other hand you run replicates of the different groups on

the different batches, so

you get some samples from each different group on each different batch, then it's possible

to distinguish the difference between the batch effects and the group effects.
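
The two situations can be told apart by cross-tabulating group against batch. The sketch below is illustrative only: the sample labels and the helper name `is_confounded` are made up for this example.

```python
from collections import defaultdict

def is_confounded(groups, batches):
    """Return True if every batch contains samples from only one group.

    In that case group is fully determined by batch, so the group effect
    and the batch effect cannot be separated.
    """
    groups_per_batch = defaultdict(set)
    for g, b in zip(groups, batches):
        groups_per_batch[b].add(g)
    return all(len(gs) == 1 for gs in groups_per_batch.values())

# Hypothetical design 1: each group is run entirely in its own batch.
confounded_groups = ["case"] * 4 + ["control"] * 4
confounded_batches = ["b1"] * 4 + ["b2"] * 4

# Hypothetical design 2: both groups are represented in both batches.
balanced_groups = ["case", "case", "control", "control"] * 2
balanced_batches = ["b1"] * 4 + ["b2"] * 4

print(is_confounded(confounded_groups, confounded_batches))
print(is_confounded(balanced_groups, balanced_batches))
```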

So the first step in dealing with these batch effects is good study design,

using blocking or randomization of samples.
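
As one possible way to set this up, the sketch below shuffles the samples within each group and then deals them out to batches in turn, so every batch gets a share of every group. The function name `blocked_assignment`, the group labels, and the batch count are assumptions for illustration.

```python
import random

def blocked_assignment(groups, n_batches, seed=0):
    """Assign samples to batches so each group is spread evenly across batches.

    Within each group the sample order is shuffled, then samples are dealt
    out to batches in turn (a simple randomized block design).
    """
    rng = random.Random(seed)
    by_group = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)

    assignment = [None] * len(groups)
    for members in by_group.values():
        rng.shuffle(members)
        offset = rng.randrange(n_batches)  # random starting batch per group
        for k, idx in enumerate(members):
            assignment[idx] = (k + offset) % n_batches
    return assignment

# Hypothetical study: 6 cases and 6 controls processed in 3 batches.
groups = ["case"] * 6 + ["control"] * 6
batches = blocked_assignment(groups, n_batches=3)
print(list(zip(groups, batches)))
```

With group sizes that divide evenly by the number of batches, every batch ends up with the same number of samples from each group.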

So the next thing that people do is fit a regression model to

model the effect of batch.

So this only works, again, if there's not intense correlation or

high correlation between the phenotype and the batch.

So here again, we're fitting a regression model where Y is the outcome,

the genomic measurement that we care about.

P is the phenotype that we care about, and

B is the batch variable that we measured in this data set.
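
The slide is not reproduced in this transcript. Written out, a model matching that description could look like the following, where the coefficient names and the error term are notation assumed here:

```latex
Y = \beta_0 + \beta_1 P + \beta_2 B + \varepsilon
```

Here $\beta_1$ is the phenotype effect of interest and $\beta_2$ is the batch effect being adjusted for.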

So if P and B are the exact same thing, then these two variables are the same, and

this just merges into one term and you can't really adjust for it.

But if P and B are, say, uncorrelated with each other, or orthogonal to each other

because of good experimental design, this is straightforward to estimate.
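
To make the contrast concrete, here is a minimal sketch using ordinary least squares on simulated data. The effect sizes, sample size, and variable names are assumptions for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Hypothetical balanced design: phenotype alternates, batch splits in half,
# so each phenotype level appears equally often in each batch.
phenotype = np.tile([0.0, 1.0], n // 2)
batch = np.repeat([0.0, 1.0], n // 2)

# Simulated measurement with assumed true effects: phenotype 2, batch 3.
y = 1.0 + 2.0 * phenotype + 3.0 * batch + rng.normal(scale=0.1, size=n)

# Design matrix with intercept, phenotype, and batch columns.
X = np.column_stack([np.ones(n), phenotype, batch])
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef, rank)

# If batch were identical to phenotype, the design matrix would lose a rank
# and the two effects could no longer be separated.
X_confounded = np.column_stack([np.ones(n), phenotype, phenotype])
rank_confounded = np.linalg.matrix_rank(X_confounded)
print(rank_confounded)
```

In the balanced case the design matrix has full rank and both coefficients are recovered. In the confounded case the rank drops, which is the numerical version of the two terms merging into one.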

So again, we're going to fit many of these regression models,

so again we've stacked the data up along rows, and then we have the samples in

columns, and we want to relate that to some set of primary variables and

some set of adjustment variables, the batch variables.

We're going to fit that regression model over and over again.
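
Fitting the model "over and over" does not need an explicit loop, because least squares accepts a matrix of outcomes. The sketch below is one way to do it on simulated data, with genes in rows and samples in columns. All sizes, effects, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 100, 20

phenotype = np.tile([0.0, 1.0], n_samples // 2)   # primary variable
batch = np.repeat([0.0, 1.0], n_samples // 2)     # adjustment variable
X = np.column_stack([np.ones(n_samples), phenotype, batch])  # samples x 3

# Simulated data matrix: genes in rows, samples in columns.
true_coef = rng.normal(size=(n_genes, 3))
Y = true_coef @ X.T + rng.normal(scale=0.1, size=(n_genes, n_samples))

# lstsq accepts a matrix right-hand side, so one call fits every gene.
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)    # shape: 3 x genes
coef = coef.T                                     # shape: genes x 3

# Subtract only the fitted batch term, keeping the phenotype signal.
Y_adjusted = Y - np.outer(coef[:, 2], batch)
print(coef.shape, Y_adjusted.shape)
```

Subtracting the fitted batch term is shown only to make the adjustment visible. Keeping batch as a covariate in the model used for testing is the other common choice.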

So you can actually do this a little bit more cleverly using an empirical

Bayes method.

So basically what this does is it shrinks down the estimates

towards their common mean.

And so this is actually a very highly cited paper with over 1000 citations for

adjusting batch effects.
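
To give a feel for what "shrinking toward the common mean" does, here is a toy normal-normal shrinkage sketch. It is not the method from the paper mentioned above. The function name `shrink_toward_mean`, the known sampling variance, and the simulated effects are all assumptions for illustration.

```python
import numpy as np

def shrink_toward_mean(estimates, sampling_var):
    """Shrink noisy per-gene estimates toward their common mean.

    Uses a normal-normal model: the prior mean and variance are estimated
    from all genes (the "empirical" part), then each estimate becomes a
    weighted average of itself and the common mean.
    """
    estimates = np.asarray(estimates, dtype=float)
    prior_mean = estimates.mean()
    # Method-of-moments estimate of the prior variance, floored at zero.
    prior_var = max(estimates.var(ddof=1) - sampling_var, 0.0)
    weight = prior_var / (prior_var + sampling_var)  # 0 means full shrinkage
    return prior_mean + weight * (estimates - prior_mean)

rng = np.random.default_rng(2)
true_effects = rng.normal(loc=1.0, scale=0.5, size=500)  # simulated batch effects
noisy = true_effects + rng.normal(scale=1.0, size=500)   # noisy per-gene estimates
shrunk = shrink_toward_mean(noisy, sampling_var=1.0)

mse_noisy = np.mean((noisy - true_effects) ** 2)
mse_shrunk = np.mean((shrunk - true_effects) ** 2)
print(mse_noisy, mse_shrunk)
```

Borrowing information across genes in this way reduces the error of the individual estimates, which matters most when each batch contains only a few samples.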

And so another way that you can do this, though,

is if you basically don't know what the batch effects are.

So this is really common in genomics experiments where batch effects

could be due to a large number of things.

You end up with a model that looks like this.

You have the genes in the rows and the samples in the columns.

And here you might have some primary variables that you care about.

And then there's some random variation.

Now some of that random variation is due to sampling and usual measurement error,

but some of it might be due to batch effects or other things.

So what you might want to do is decompose this into the random

independent variation that you would expect when doing linear modeling.

And some kind of dependent variation.

Which you can further break down into an estimated batch variable.

So the idea is basically can you use the data itself

to estimate these terms over here and so

you could essentially estimate batch from the data itself and then adjust for it.
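
The slide with this model is not in the transcript. One way to write the decomposition just described is below, with symbols chosen here as assumptions: $Y$ is the genes-by-samples data matrix, $X$ holds the primary variables, $E$ is the "random variation", $G$ holds the estimated batch variables, and $U$ is the independent noise.

```latex
\begin{aligned}
Y &= \beta X + E \\
E &= \Gamma G + U \\
\text{so}\quad Y &= \beta X + \Gamma G + U
\end{aligned}
```

The terms "over here" that get estimated from the data would then be $G$ and its per-gene coefficients $\Gamma$.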

And so there's an algorithm that's been developed for

doing this called Surrogate Variable Analysis.

So imagine here you have some simulated data.

And so, in the rows are your genomic measurements, and

in the columns are your samples.

And suppose that there's a difference due to a primary variable that you care about.

And so that's the difference between the first ten samples and

the last ten samples.

But then suppose there's a batch variable that's introducing some difference

in expression due to that batch. So the first thing that you could do is you could

come up with a true estimate of the batch variable.

So you're going to do that by taking an original estimate of the batch

variable like this, which could be any estimate that you come up with

that is even remotely correlated with that batch variable.

And then you want to estimate what this true batch variable is.

So it goes high for two samples, low for

two samples, high for five samples, and so forth.

So now we're going to look at this indicator.

This is the indicator that each gene is not affected by the group variable you

care about but is affected by the batch variable.

So one thing that you could do is you could use this estimate of batch in

a linear regression model to update these probability estimates.

So it's described in this paper here, this algorithm.

So then you could weight the matrix and recalculate, say some decomposition,

say the principal components or singular value decomposition like we have,

and then we update the estimate of the batch variable.

We can then use the estimate of the batch variable to update the probability

weights, and update the data by weighting by those weights again, and

then re-update the new batch variable.

So now, once you've done this iterative algorithm,

you've sort of removed the genes that are mostly driven by the group variable, and

you're focusing on the genes driven by batch, so the decomposition gives

an estimate for batch, which is pretty close to the real batch variable.

You can then include this as if it was a measured batch variable in

your statistical analysis, to adjust for that variable and remove batch.
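
The loop described above can be sketched roughly as follows. This is a heavily simplified, single-surrogate illustration and not the published algorithm or any package's implementation. In particular, the per-gene probabilities are replaced by crude partial R-squared weights, and the function names, simulated data, and effect sizes are all assumptions.

```python
import numpy as np

def partial_r2(Y, X_full, X_reduced):
    """Per-gene fraction of residual variance explained by the extra columns."""
    def rss(X):
        coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
        resid = Y.T - X @ coef
        return (resid ** 2).sum(axis=0)
    return 1.0 - rss(X_full) / np.maximum(rss(X_reduced), 1e-12)

def estimate_surrogate(Y, group, n_iter=5):
    """Simplified iterative estimate of one unmeasured batch variable."""
    n = Y.shape[1]
    ones = np.ones(n)
    Yc = Y - Y.mean(axis=1, keepdims=True)

    # Initial estimate: top right singular vector of group-adjusted residuals.
    X_group = np.column_stack([ones, group])
    coef, *_ = np.linalg.lstsq(X_group, Y.T, rcond=None)
    resid = (Y.T - X_group @ coef).T
    sv = np.linalg.svd(resid, full_matrices=False)[2][0]

    for _ in range(n_iter):
        X_sv = np.column_stack([ones, sv])
        X_both = np.column_stack([ones, group, sv])
        # Crude stand-ins for the probabilities described in the lecture.
        w_batch = partial_r2(Y, X_both, X_group)  # evidence of a batch effect
        w_group = partial_r2(Y, X_both, X_sv)     # evidence of a group effect
        weights = w_batch * (1.0 - w_group)
        # Re-decompose the row-weighted matrix and update the batch estimate.
        sv = np.linalg.svd(Yc * weights[:, None], full_matrices=False)[2][0]
    return sv

# Simulated data: group is first ten vs last ten samples, batch is hidden.
rng = np.random.default_rng(3)
n_genes, n = 1000, 20
group = np.repeat([0.0, 1.0], n // 2)
batch = np.tile([1.0, 1.0, 0.0, 0.0], n // 4)
Y = rng.normal(size=(n_genes, n))
Y[:300] += np.outer(rng.normal(scale=2.0, size=300), group)
Y[200:500] += np.outer(rng.normal(scale=2.0, size=300), batch)

sv = estimate_surrogate(Y, group)
corr = abs(np.corrcoef(sv, batch)[0, 1])
print(round(corr, 2))
```

The returned `sv` would then be added as an extra column of the design matrix, next to the group variable, in the same way a measured batch variable would be.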

So, this is what you would do if you don't have the batch variable measured.

If you want an introduction to batch effects, this review paper is very good.

This paper that I've mentioned about adjusting for

batch effects with empirical Bayes is also a very good introduction to adjusting for

batch effects when they're known.

And then if you want to learn a lot more about surrogate variable analysis,

that last technique for estimating batch effects from the data itself,

you can check out this paper below.