An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Maybe the biggest confounder in most studies is what's called batch effects.

So I'm going to talk a little bit about batch effects and confounders,

how you adjust for them, and how you deal with them.

So what are the sources of batch effects?

Batch effects is actually quite a broad term when statisticians talk about it.

When it's used by biologists and genomic scientists, they often mean

the batch that the samples were processed in, so the time at which a group of

samples went through together, or the slide that they went through together on.

But it turns out that batch is a surrogate for

lots of different confounders.

There could be external factors like the environment that might affect

the level of genomic measurements, there could be genetic or

epigenetic factors that contribute to the overall expression of a gene,

or the overall epigenetic profile.

And then there could be technological factors.

So for example, if you have a diligent scientist versus a crazy scientist working

away and doing an experiment, you might get different results.

So all of these things are confounders that you might need to adjust for

when doing your experiment.

So here's a quick example of that.

I've taken three studies; you can find the results of these studies in the

different papers that I've linked to here.

And I colored the samples by different things.

In this case, I've done a clustering, and I colored by the environment.

You can see that the orange environment

clusters together into these two clusters.

Here, I colored by processing year,

and you can see that the purple processing year clusters together.

And then here, in this study,

I've actually clustered the data just by a particular allele,

and you can see that the samples with this allele, the orange case, cluster together.

So these are expression measurements that cluster together by different variables.

Any one of these could be a confounder or a batch effect that you have to adjust for

if it's not the variable of interest in your study.

This doesn't just affect continuous measurements like gene expression

measurements.

This is actually data from the 1000 Genomes Project,

and this is a particular genomic location between this base pair and this base pair.

You can see the samples are ordered by date here.

And this is after normalizing the samples to have the same read coverage for

all the different samples on average, so the global distributions are the same.

You can see that there's still a set of samples here that seem to have

a much higher level of coverage than the set of samples

that were processed on a different date here for this region.

And so this sort of batch effect appears in lots of different

types of genomic data.

So when can you deal with batch effects and

when can you not deal with batch effects?

Well, if each biological group is run on its own batch,

then it's impossible to tell

the difference between the group biology and

the batch variable when you're doing your statistical analysis.

Now if, on the other hand, you run replicates of the different groups on

the different batches, so

you get some from each different group on each different batch, then it's possible

to distinguish the difference between the batch effects and the group effects.
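The distinction between the two designs can be checked before any data is collected. The sketch below is a minimal illustration; the function name `is_confounded` and the sample labels are hypothetical. It reports whether every batch contains samples from only one group, which is the case where group and batch cannot be told apart.

```python
from collections import defaultdict


def is_confounded(groups, batches):
    """Return True if every batch contains samples from only one group."""
    groups_per_batch = defaultdict(set)
    for group, batch in zip(groups, batches):
        groups_per_batch[batch].add(group)
    return all(len(found) == 1 for found in groups_per_batch.values())


# Each group run on its own batch: group and batch cannot be separated.
confounded_design = is_confounded(
    ["case"] * 4 + ["control"] * 4,
    ["b1"] * 4 + ["b2"] * 4,
)

# Some samples from each group on each batch: the effects can be separated.
balanced_design = is_confounded(
    ["case", "case", "control", "control"] * 2,
    ["b1"] * 4 + ["b2"] * 4,
)

print(confounded_design, balanced_design)
```

A design where only some batches mix groups is partially confounded; this check flags only the fully confounded case.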

So the first step in dealing with these batch effects is good study design,

using blocking or randomization of samples.

The next thing that people do is fit a regression model to

model the effect of batch.

This only works, again, if there's not

high correlation between the phenotype and the batch.

So here again, we're fitting a regression model where Y is the outcome,

the genomic measurement that we care about.

P is the phenotype that we care about, and

B is the batch variable that we measured in this data set.
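The model shown on the slide is not reproduced in the transcript. Assuming the usual linear form for the quantities just named, with hypothetical coefficient names $b_0$, $b_1$, $b_2$ and an error term $e$, it can be written as:

```latex
Y = b_0 + b_1 P + b_2 B + e
```

Here $b_1$ is the phenotype effect of interest and $b_2$ is the batch effect being adjusted for.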

So if P and B are the exact same thing, then these two variables are the same,

this just merges into one term, and you can't really adjust for it.

But if P and B are, say, uncorrelated with each other, or orthogonal to each other

because of good experimental design, this is straightforward to estimate.
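A minimal sketch of both situations for a single simulated gene, assuming NumPy is available. The effect sizes (2 for phenotype, 3 for batch) and the noise level are made-up values for illustration. With a balanced design the least-squares fit recovers both effects; when batch equals phenotype the design matrix loses a rank and the two effects cannot be separated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced design: each phenotype level appears in each batch.
p = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)  # phenotype P
b = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # batch B

# Simulated measurement with made-up effects: intercept 1, phenotype 2, batch 3.
y = 1.0 + 2.0 * p + 3.0 * b + rng.normal(scale=0.1, size=p.size)

# Design matrix with intercept, phenotype, and batch columns.
X = np.column_stack([np.ones_like(p), p, b])
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("rank:", rank, "phenotype effect:", coef[1], "batch effect:", coef[2])

# Confounded design: batch is identical to phenotype, so two columns coincide.
X_confounded = np.column_stack([np.ones_like(p), p, p])
rank_confounded = np.linalg.matrix_rank(X_confounded)
print("rank when P and B are the same:", rank_confounded)
```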

So again, we're going to fit many of these regression models.

We've stacked the data up with the measurements in the rows and the samples in the

columns, and we want to relate that to some set of primary variables and

some set of adjustment variables, the batch variables.

We're going to fit that regression model over and over again.
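Because every gene shares the same design matrix, the repeated fits can be done in one least-squares call. The sketch below assumes NumPy and uses simulated data; the matrix sizes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 100, 8

# Shared design: intercept, primary variable, and batch (adjustment) variable.
p = np.tile([0.0, 1.0], 4)    # 0, 1, 0, 1, ...
b = np.repeat([0.0, 1.0], 4)  # 0, 0, 0, 0, 1, 1, 1, 1
X = np.column_stack([np.ones(n_samples), p, b])

# Simulated data matrix: genes in rows, samples in columns.
true_coef = rng.normal(size=(3, n_genes))
Y = (X @ true_coef).T + rng.normal(scale=0.01, size=(n_genes, n_samples))

# lstsq accepts a 2-D right-hand side, so each column of Y.T (one gene)
# gets its own regression against the same design matrix.
coef_hat, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
print(coef_hat.shape)  # one column of three coefficients per gene
```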

You can actually do this a little bit more cleverly using empirical

Bayes methods.

Basically what this does is shrink the estimates

towards their common mean.

And this is actually a very highly cited paper, with over 1,000 citations, for

adjusting for batch effects.
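To illustrate only the shrinkage idea, not the method in the cited paper, here is a simplified normal-normal sketch. It assumes each gene has a raw batch-effect estimate with a known, common sampling variance; the function name and the numbers are hypothetical.

```python
from statistics import mean, pvariance


def shrink_to_common_mean(estimates, sampling_var):
    """Shrink per-gene estimates toward their mean (sampling_var must be > 0)."""
    grand_mean = mean(estimates)
    # Method-of-moments estimate of the spread of the true effects,
    # floored at zero so the weight stays between 0 and 1.
    prior_var = max(pvariance(estimates) - sampling_var, 0.0)
    weight = prior_var / (prior_var + sampling_var)
    return [grand_mean + weight * (e - grand_mean) for e in estimates]


raw = [-2.0, -0.5, 0.0, 0.5, 2.0]
shrunk = shrink_to_common_mean(raw, sampling_var=1.0)
print(shrunk)
```

Noisier estimates (a larger `sampling_var`) are pulled harder toward the common mean, which is what stabilizes the per-gene batch estimates when each batch has few samples.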

Another way that you can do this, though,

is for when you basically don't know what the batch effects are.

This is really common in genomics experiments, where batch effects

could be due to a large number of things.

You end up with a model that looks like this.

You have the genes in the rows and the samples in the columns.

And here you might have some primary variables that you care about.

And then there's some random variation.

Now some of that random variation is due to sampling and the usual measurement error,

but some of it might be due to batch effects or other things.

So what you might want to do is decompose this into the random

independent variation that you would expect when doing linear modeling,

and some kind of dependent variation,

which you can further break down into an estimated batch variable.
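The slide's model is not in the transcript. One way to write the decomposition just described uses assumed notation: $Y$ is the data matrix with genes in rows and samples in columns, $X$ holds the primary variables, $\beta$ their gene-specific coefficients, $G$ the unmeasured batch variables, $\Gamma$ their gene-specific coefficients, and $U$ the independent noise.

```latex
Y = \beta X + E, \qquad E = \Gamma G + U
```

The dependent variation is the $\Gamma G$ term, and $U$ is the independent variation expected in ordinary linear modeling.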

So the idea is basically: can you use the data itself

to estimate these terms over here? If so,

you could essentially estimate batch from the data itself and then adjust for it.
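A minimal sketch of estimating batch from the data itself, assuming NumPy and simulated data. It removes the primary variable by regression and takes the top right singular vector of the residuals as the batch estimate. This is a simple stand-in for illustration, and all sizes and effect values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 200, 20

group = np.repeat([0.0, 1.0], 10)         # primary variable
batch = np.tile([1.0, 1.0, 0.0, 0.0], 5)  # treated as unmeasured

# Simulated data: genes in rows, samples in columns.
Y = rng.normal(scale=0.5, size=(n_genes, n_samples))
Y[:50] += 2.0 * group     # genes 0-49 respond to the group
Y[30:130] += 2.0 * batch  # genes 30-129 respond to the batch

# Remove the part of each gene explained by the primary variable.
X = np.column_stack([np.ones(n_samples), group])
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
residuals = Y - (X @ coef).T

# The leading pattern across samples in what is left estimates batch.
_, _, vt = np.linalg.svd(residuals, full_matrices=False)
batch_hat = vt[0]

corr = abs(np.corrcoef(batch_hat, batch)[0, 1])
print("absolute correlation with the true batch:", corr)
```

The sign of a singular vector is arbitrary, so the comparison uses the absolute correlation.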

There's an algorithm that's been developed for

doing this called surrogate variable analysis.

So imagine here you have some simulated data.

In the rows are your genomic measurements, and

in the columns are your samples.

And suppose that there's a difference due to a primary variable that you care about.

That's the difference between the first ten samples and

the last ten samples.

But then suppose there's a batch variable that's introducing some difference

in expression due to that batch. So the first thing that you could do is you could

come up with an estimate of the true batch variable.

You're going to do that by taking an original estimate of the batch

variable like this, which could be any estimate that you come up with

that is even remotely correlated with that batch variable.

And then you want to estimate what this true batch variable is.

So it goes high for two samples, low for

two samples, high for five samples, and so forth.

So now we're going to look at this indicator.

This is the indicator that each gene is not affected by the group variable you

care about but is affected by the batch variable.

One thing that you could do is use this estimate of batch in

a linear regression model to update these probability estimates.

This algorithm is described in this paper here.

Then you could weight the matrix and recalculate some decomposition,

say the principal components or singular value decomposition like we have,

and then update the estimate of the batch variable.

We can then use the estimate of the batch variable to update the probability

weights, update the data by weighting by those weights again, and

then re-update the batch variable.

So now, once you've done this iterative algorithm,

you've sort of removed the genes that are mostly driven by the group variable and

you're focusing on the genes driven by batch, so the decomposition gives

an estimate for batch which is pretty close to the real batch variable.

You can then include this as if it were a measured batch variable in

your statistical analysis, to adjust for that variable and remove batch.
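The loop described here can be sketched roughly as follows, assuming NumPy and simulated data. This is not the published algorithm: the probability weights are replaced by a heuristic (the variance explained by the batch estimate times one minus the variance explained by group), and the helper names are hypothetical.

```python
import numpy as np


def residualize(M, D):
    """Rows of M with the least-squares fit on design D removed."""
    coef, *_ = np.linalg.lstsq(D, M.T, rcond=None)
    return M - (D @ coef).T


def partial_r2(M, D_null, extra):
    """Per-row share of leftover variance explained by adding `extra`."""
    D_full = np.column_stack([D_null, extra])
    rss_null = (residualize(M, D_null) ** 2).sum(axis=1)
    rss_full = (residualize(M, D_full) ** 2).sum(axis=1)
    return 1.0 - rss_full / rss_null


def estimate_batch(Y, group, n_iter=5):
    ones = np.ones(Y.shape[1])
    group_design = np.column_stack([ones, group])
    # Starting estimate: top right singular vector of group-adjusted residuals.
    sv = np.linalg.svd(residualize(Y, group_design), full_matrices=False)[2][0]
    centered = Y - Y.mean(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Heuristic weight (an assumption): high for genes that follow the
        # batch estimate and low for genes that follow the group variable.
        r2_batch = partial_r2(Y, group_design, sv)
        r2_group = partial_r2(Y, np.column_stack([ones, sv]), group)
        weights = r2_batch * (1.0 - r2_group)
        # Weight the genes, redo the decomposition, update the estimate.
        sv = np.linalg.svd(centered * weights[:, None],
                           full_matrices=False)[2][0]
    return sv


rng = np.random.default_rng(3)
group = np.repeat([0.0, 1.0], 10)         # first ten vs last ten samples
batch = np.tile([1.0, 1.0, 0.0, 0.0], 5)  # treated as unmeasured
Y = rng.normal(scale=0.5, size=(200, 20))
Y[:50] += 2.0 * group     # made-up group effect
Y[30:130] += 2.0 * batch  # made-up batch effect

batch_hat = estimate_batch(Y, group)
corr = abs(np.corrcoef(batch_hat, batch)[0, 1])
print("absolute correlation with the true batch:", corr)
```

The returned vector can then be added as a column of the design matrix, in the same way as a measured batch variable.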

So, this is what you would do if you don't have the batch variable measured.

If you want an introduction to batch effects, this review paper is very good.

This paper that I've mentioned about adjusting for

batch effects with empirical Bayes is also a very good introduction to adjusting for

batch effects when they're known.

And then if you want to learn a lot more about surrogate variable analysis,

that last technique for estimating batch effects from the data itself,

you can check out this paper below.
