So the next thing that people do is fit a regression model to

model the effect of batch.

So this only works, again, if there's not a strong

correlation between the phenotype and the batch.

So here again, we're fitting a regression model where Y is the outcome,

the genomic measurement that we care about.

P is the phenotype that we care about, and

B is the batch variable that we measured in this data set.

So if P and B are the exact same thing, then these two variables are the same, and

this just merges into one term and you can't really adjust for it.

But if P and B are, say, uncorrelated with each other, or orthogonal to each other

because of good experimental design, this is straightforward to estimate.
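As a rough illustration of that point, here is a small Python sketch, assuming the model has an intercept, a phenotype term P, and a batch term B. The helper name `design_rank` and the toy 0/1 codings are made up for this example and are not from the lecture. When P and B are identical the design matrix loses a column's worth of information, and when they are balanced against each other it does not.

```python
import numpy as np

def design_rank(p, b):
    """Rank of the design matrix [intercept, phenotype, batch] (hypothetical helper)."""
    X = np.column_stack([np.ones(len(p)), p, b])
    return np.linalg.matrix_rank(X)

# Toy phenotype: four controls followed by four cases (assumed coding).
p = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fully confounded design: every case was run in one batch, every control in the other.
b_confounded = p.copy()

# Balanced design: each batch holds two controls and two cases.
b_balanced = np.array([0, 0, 1, 1, 0, 0, 1, 1])

rank_confounded = design_rank(p, b_confounded)  # 2: P and B merge into one term
rank_balanced = design_rank(p, b_balanced)      # 3: both effects can be estimated
corr_balanced = np.corrcoef(p, b_balanced)[0, 1]  # 0 for this balanced layout
```

A rank below the number of columns is the numerical version of "you can't really adjust for it".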

So again, we're going to fit many of these regression models,

so again we've stacked the data up with the genomic features in the rows and the samples in the

columns, and we want to relate that to some set of primary variables and

some set of adjustment variables, the batch variables.

We're going to fit that regression model over and over again.
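Here is a minimal sketch of what fitting the model "over and over again" can look like, assuming simulated data with genes in the rows and samples in the columns. The sizes, effect sizes, and variable names are all invented for illustration. The same design matrix is used for every gene, so one least-squares call fits all the per-gene regressions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 20

# Assumed design: phenotype splits samples 10/10, batch alternates sample by sample,
# so phenotype and batch are balanced against each other.
p = np.repeat([0.0, 1.0], 10)
b = np.tile([0.0, 1.0], 10)
X = np.column_stack([np.ones(n_samples), p, b])  # intercept, phenotype, batch

# Simulated truth (illustrative only): some genes respond to phenotype, some to batch.
true_coef = np.zeros((3, n_genes))
true_coef[1, :50] = 2.0
true_coef[2, 25:100] = 1.5
Y = (X @ true_coef).T + rng.normal(scale=0.5, size=(n_genes, n_samples))

# One call fits a separate regression for every gene (one column of Y.T per gene).
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # shape (3, n_genes)
phenotype_effect = coef[1]
batch_effect = coef[2]

# Subtract the fitted batch term from each gene to get batch-adjusted data.
Y_adjusted = Y - np.outer(batch_effect, b)
```

In practice you would usually keep the batch term in the model when testing the phenotype, rather than testing on the adjusted matrix, so the degrees of freedom stay honest.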

So you can actually do this a little bit more cleverly using an empirical

Bayes method.

So basically what this does is it shrinks down the estimates

towards their common mean.

And so this is actually a very highly cited paper with over 1000 citations for

adjusting batch effects.
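To make the shrinkage idea concrete, here is a minimal sketch. It is not the method from that paper. It assumes a simple normal model where every per-gene estimate has the same known sampling variance, and the function name `shrink_to_common_mean` is hypothetical. The prior variance is estimated from the spread of the estimates themselves, which is the "empirical" part.

```python
import numpy as np

def shrink_to_common_mean(estimates, sampling_var):
    """Shrink per-gene estimates toward their common mean (simplified sketch).

    Assumes estimate_g ~ Normal(theta_g, sampling_var) and
    theta_g ~ Normal(mu, prior_var), with mu and prior_var estimated from the data.
    """
    if sampling_var <= 0:
        raise ValueError("sampling_var must be positive")
    estimates = np.asarray(estimates, dtype=float)
    mu = estimates.mean()
    # Method-of-moments estimate of the prior variance, floored at zero.
    prior_var = max(estimates.var(ddof=1) - sampling_var, 0.0)
    # Weight on the raw estimate: 1 means no shrinkage, 0 means full shrinkage.
    weight = prior_var / (prior_var + sampling_var)
    return mu + weight * (estimates - mu)

raw = np.array([-1.0, 0.0, 1.0, 4.0])
shrunk = shrink_to_common_mean(raw, sampling_var=1.0)
```

Every shrunken value sits at least as close to the common mean as the raw value did, and the noisier the estimates are relative to their spread, the harder they get pulled in.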

And so another way that you can do this, though,

is if you basically don't know what the batch effects are.

So this is really common in genomics experiments where batch effects

could be due to a large number of things.

You end up with a model that looks like this.

You have the genes in the rows and the samples in the columns.

And here you might have some primary variables that you care about.

And then there's some random variation.

Now some of that random variation is due to sampling and usual measurement error,

but some of it might be due to batch effects or other things.

So what you might want to do is decompose this into the random

independent variation that you would expect when doing linear modeling,

and some kind of dependent variation,

which you can further break down into an estimated batch variable.

So the idea is basically, can you use the data itself

to estimate these terms over here? If so,

you could essentially estimate batch from the data itself and then adjust for it.
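One simple way to see that this is possible, as a sketch only: remove the primary variable with a regression, then take the leading singular vector of the residuals as a first guess at batch. This is a simplified stand-in and not the full published algorithm. The simulated data, sizes, and names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 500, 20

group = np.repeat([0.0, 1.0], 10)   # primary variable: first ten vs last ten samples
batch = np.tile([1.0, -1.0], 10)    # hidden batch, used only to simulate and to check

# Simulated data (illustrative): genes in rows, samples in columns.
Y = rng.normal(scale=0.5, size=(n_genes, n_samples))
Y[:100] += 2.0 * group              # genes affected by the primary variable
Y[50:300] += 1.5 * batch            # genes affected by the unmeasured batch

# Step 1: regress out the primary variable, gene by gene.
X = np.column_stack([np.ones(n_samples), group])
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
residuals = Y - (X @ coef).T

# Step 2: the top right singular vector is the dominant pattern across samples
# that the primary variable does not explain.
_, _, vt = np.linalg.svd(residuals, full_matrices=False)
estimated_batch = vt[0]

# The sign of a singular vector is arbitrary, so compare with absolute correlation.
corr = abs(np.corrcoef(estimated_batch, batch)[0, 1])
```

The estimated vector can then go into the regression as an adjustment variable, the same way a measured batch variable would.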

And so there's an algorithm that's been developed for

doing this called Surrogate Variable Analysis.

So imagine here you have some simulated data.

And so, in the rows are your genomic measurements, and

in the columns are your samples.

And suppose that there's a difference due to a primary variable that you care about.

And so that's the difference between the first ten samples and

the last ten samples.

But then suppose there's a batch variable that's introducing some difference

in expression due to that batch. So the first thing that you could do is you could

come up with an estimate of the true batch variable.

So you're going to do that by taking an original estimate of the batch

variable like this, which could be any estimate that you come up with

that is even remotely correlated with that batch variable.

And then you want to estimate what this true batch variable is.

So it goes high for two samples, low for

two samples, high for five samples, and so forth.

So now we're going to look at this indicator.

This is the indicator that each gene is not affected by the group variable you

care about but is affected by the batch variable.

So one thing that you could do is you could use this estimate of batch in

a linear regression model to update these probability estimates.

So this algorithm is described in this paper here.