0:00
A unique thing about regression modeling in genomics is that you
often fit many regression models simultaneously.
The reason is that you usually have many measurements, and
you want to test each of those measurements for association with
an outcome that you care about.
So here, for example, is the typical genomics data set.
You have a large number of features in the rows.
So that could be tens of thousands or millions of features,
whether they're SNPs, methylation measurements at CpG sites,
gene expression levels, or transcript expression levels.
And then you have some varying conditions.
And usually you have some kind of phenotype like case-control status.
You would like to associate each feature with that case-control status, and
you would like to discover the features that are differentially expressed or
differentially associated with those different conditions.
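To make that concrete, here's a minimal sketch of fitting one regression per feature. This is not code from the lecture; the library choice, data, names, and sizes are made-up assumptions, just to show the pattern of looping over features with a shared design.

```python
# A minimal sketch (assumed setup, not the lecture's own data or code):
# 1000 features measured on 40 samples, with a binary case/control label.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_features, n_samples = 1000, 40
expr = rng.normal(size=(n_features, n_samples))      # features x samples
case_control = rng.integers(0, 2, size=n_samples)    # 0 = control, 1 = case

X = sm.add_constant(case_control.astype(float))      # intercept + group indicator

results = []
for i in range(n_features):
    fit = sm.OLS(expr[i], X).fit()                   # one regression per feature
    results.append((fit.params[1], fit.pvalues[1]))  # group effect and p-value

effects, pvalues = map(np.array, zip(*results))       # one estimate per feature
```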
So to do this,
you usually end up with a matrix formulation of this same regression model.
So you can imagine that, for every single row of this matrix,
you'll fit a regression model where some coefficients are multiplied by
a design matrix containing the variables that you care about,
plus a corresponding error term for just that gene.
And then you would stack a bunch of these up.
So this is a bunch of stacked regressions.
I'm showing it here in mathematical notation on the bottom.
You can use matrix multiplication to write down these many multiple regressions compactly.
And then I'm showing it in block format up here.
So you model the data for this gene as these coefficients multiplied by
these variables, plus this error term right here.
And you do this for every single feature that you're modeling.
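As a rough sketch of what that notation looks like (the exact symbols on the slide aren't shown in the transcript, so the names and dimensions here are assumptions), with m features measured on n samples the feature-level models and their stacked matrix form can be written as:

```latex
% Assumed notation: Y is the m x n data matrix (features in rows, samples in
% columns), X is the shared n x p design matrix, B stacks the coefficient
% rows b_i, and E stacks the error rows e_i.
\[
  y_i^{\top} = b_i^{\top} X^{\top} + e_i^{\top}, \qquad i = 1, \dots, m,
\]
\[
  \underbrace{Y}_{m \times n}
  = \underbrace{B}_{m \times p}\,\underbrace{X^{\top}}_{p \times n}
  + \underbrace{E}_{m \times n}.
\]
```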
So here's this example where you are looking at gene expression signatures
associated with geography in a particular population in Morocco.
And so there's a primary biological variable that you care about,
which in this case is where the people actually come from.
What's the geography that they come from?
And then you might have a bunch of adjustment variables.
Are they male or female?
What batch do they come from?
And all sorts of other variables that you might have.
And so the model actually becomes a little bit more difficult in a case like this,
because there are all sorts of variables that you obviously need to
model, like the location the people come from, their sex, and the batch.
There are also much more subtle effects,
say intensity-dependent effects in the measurements from the genomic data,
or dye effects or probe composition effects, since this is a microarray.
And then many other unknown variables that you might want to model.
So when you do this, you actually end up with a slightly more complicated model.
Again, this is the colored-block version of the model.
And so again, you might model the measurements for one gene, the ones in one
row, as a function of the coefficients in one row times the set of variables that
you actually care about, which in this case might be geography,
plus the coefficients in one row for a set of adjustment variables,
plus the random variation for that one row.
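A rough written-out version of that adjusted model, again with assumed symbols rather than the exact notation from the slide, might look like this:

```latex
% Assumed notation: Y is the m x n data matrix, P holds the primary variables
% (e.g. geography), A holds the adjustment variables (e.g. sex, batch),
% B and Gamma stack the corresponding coefficient rows, and E the error rows.
\[
  y_i^{\top} = b_i^{\top} P^{\top} + \gamma_i^{\top} A^{\top} + e_i^{\top},
  \qquad i = 1, \dots, m,
\]
\[
  Y = B\,P^{\top} + \Gamma\,A^{\top} + E.
\]
```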
So now you've got a setting where you're fitting many, many regression models.
You fit each one exactly the same way as you'd fit a single regression model, but
now you have to interpret them jointly.
And so there are a couple of different things that are difficult.
One is that you have hundreds, thousands, or millions of model fits
at the same time, and for each one you have estimates of the coefficients, the residuals,
and the fitted values, and there can be structure in any of those things.
There can be structure in the estimates, there can be structure in the noise, and
there are all sorts of issues that may be due to different values of the covariates
and to different unmeasured confounders.
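One way to see that kind of structure, purely as an illustrative sketch and not something shown in the lecture, is to fit all the per-feature models at once and then look at the residual matrix, for example with a singular value decomposition. The data, sizes, and design here are made-up assumptions.

```python
# A rough diagnostic sketch (assumed setup, not the lecture's own code):
# check whether structure remains in the residuals after fitting the
# per-feature models. Strong leading components can hint at batch effects
# or other unmeasured confounders.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 40                                    # features x samples (made-up sizes)
expr = rng.normal(size=(m, n))                     # data matrix
group = rng.integers(0, 2, size=n).astype(float)   # primary variable
X = np.column_stack([np.ones(n), group])           # design: intercept + group

H = X @ np.linalg.pinv(X)                          # hat matrix (n x n)
residuals = expr - expr @ H.T                      # residuals for all features at once

centered = residuals - residuals.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
variance_explained = s**2 / np.sum(s**2)           # structure carried by each component
print(variance_explained[:5])
```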