An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

So batch effects can either be a technological artifact like

we saw in the previous lecture, or they could even be biological confounders.

So one place that that comes up often is in genetics.

So I'm going to, again,

set up my graphical parameters like I usually do in all these videos.

And then I'm going to load the libraries that we need. In particular, the snpStats

package is the one that we're going to be talking about a lot here.

And so, now I'm going to look at some genetic data.

And so, I'm going to load that data from the snpStats package.

I'm going to look at just the data from a set of controls here.

So I'm going to subset down to just the controls.

And so this is basically a set of data that

consists of controls, from different populations here.

So we want to see

if there's a relationship between population and this genetic data.

And so just to make the computations go faster,

I'm going to take just a subset of these data:

I'm going to take every tenth value.

And so now I'm going to take a subset that is both controls only, so

they don't have any disease association,

and every tenth value only, so that it's not too big.

So the dataset is a manageable size.
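The two filters described above can be sketched in a few lines. This is only an illustration on made-up data, not the lecture's own code: the toy `genotypes` matrix, the `case_control` labels, and the assumption that "every tenth value" means every tenth column are all hypothetical.

```python
# Hypothetical toy data: rows are subjects, columns are genetic markers
# coded 0/1/2. None of these values come from the lecture's dataset.
genotypes = [[(row + col) % 3 for col in range(25)] for row in range(6)]

# Hypothetical case/control labels: 0 = control, 1 = case.
case_control = [0, 1, 0, 0, 1, 0]

# Keep every tenth column (0, 10, 20, ...) to cut the computation down.
marker_idx = list(range(0, len(genotypes[0]), 10))

# Keep only the rows that belong to control subjects.
control_idx = [i for i, cc in enumerate(case_control) if cc == 0]

# Apply both filters at once: controls only, every tenth column only.
subset = [[genotypes[i][j] for j in marker_idx] for i in control_idx]

print(len(subset), "subjects x", len(subset[0]), "markers")
```

Thinning the columns this way trades some resolution for speed, which is the reason given in the lecture for doing it.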

So the first thing that I can do is I'm basically going to calculate

the principal components, and you can do that for genetic data,

using some commands from the snpStats package.

So the first thing that you need to do is calculate

a particular linear algebra multiplication.

And so you can do that with the command xxt.

So this is an intermediate calculation for the PCA, and we're not going to

worry too much about the mathematical details.

But then the next step is to do the eigendecomposition of that calculation.

And then, the PCs are going to be the eigenvectors that come out of that.

So I've got the eigenvectors here, I'm going to take the first five.

Those are the ones we're going to look at.
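The two steps just described (form X times its transpose, then take the leading eigenvectors) can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the package's implementation: the helper names `xxt` and `top_eigenvectors` are hypothetical, the genotypes are made up, plain column centering stands in for whatever standardization the package applies, and power iteration is swapped in for a full eigendecomposition routine. The toy matrix has low rank, so only two components are extracted instead of five.

```python
import math
import random

def xxt(rows):
    """Center each column and return the subject-by-subject matrix X X^T."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in x] for ri in x]

def top_eigenvectors(mat, k, iters=500):
    """Leading k eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(mat)
    work = [row[:] for row in mat]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    vectors = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(m * x for m, x in zip(row, v)) for row in work]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue that goes with v.
        lam = sum(vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, work))
        vectors.append(v)
        # Deflate: remove the component just found before looking for the next one.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return vectors

# Hypothetical genotypes: the first three subjects and the last three subjects
# come from two made-up populations with very different allele patterns.
genotypes = [
    [2, 2, 2, 2, 0, 0, 0, 0],
    [2, 2, 1, 2, 0, 0, 1, 0],
    [2, 1, 2, 2, 0, 1, 0, 0],
    [0, 0, 0, 0, 2, 2, 2, 2],
    [0, 0, 1, 0, 2, 2, 1, 2],
    [0, 1, 0, 0, 2, 1, 2, 2],
]

pcs = top_eigenvectors(xxt(genotypes), k=2)
pc1 = pcs[0]
print([round(x, 3) for x in pc1])
```

In this toy example the first eigenvector takes one sign for the first three subjects and the opposite sign for the last three, which is the kind of population separation the lecture goes on to plot. For real data a numerical library's symmetric eigensolver would be the usual choice.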

So now I have this PCs matrix.

And so each column of this is one of the PCs that we might want to look at.

So the next thing that I'm going to do is look at what population they come from.

So I'm going to take the data from the subjects and I'm going to subset to just

the controls, and I'm going to take out the stratum variable, and that's going to

be the population stratification variable that we're going to look at.

So I'm going to take that dataset, I'm going to take only the control samples and

I'm going to take out the stratum variable.

And so, I have this pop variable, which tells me which population they come from.

And so then, what I can do is I can plot the first principal component

versus the second principal component.

And I can color it by population, and then, we can see

if there's any relationship between population and these principal components.

And so in this case, there is a strong relationship between PC1 and

the populations.

Remember, the population is the color here.

And so, I can add a legend to the plot to make it a little easier to see.

So you can see that the European samples over here are very different from

the Asian samples on the first principal component.
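A numeric counterpart to the colored plot is to average the first principal component within each population. The sketch below assumes made-up PC1 scores and uses the labels "European" and "Asian" only because the lecture refers to the groups that way. These are not values from the actual dataset.

```python
from collections import defaultdict

# Hypothetical PC1 scores and population labels, one entry per subject.
pc1 = [0.41, 0.38, 0.44, -0.40, -0.43, -0.39]
pop = ["European", "European", "European", "Asian", "Asian", "Asian"]

# Group the PC1 scores by population label.
by_pop = defaultdict(list)
for score, label in zip(pc1, pop):
    by_pop[label].append(score)

# A large gap between the group means indicates that PC1 tracks population.
mean_pc1 = {label: sum(scores) / len(scores) for label, scores in by_pop.items()}

for label, value in mean_pc1.items():
    print(f"{label}: mean PC1 = {value:.3f}")
```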

So this is often the case in genetic data,

that a major effect in the genetic data is due to the population.

Now, if the disease that you're interested in is also associated with population,

it could be a confounder.

So the typical way that people deal with this is by using just direct

principal component analysis, and PCs for the adjustment.
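To see what adjusting for a PC does, here is a small sketch on made-up numbers. Everything in it is an assumption for illustration: the variable `pc1` is a stand-in for a population-tracking principal component, the phenotype is built to depend on population only, and the adjustment is done by residualizing both genotype and phenotype on `pc1`, which gives the same genotype coefficient as including `pc1` as a covariate in the regression.

```python
def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(values, covariate):
    """Remove the linear effect of the covariate (and the mean) from values."""
    b = slope(covariate, values)
    mv, mc = mean(values), mean(covariate)
    return [v - mv - b * (c - mc) for v, c in zip(values, covariate)]

# Hypothetical data for eight subjects from two populations.
pc1 = [-1, -1, -1, -1, 1, 1, 1, 1]    # stand-in for a population-tracking PC
genotype = [0, 0, 1, 1, 1, 1, 2, 2]   # allele counts differ by population
phenotype = [0, 1, 0, 1, 2, 3, 2, 3]  # depends on population, not on genotype

# Unadjusted: genotype looks associated with phenotype purely through population.
naive = slope(genotype, phenotype)

# Adjusted: after removing pc1 from both variables, the association disappears.
adjusted = slope(residualize(genotype, pc1), residualize(phenotype, pc1))

print(f"naive slope = {naive:.2f}, PC-adjusted slope = {adjusted:.2f}")
```

In this constructed example the naive slope is 1 and the adjusted slope is 0, because the apparent genotype effect was entirely due to population.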

The reason is that the signal is often weak enough in genetic association

studies that it's not likely to be captured by the PCs.

And so you can just remove the PCs without being too worried about

removing the