An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

By its nature, genomics data is usually very high dimensional, and so

you want to reduce that dimension when visualizing or modeling the data.

So here I'm going to do the typical set up steps to get my plotting

parameters like I like them and to load the libraries that we'll need.

In this case, it's mostly the base packages that we're going to be using, and

then I'm going to load in a data set here from this URL.

This is actually a combination of two data sets, from the Montgomery and

Pickrell papers.

And so they are actually two different populations measured in two different

labs and that'll be useful for this lecture.

And so I'm going to load the data in and I'm going to, again, extract out

the phenotype data, the expression data, and the feature data for this data set, so

that we can use that data to make some plots and to do some dimension reduction.

So again, just to make it a little bit easier to visualize,

I'm going to filter out all the rows where

the row mean is less than 100, so I reduce the size of the data set.

And then I'm going to apply the log transform so

that it will be on a scale that's a little easier to work with.
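The lecture runs this step in R. As a rough sketch of the same idea in Python with NumPy, on simulated counts rather than the Montgomery and Pickrell data, the filter and transform might look like the following. The names `counts`, `keep`, and `edata` are hypothetical, and the `+ 1` offset inside the log is an assumption made here to avoid taking the log of zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an expression matrix: 1000 genes (rows) x 20 samples (columns).
gene_level = rng.uniform(0, 300, size=(1000, 1))
counts = rng.poisson(lam=gene_level, size=(1000, 20)).astype(float)

# Keep only the rows (genes) whose mean count is greater than 100.
keep = counts.mean(axis=1) > 100

# Log transform the remaining rows; the +1 offset is an assumption to avoid log2(0).
edata = np.log2(counts[keep] + 1)

print(counts.shape, "->", edata.shape)
```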

So the next thing that I'm going to do is I'm going to actually center the data,

because when we're doing the singular value decomposition,

if you don't center the data, if you don't remove the row means or

the column means of the data set,

then the first singular vector will always be the mean level,

since that will always explain the most variation in a genomic experiment.

And we actually want to see variation between samples or between genes,

so we're going to remove that mean variation and

look at the variation that differs between genes.

And so once I've got that centered data set,

I can apply the svd function to calculate the singular value decomposition.
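To illustrate why the centering matters, here is a minimal sketch in Python with NumPy (the lecture itself uses R's `svd`). The matrix is simulated with a large common mean, which is an assumption chosen to mimic log-scale expression values. Without centering, the first right singular vector is nearly flat across samples and carries almost all of the variation; after subtracting the row means, that mean level is gone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical genes x samples matrix with a large overall mean level.
edata = rng.normal(loc=8.0, scale=1.0, size=(500, 12))

# SVD without centering: the first component is essentially the mean level.
_, d_raw, vt_raw = np.linalg.svd(edata, full_matrices=False)
share_raw = d_raw[0] ** 2 / np.sum(d_raw ** 2)

# Center by subtracting each row (gene) mean, then decompose again.
edata_centered = edata - edata.mean(axis=1, keepdims=True)
u, d, vt = np.linalg.svd(edata_centered, full_matrices=False)
share_centered = d[0] ** 2 / np.sum(d ** 2)

print("first component share, uncentered:", share_raw)
print("first component share, row centered:", share_centered)
```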

So this singular value decomposition has three parts to it.

These three matrices d, u, and v.

So d is the diagonal matrix, and so

it just returns the diagonal elements of that matrix for you.

So there's, in this case, the data set that we're dealing with

has 129 columns, so there's 129 singular values.

And then the other components, the v and u components, are matrices.

Each column of v has 129 values, one for each sample,

and each column of u has one value for each gene.

So each column of v is a pattern over the samples, and each column of u is a pattern over the genes.
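The shapes of the three pieces can be checked directly. This is a sketch in Python with NumPy on a simulated matrix with 129 columns; the 300 rows are an arbitrary assumption. Note that `numpy.linalg.svd` returns the transpose of v, whereas R's `svd` returns v itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical row-centered matrix: 300 genes x 129 samples.
x = rng.normal(size=(300, 129))
x = x - x.mean(axis=1, keepdims=True)

u, d, vt = np.linalg.svd(x, full_matrices=False)
v = vt.T  # NumPy returns v transposed; transpose back to match R's layout

# shapes: d (129,), u (300, 129), v (129, 129)
print(d.shape, u.shape, v.shape)

# The three parts multiply back to the original matrix.
reconstructed = u @ np.diag(d) @ v.T
```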

And so the first thing that we might want to do is plot the singular values.

And so we're going to plot the d values and this would be the singular values.

And I'm going to make those in blue.

So here I can see those singular values plotted versus their index, so

they're ordered from the biggest to the smallest.

And so then the next thing I want to do is plot the variance explained, and so

to do that remember that I have to calculate each singular

value squared divided by the sum of the singular values squared.

And so once I've done that, I've calculated the variance explained.

And so, I can plot that, again in blue, on the same kind of plot, and

I can see that the first singular value explains more than 50% of the variance.

So it's a highly explanatory variable.
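The variance explained calculation described above is short enough to write out. This is a small sketch in Python with NumPy; the function name `variance_explained` and the toy singular values are hypothetical, not values from the lecture's data set.

```python
import numpy as np

def variance_explained(d):
    """Each singular value squared, divided by the sum of all squared singular values."""
    d = np.asarray(d, dtype=float)
    return d ** 2 / np.sum(d ** 2)

# Toy singular values: squares are 36, 9, 4, which sum to 49.
d = np.array([6.0, 3.0, 2.0])
pve = variance_explained(d)
print(pve)
```

With these toy values the first component accounts for 36/49 of the variance, so it would also count as explaining more than half.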

So then I can make a plot of that and see what could that variable be?

Again, I'm going to make a plot that's two panels, so

I use par(mfrow=c(1,2)).

And then I'm going to plot the first two eigengenes, or

right singular vectors, or principal components.

You'll see in a minute that they're not exactly the principal components but

people use them sort of interchangeably.

So I plotted that first principal component and then I plot the second one.

And so the first thing that people often do is they might want

to color these by different variables to see if there's something going on.

To do that, it's very common to plot the first right singular vector versus the second

right singular vector.

So here I'm going to set it up so that there's a one-by-one plot again.

And so if I make that plot, I can see there's this pattern here, and

the thing that people often do is they make this plot,

only they color it by a particular variable.

So in this case, I'm going to color the PCs by what study they come from,

so here I'm setting the color to be the numeric version of the study variable.

And so I remove the color from the previous plot, and so you can see here, if

you look in the PC1 axis, the two studies have very different values of the PC.

So it seems that one of the big sources of signal in the data set is which study

the samples come from.

You can see this another way: people often make a box plot of

that first principal component versus the study variable, because that's the one that separates

the two studies.

And then it's always a good idea to show as many of the data points as possible so

you can overlay the data points on top of the box plot by

plotting the same singular vector versus a jittered version of the study variable.

And coloring it by the study variable, you can see that there's a big difference in

that first principal component between the Montgomery and the Pickrell studies.
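The same comparison can be made numerically instead of with a plot. The sketch below, in Python with NumPy, simulates two studies with a shift in a subset of genes; the sizes, the shift, and the names `study` and `pc1` are all assumptions for illustration and do not come from the Montgomery and Pickrell data.

```python
import numpy as np

rng = np.random.default_rng(3)

n_genes, n_per_study = 400, 10
study = np.repeat([0, 1], n_per_study)  # hypothetical study label for each sample

# Simulated genes x samples matrix with a study effect on the first 100 genes.
edata = rng.normal(size=(n_genes, 2 * n_per_study))
edata[:100, study == 1] += 3.0

# Row center, decompose, and take the first right singular vector.
centered = edata - edata.mean(axis=1, keepdims=True)
u, d, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# Summarize the first singular vector within each study, as a box plot would.
mean0 = pc1[study == 0].mean()
mean1 = pc1[study == 1].mean()
print("study 0 mean:", mean0, "study 1 mean:", mean1)
```

The sign of a singular vector is arbitrary, so which study ends up on the positive side carries no meaning; only the separation does.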

So, that's how you do the singular value decomposition.

So, to do the principal components you can use the prcomp function and

apply it to the same data set.

And so, even though I've been sort of using the two terms interchangeably they

are not quite the same thing.

So if I plot the first principal component versus the first

singular vector they're not the same thing, and

that's because I haven't actually centered them in the same way.

So it turns out that if you center the data differently, so now I'm

doing a second set of centering, but here what I'm doing is I'm actually subtracting

the column means rather than subtracting the row means.

Then I have a data set that's centered by column instead of centered by row,

and so then I can calculate the singular value decomposition on that.

And when I do that and then I plot the first principal

component versus the first singular vector from the column-centered data,

I actually get that they're identical to each other.

And so basically what's going on is that if you column center the data then do SVD,

you get exactly the principal components because the principal components

are calculating something about the variability between the columns when

they're doing that.

And so you can get PCs and

SVDs that actually compute the exact same thing if you do the centering right.
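This equivalence can be checked without R's `prcomp`. In the sketch below, written in Python with NumPy, principal components are computed from the eigenvectors of the covariance matrix between columns, which is one standard way to define them, and then compared with the SVD of the column-centered matrix. The data are simulated, and the match is only up to sign, since the sign of each vector is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical matrix with correlated columns: 200 rows x 6 columns.
x = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

# SVD of the column-centered matrix.
col_centered = x - x.mean(axis=0, keepdims=True)
_, d, vt = np.linalg.svd(col_centered, full_matrices=False)
sv1 = vt[0]

# PCA via the eigen-decomposition of the covariance between columns.
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order; sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1_loadings = eigvecs[:, 0]

# Both are unit vectors, so an absolute dot product of 1 means same direction up to sign.
print("column centered:", abs(pc1_loadings @ sv1))

# For comparison, the row-centered SVD generally gives a different first vector.
row_centered = x - x.mean(axis=1, keepdims=True)
_, _, vt_row = np.linalg.svd(row_centered, full_matrices=False)
print("row centered:", abs(pc1_loadings @ vt_row[0]))
```

The eigenvalues of the covariance matrix also line up with the squared singular values divided by the number of rows minus one.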

One thing to keep in mind is that outliers can really drive these decompositions,

so to illustrate that I'm going to just take our edata centered,

I'm going to assign it to the new variable edata outlier.

And then I'm going to make one of those values really outlying, so

I'm going to take the sixth gene and I'm going to multiply it by 10,000.

So this is now a very outlying gene with very high values.

So now I'm going to apply the SVD to the outlying dataset and

if I plot the original version of this decomposition where I did this

SVD on the dataset without this outlier versus the dataset with the outlier,

so then I can see that they don't match each other anymore.

You can see that the two data sets don't necessarily match

in terms of their singular value decomposition, but

you can definitely see that the singular vector for

the decomposition with the outlier reflects that outlier quite accurately.

And so if you plot the first singular vector from this

new decomposition with the outlier versus the outlying value itself you can see that

they're very highly correlated with each other.

So what's happening is the decomposition is looking for patterns of variation. Well,

if one gene is way more highly expressed than the other ones, then it's going to drive

most of the variation in the data set and so it'll be very correlated with it.
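The outlier experiment is easy to reproduce on simulated data. This sketch uses Python with NumPy rather than the lecture's R session, and the matrix sizes are assumptions. The sixth gene is row index 5 because Python counts from zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical row-centered genes x samples matrix.
edata = rng.normal(size=(300, 15))
centered = edata - edata.mean(axis=1, keepdims=True)

# Copy the data and inflate the sixth gene (index 5) by a factor of 10,000.
outlier = centered.copy()
outlier[5, :] *= 10000

_, _, vt = np.linalg.svd(centered, full_matrices=False)
_, _, vt_out = np.linalg.svd(outlier, full_matrices=False)

# Correlation of the new first singular vector with the outlying gene itself,
# and with the first singular vector from the data without the outlier.
corr_with_gene = np.corrcoef(vt_out[0], outlier[5, :])[0, 1]
corr_with_original = np.corrcoef(vt_out[0], vt[0])[0, 1]

print("vs outlying gene:", corr_with_gene)
print("vs original first vector:", corr_with_original)
```

The sign of the correlation with the outlying gene is arbitrary; it is the magnitude, which should be very close to 1, that shows the single gene driving the decomposition.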

So you have to be careful when using these decompositions to make sure that you

pick the centering and scaling so that all of the different measurements for

all of the different features are on a common scale.