An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

So one of the things that's very common across all types of genomics and

genetic measurements is that you often make a lot of measurements per sample.

And so when you have a huge number of measurements, say 20,000 or a 100,000 or

a million measurements per sample, you want to be able to visualize and

communicate patterns and identify relationships.

And the best way to do that is to reduce the dimension, or

reduce the number of measurements that you're looking at at any given time.

So here I'm going to illustrate this with a really simple example.

I've generated a data set here, this is a simulated data set with 40 different

measurements for ten different samples.

And you can see that there appear to be two clusters that are driven by

measurements in these rows that are very different from each other.

So one way that you could reduce the dimension of this is you could just take

the averages of the rows and of the columns.

So here I've taken the row average, so I take every single row and

I calculate its average and plot it in this graph.

So there's the average for row one and the average for row two and so forth.

So you can see because there's this block pattern here, there's a difference in

the overall average for that row compared to rows that don't have that pattern.

Similarly, if I take the average for this column and plot it here,

and the average for this column and I plot it here, and so forth.

I can see, again, the difference in the patterns between these and these by

looking at the difference in the pattern between these two groups of points here.
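As a rough sketch of the averaging idea, the snippet below simulates a 40 by 10 matrix and takes row and column averages. The block location, the shift of 3, and the random seed are assumptions made for illustration, not the values behind the lecture's figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed so the run is repeatable

# 40 measurements (rows) for 10 samples (columns), starting from pure noise
n_rows, n_cols = 40, 10
X = rng.normal(size=(n_rows, n_cols))

# Assumed block pattern: the first 20 rows are shifted up in the last 5 samples
X[:20, 5:] += 3.0

row_means = X.mean(axis=1)  # one average per row (length 40)
col_means = X.mean(axis=0)  # one average per column (length 10)

print("average of row means, patterned rows:", round(row_means[:20].mean(), 2))
print("average of row means, other rows:    ", round(row_means[20:].mean(), 2))
print("column means:", np.round(col_means, 2))
```

The patterned rows and the last five columns should stand out in the averages, which is the one-dimensional summary the plots show.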

Now this works really well when there's only one pattern,

and all the effects go in the same direction.

But in general it's not that easy.

So you might want to think about different ways of doing this.

And so there's two related problems here.

Imagine you have this data matrix X like the one we saw in the previous example.

And you want to find a new set of multivariate variables that

are uncorrelated with each other and that explain as much of the variability

across those rows or across the columns as possible.

A related mathematical idea is to find the best matrix

that's an approximation to the original matrix that's lower rank or

has fewer variables and that explains the original data.

These two goals are a little bit different.

One is statistical, one is data compression or mathematical, but

it turns out that they can have very similar solutions.

So here is the solution, it's a singular value decomposition.

Imagine that you have a data matrix here, so here imagine you have the genes or

features, SNPs, in the rows, and

then you have the samples over here in the columns of the matrix.

Then you can decompose it into three matrices, U, D, and V transpose.

So those three matrices comprise the left singular vectors,

the singular values and the right singular vectors in the matrix.
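A minimal sketch of the decomposition, using NumPy's `np.linalg.svd` on a simulated matrix (the data and seed are assumptions for illustration). It checks that multiplying U, D, and V transpose back together recovers the original matrix.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed

# Simulated data: 40 features in the rows, 10 samples in the columns
X = rng.normal(size=(40, 10))
X[:20, 5:] += 3.0  # assumed block pattern

# full_matrices=False gives the "thin" SVD:
# U is 40 x 10, d holds the 10 diagonal entries of D, Vt is V transpose (10 x 10)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, d.shape, Vt.shape)

# U @ D @ V^T reproduces X up to floating-point rounding
X_rebuilt = U @ np.diag(d) @ Vt
print("reconstruction matches:", np.allclose(X, X_rebuilt))
```

NumPy returns the singular values as a vector sorted from largest to smallest, so `np.diag(d)` is needed to form the diagonal matrix D.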

So in the left singular vectors we see patterns that exist across the different

rows of the data sets.

So this is analogous to taking the row means that we

talked about in the previous slide in the sense that

it's trying to identify patterns across the rows.

The D matrix tells you how much of the variation each of the patterns in the U matrix

explains.

So it's a diagonal matrix, so there's only elements along these diagonals and those

elements quantify how much of the variance is explained by the different patterns.

Now the rows of V transpose (the columns of V) tell you something about

the patterns across the columns, like the column patterns that we saw a minute ago.

So if I took the column means, this would be analogous to looking at the column

means, in the sense that it's looking for patterns across samples that are shared by multiple rows.

So there are a couple of mathematical properties of these.

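They are calculated one at a time, and they are orthogonal to each other.

That means their dot products are zero, which is the sense in which they are uncorrelated with each other,

both the columns of V and the columns of U.

Each column of V (a row of V transpose) describes a pattern across the arrays that is shared by many genes, and

each column of U describes a pattern across the genes that is shared by many arrays.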
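The orthogonality property can be checked numerically. This is a small sketch on random data (an assumption for illustration): the columns of U and the columns of V each have unit length and zero dot product with one another, so U transpose times U and V transpose times V are identity matrices.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary fixed seed
X = rng.normal(size=(40, 10))   # 40 features by 10 samples, simulated

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Orthonormal columns: every pairwise dot product is 0, every squared length is 1
utu = U.T @ U
vtv = V.T @ V

print("U^T U is the identity:", np.allclose(utu, np.eye(10)))
print("V^T V is the identity:", np.allclose(vtv, np.eye(10)))
```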

So the other thing to look at is, it's a little hard to read here, but you see that

D sub I, which is the diagonal element, the ith diagonal element.

So D sub one is that element, and D sub two is this element, and so forth.

If you take D sub i squared and divide it by the sum of all the squared D values,

you get the percentage of variance explained by the ith column of V.
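Written out, the percentage of variance explained uses the sum of all the squared diagonal elements in the denominator. The symbols d_i for the ith diagonal element of D and n for the number of diagonal elements are notation assumed here, and reading the squared singular values as variance assumes the data have been centered.

```latex
X = U D V^{T}, \qquad
\text{percent variance explained by pattern } i
  \;=\; 100 \times \frac{d_i^{2}}{\sum_{j=1}^{n} d_j^{2}}
```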

So let's illustrate how that looks.

Remember our example here, where we have, say, 40 genes in the

rows and two groups of samples, and so what we're looking for is patterns.

If we did the row means and the column means, we saw those patterns come out.

If you look at the singular value decomposition of this matrix you see

something similar emerge.

So, the first left singular vector, that's the first column of U.

It turns out to look much the same as the row means.

You see that there's a pattern for these rows and a pattern for

these rows that corresponds to the pattern in this matrix.

Similarly, if you look at the pattern that exists across the different

columns of this matrix, you see that there's a difference between these columns

and these columns, and that appears in the first right singular vector,

the first eigengene, or the first principal component.

And there's a big difference, again, between the two groups.

So it's not exactly the same as taking the means, that's important, and

it will become more important as you see later in the lecture.

But right now it does show that you can sort of pull out

low dimensional patterns from high dimensional data using this decomposition.
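One way to see the low dimensional pattern is to rebuild the matrix from only the first left singular vector, the first singular value, and the first right singular vector. The sketch below does that on simulated data (the data and seed are assumptions) and checks that the squared error of this rank-one approximation equals the sum of the squared singular values that were dropped.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary fixed seed
X = rng.normal(size=(40, 10))
X[:20, 5:] += 3.0               # assumed block pattern

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-one approximation built from the first pattern only
X1 = d[0] * np.outer(U[:, 0], Vt[0, :])

# Squared reconstruction error versus the discarded squared singular values
err = np.sum((X - X1) ** 2)
dropped = np.sum(d[1:] ** 2)
print("squared error:", round(err, 4))
print("sum of dropped d^2:", round(dropped, 4))
```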

Now if I plot the D values,

the values along the diagonal of that matrix that's been calculated.

I can see that there's one large D value, and

then the rest of the values sort of tail off.

So this is even more clear if you basically take each D value, square it and

divide by the sum of all the squared D values.

Then you get the percentage of variance explained.

And so nearly 40% of the variance in that matrix is explained by that first pattern,

which isn't surprising because if you look at the pattern, it looks like about 40 or

50% of the rows have a strong pattern in them.

So this becomes even more clear if you make the matrix really simple.

So, for example, suppose that every single row is exactly the same.

And there's a high value, it's constant, then a low value, it's constant.

In this case there's only one pattern in the data set.

There's no random variation at all and so the first singular value is very,

very high and the rest of the singular values are essentially zero.

If you calculate the percentage of variance explained then the first pattern

explains all of the variance in the matrix.

Which makes sense because there's only one pattern in the matrix.
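The constant-row case is easy to reproduce. In this sketch every row is low for the first five samples and high for the last five; the specific values 0 and 1 are assumptions for illustration.

```python
import numpy as np

# One row: low (0) for the first five samples, high (1) for the last five
row = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Stack 40 identical copies: no random variation at all
X = np.tile(row, (40, 1))

# compute_uv=False returns only the singular values
d = np.linalg.svd(X, compute_uv=False)
pct_var = 100 * d**2 / np.sum(d**2)

print("singular values:", np.round(d, 6))
print("percent variance explained:", np.round(pct_var, 2))
```

Because the matrix has rank one, only the first singular value is nonzero, so the first pattern accounts for all of the variation.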

So, the other thing that you can look at is

what happens if you have multiple patterns.

And so, for example here we're going to generate a new matrix.

And so this matrix, again has a small number, let's say 40 or so

rows and ten or so columns.

And the idea here is you're looking for two different patterns.

So first there's a pattern that's high for the last five samples and low for

the first five samples.

And then there is a pattern that oscillates from low to high, low to high.

So you can see that in the data set here too.

Here, you can see a block of samples that have high values.

And here you can see this oscillating pattern.

Low, high, low, high, low, high.

So in a matrix like this, when you do the singular value decomposition,

you'd like to identify more than one pattern.

And so it turns out, if you take the first two right singular vectors, you do get

two different patterns, but they're not quite what you would hope they would be.

So we generated the data set with a pattern that was low for the first five

samples and then high for the next five, and then oscillating low, high, low, high.

But it turns out the first right singular vector is a combination of those

two things.

So you can see the first five samples are lower than the next five samples, but

there's still oscillation within each of these two groups.

So it turns out the second right singular vector also shows a similar behavior.

It's got a difference between the first five samples and the next five samples.

But then there's an oscillation in the pattern within the groups.

So what does this mean?

It means the singular value decomposition is finding patterns that

explain the most variation.

But it doesn't necessarily directly decompose the patterns due to variables

that you think that you might care about.

And so it's not quite a perfect recapitulation

of the variables that generated the data set, but it does still give you some idea

of the patterns that you might see in the data set.
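The mixing of the two patterns can be reproduced in a small simulation. This is a sketch under assumed settings: which rows carry each pattern, the effect size of 5, and the seed are all choices made here, not the values used for the lecture's figures.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary fixed seed

# The two generating patterns across the 10 samples
step = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # low, then high
alt = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], dtype=float)   # low, high, low, high

# Assumed assignment: first 20 rows carry the step, even-numbered rows the oscillation
rows = np.arange(40)
has_step = (rows < 20).astype(float)
has_alt = (rows % 2 == 0).astype(float)

X = rng.normal(size=(40, 10))
X += 5.0 * np.outer(has_step, step)
X += 5.0 * np.outer(has_alt, alt)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
pct_var = 100 * d**2 / np.sum(d**2)

# Correlate each of the first two right singular vectors with both patterns
for i in range(2):
    r_step = np.corrcoef(Vt[i], step)[0, 1]
    r_alt = np.corrcoef(Vt[i], alt)[0, 1]
    print(f"right singular vector {i + 1}: corr with step = {r_step:.2f}, "
          f"corr with oscillation = {r_alt:.2f}")

print("percent variance, first two patterns:", np.round(pct_var[:2], 1))
```

The sign of a singular vector is arbitrary, so only the size of each correlation matters. The first right singular vector correlates with both generating patterns rather than isolating one of them.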

Again if you calculate the percentage of variance explained, so

here's the D values plotted from one to ten, because it's a diagonal matrix.

You can also see the percentage of variance explained is still very high by

the first pattern and the second pattern, and then it drops off.

So again we're kind of getting some idea of the dimension of

the true underlying variables that are sort of contributing to that data set,

as well as what they look like.

But they're not exactly the same, because of this requirement of orthogonality.

So how is this applied?

I'm going to show you one example from genetics here.

So in this example, they took a genetic matrix that consisted of,

in the rows they had many, many, many SNPs, so single nucleotide polymorphisms.

And in the columns they had many samples from people from different

places throughout Europe.

And, so they calculate the first two singular vectors which

are equivalent to the first two principal components, PC1 and PC2 here.

And when they plot them, you can see that if you plot each sample according

to these two principal components, you see that they cluster by geography.

So for example, here you see the sort of the Spanish and

Portuguese samples down here.

You see Italian samples over here and so forth.

So you basically get an identification of the structure in

the genetic data that corresponds to the geographic structure.

And that makes sense because genetics tend to be associated or have patterns that

are associated with population structure, which is then associated with geography.

Because people tend to have relationships and

children with people who live close to them.

So there's a relationship between geography and population structure.

So another way this can be used is to identify patterns in a data set.

So again here I'm plotting PC1, or

Singular Vector One, versus Singular Vector Two.

And so what I'm trying to do is I'm trying to find distances between samples.

And I'm looking at the right singular vector that's looking at patterns in

the samples across rows.

And so here, each dot represents one sample and they're colored by

whether they're a human or a mouse sample from this specific study.

And then the symbol shows which tissue they came from.

So, if you look at this data set,

the distance between any two points in the plot is supposed to be a sort of

an estimate of the distance between those two samples.

If these PCs explain a large percentage of the variation, or

the singular vectors explain a large percentage of the variation.

Then that's a really close approximation of the distance between the two samples.

If they don't explain a large percentage of the variation, then it's not a very

good approximation.
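The distance claim can be checked directly. This sketch assumes each sample's coordinates are the right singular vectors scaled by their singular values (the columns of D times V transpose); the data, the seed, and the helper name `sample_distance` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary fixed seed
X = rng.normal(size=(40, 10))
X[:20, 5:] += 3.0               # assumed block pattern

U, d, Vt = np.linalg.svd(X, full_matrices=False)

def sample_distance(k, a, b):
    """Distance between samples a and b using only the first k components."""
    coords = d[:k, None] * Vt[:k, :]  # k x n_samples matrix of coordinates
    return np.linalg.norm(coords[:, a] - coords[:, b])

# True distance between the first and last sample, using all 40 measurements
full = np.linalg.norm(X[:, 0] - X[:, 9])

print("full distance:         ", round(full, 3))
print("using 2 components:    ", round(sample_distance(2, 0, 9), 3))
print("using all 10 components:", round(sample_distance(10, 0, 9), 3))
```

With all components the distance is reproduced exactly. With only two, the distance can only be smaller than or equal to the full distance, and it is close to it when those two components explain most of the variation.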

So here you can see, for

example, that the testis samples from human and mouse are close to each other.

And the liver samples for human and mouse are also close to each other.

If you actually do a clustering you see that that's true.

You see the testis samples cluster close to each other, as do the liver samples.

And so what does this plot suggest?

This would suggest that there's a closer relationship between samples from the same tissue

than there is between samples from the same species.

And so, another way that you can use this is you can actually try to identify

effects that are different between groups.

So here, this is an actual example that comes from this book.

And so, in this example,

they've actually taken a real data set and made a subset of that data set.

And so, the subset of the data set that they've taken is from two

different batches.

But then, within those two different batches they've taken some samples from

men and some samples from women and

they've looked at genes on the Y chromosome.

And so, here you can see, here are the women and the men from batch one, and

here are the women and the men from batch two.

And so, you can see, for example,

that there are some genes that are very different between the two batches.

But there also are some genes that are different between the two sexes.

And so if you look at the first singular vector of this data set, or the first

principal component,

you actually see that the biggest effect that you see is the batch variable.

So you can see that batch one and batch two are very different from each other.

And so you can use that to detect different variables in the data set.

Whether it's batch effects or whether it's group differences

by decomposing the data into smaller variables.
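A simulated version of the batch example is sketched below. Everything here is assumed for illustration: 100 hypothetical genes, 20 samples, a batch shift on 50 genes, a sex shift on 5 genes, and row-centering before the decomposition. It is not the data set from the book.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary fixed seed

n_genes, n_samples = 100, 20
batch = np.repeat([0.0, 1.0], 10)  # first 10 samples in batch one, last 10 in batch two
sex = np.tile([0.0, 1.0], 10)      # alternating, so each batch has both sexes

X = rng.normal(size=(n_genes, n_samples))
X[:50, :] += 3.0 * batch           # assumed batch effect on 50 genes
X[50:55, :] += 3.0 * sex           # assumed sex effect on 5 genes

# Center each gene so the decomposition describes variation around the gene mean
Xc = X - X.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

pc1 = Vt[0]  # first right singular vector: one value per sample
r_batch = np.corrcoef(pc1, batch)[0, 1]
r_sex = np.corrcoef(pc1, sex)[0, 1]

print("correlation of PC1 with batch:", round(r_batch, 2))
print("correlation of PC1 with sex:  ", round(r_sex, 2))
```

In this setup the batch shift touches far more genes than the sex shift, so the first component tracks batch. The sign of the correlation is arbitrary.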

This is widely used like I said for batch effects.

This often comes up in technical artifact correction which we'll talk about later.

There are also many other decompositions people use.

They use multidimensional scaling,

independent component analysis, non-negative matrix factorization.

We're not going to cover those in this class,

because they're not as widely used as PCA and SVD, but they are other

matrix decompositions or ways to reduce the dimension of data that you might see.

If you want a lot more discussion of this you can see it in this

Advanced Statistics for Life Sciences course, where they go into pretty deep

detail about these different matrix decompositions.