An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Often, to visualize or model genomic data, you need to make some

data transformations that put it on a more appropriate scale or

a scale that's easier to interpret.

So the first thing I'm going to do just like always is I'm going to set up

the plotting parameters to be the plotting parameters that I like and

then I'm going to load the packages that are necessary for this tutorial, and

then, I'm going to load in a data set here.

And so, again, I'm going to load in the bodymap data set,

just like I have several times.

So the basic idea is I'm going to connect to a particular URL, here.

Then I'm going to load the file from the connection, close the connection, and

then set up some of the variables.

I'm going to make the variable name a little bit shorter and

extract the phenotype data, the expression data, and the feature data.

If you followed previous lectures, you'll see how I've done that in each of those.

And so, now the first thing that I'm going to do is I'm going to start looking

at some plots.

So the first thing you can do is look at a plot of some simulated

normal data.

This is the kind of plot you would typically see for a nicely behaved distribution.

But then if you look at the histogram of data from say the first sample,

from the expression data set, and if I

make that histogram I can see almost all of the values are equal to zero.

If I increase the number of bins in the histogram,

you can see that really almost all the values are almost exactly zero, with

just a few values that are very, very large.

And so the first thing that you might do is you might

do some kind of transformation.

One transformation is to put things on the log scale, and

so the log scale basically puts things on a multiplicative

scale that makes it a little bit easier for you to visualize the data.

So often when you have highly skewed count data it's a very common transformation for

people to apply.

And so if I do that you can see a little bit better the distribution.

But this is a little bit deceiving.

One of the reasons you can see the distribution better is because when I

apply the log to this dataset, the log of zero is actually undefined.

So if I look at the minimum of the log of these values, I can see it's negative

infinity. And if I look at the quantiles of that distribution,

I can see that it can't calculate them very well, because basically

you have values that are negative infinity for

almost all the entries, because you mostly have zeros in the data set.

And so the first thing that you can do to sort of avoid that problem

is that, instead of having a count of zero, you can just add a small number,

say one, to every count. That barely affects any of the big counts,

and it makes the log transformation defined for the zeros as well.

So if I do the histogram of the log of the counts plus one,

now I can see this distribution.

You get the zeroes back, so you get this big spike of zeroes, and

then you can see the distribution to the right hand side.
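
The lecture itself works in R. As a minimal sketch of the same idea in Python (standard library only), the made-up `counts` list below stands in for one sample of the expression data. It shows why the raw log fails on zeros and how adding one fixes it. The values are hypothetical, not taken from the bodymap data set.

```python
import math

# Hypothetical counts for one sample: mostly zeros, a few large values.
counts = [0, 0, 0, 0, 0, 0, 3, 12, 150, 98000]

# The log of zero is not a finite number. R returns -Inf for log(0);
# Python's math.log raises ValueError instead.
try:
    math.log(0)
    log_zero_ok = True
except ValueError:
    log_zero_ok = False

# Adding 1 before taking the log makes the transform defined for every
# count and maps a count of 0 to exactly 0.
log_counts = [math.log(c + 1) for c in counts]

# For large counts the added 1 changes the result only slightly.
shift_for_big_count = math.log(98000 + 1) - math.log(98000)

print(log_zero_ok)
print(log_counts)
print(shift_for_big_count)
```

The zeros come back as a spike at 0 on the transformed scale, while the large counts are spread out to the right.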

So the other thing that people do is instead of the log transform, they often

do the log two transform, so that's just a different base for the transform.

And so the reason why they do that is that they want to be able to basically

compare two values, and when they compare two values on the log two scale,

if you take the difference of those values,

it's equivalent to the log two of the fold change.

And so now when you compare the two values to each other,

if you get an increase of one, that corresponds to a doubling of the original value.

And so that's a useful metric for measuring how related two things are.
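
As a small worked example of that relationship (a sketch in Python with made-up counts, not values from the lecture's data), the difference of two values on the log two scale matches the log two of their ratio:

```python
import math

# Hypothetical counts for the same gene in two samples.
count_a = 100
count_b = 200  # twice as large as count_a

# Difference of the two values on the log2 scale.
log2_diff = math.log2(count_b) - math.log2(count_a)

# log2 of the ratio of the raw values, i.e. the log2 fold change.
log2_fold_change = math.log2(count_b / count_a)

print(log2_diff, log2_fold_change)
```

When the plus-one offset is used, the difference is the log two of (count_b + 1) / (count_a + 1), which is close to the fold change when the counts are large.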

And so then the next thing that you might want to do is you might want to say,

well even once I've done this sort of transform I see

almost all the values over here are still stuck right near zero.

So it's very hard to see.

One thing that you could do is you could just zoom in.

So here I'm going to make that same histogram I did a minute ago but

now I'm going to set the x limits to be between one and 15.

So that's going to basically ignore the zeroes and

just make the histogram out here between one and 15.

And so when I do that I can see

the nice part of the distribution, but again, I've sort of ignored those zeros.

So we can actually count up for each row or

each gene, how many values are equal to zero.

So I'm going to take the row sums

of the number of times that the expression data is equal to zero.

And so, when I do that, I can see that a lot of

the genes are basically all zeroes, where almost every value is zero,

and then there's a few where almost all the values are non-zero.
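
The per-gene zero count can be sketched as follows. This is a Python illustration on a tiny, hypothetical matrix called `expression` (rows are genes, columns are samples), not the lecture's R code.

```python
# Hypothetical expression matrix: each row is a gene, each column a sample.
expression = [
    [0, 0, 0, 0],          # zero in every sample
    [0, 0, 0, 7],          # zero in almost every sample
    [12, 0, 30, 25],
    [450, 390, 510, 600],  # never zero
]

# For each gene (row), count how many samples have a count of exactly zero.
zeros_per_gene = [sum(1 for value in row if value == 0) for row in expression]

print(zeros_per_gene)
```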

So one thing I can do is I can remove the low value genes.

So one way to do that is by just taking the mean for each row and

saying if the mean is less than some number, here I'm picking five,

you can pick a different number.

We'll say it's too low.

It's a low gene.

So there are 41,000 where it is too low and 11,000 where it's not.

And then what I can do is filter the data set, the data frame.

And I can say only keep the ones that are not low gene.

So the bang, or exclamation point, here means not low genes.

It flips the values around, so it'll say true for

all the genes that are at or above the threshold.

And so now I've got a new, filtered data set which actually

has only the 11,000 genes or so that have an average count greater than some value.
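
A minimal sketch of the mean-based filter, written in Python on a hypothetical matrix rather than in the R used in the lecture. The names `expression`, `low_genes`, and `filtered` are illustrative, and the threshold of five is the one mentioned above.

```python
from statistics import mean

# Hypothetical expression matrix: each row is a gene, each column a sample.
expression = [
    [0, 0, 0, 0],          # mean 0
    [0, 0, 0, 8],          # mean 2
    [0, 0, 1, 100],        # mean 25.25
    [12, 0, 30, 25],       # mean 16.75
    [450, 390, 510, 600],  # mean 487.5
]
threshold = 5  # other cutoffs are possible

# Flag each gene whose mean count across samples is below the threshold.
low_genes = [mean(row) < threshold for row in expression]

# Keep only the genes that are NOT flagged as low.
filtered = [row for row, low in zip(expression, low_genes) if not low]

print(low_genes)
print(len(expression), len(filtered))
```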

The average isn't always necessarily great for

count data because you often have these really high values.

So if I look at a summary of the e-data,

for example, for this sample, almost all the values are equal to zero,

but the mean is relatively high because you have this gigantic maximum value,

but you can see the median isn't as affected.

So the median is just the 50th percentile, so even if you have one

gigantic value and otherwise all zeroes, it will be a little bit more robust to that, and

so people often do this filtering based on the median.
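
To make the contrast concrete, here is a short Python sketch with a made-up sample (the numbers are hypothetical) in which one huge count pulls the mean up while the median stays at zero:

```python
from statistics import mean, median

# Hypothetical counts for one sample: mostly zeros and one gigantic value.
sample = [0, 0, 0, 0, 0, 0, 0, 2, 5, 100000]

sample_mean = mean(sample)      # dominated by the single large count
sample_median = median(sample)  # 50th percentile, unaffected by the outlier

print(sample_mean, sample_median)
```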

So you can actually use the row medians command to find all those genes that

have a row median count less than five.

And then if I make a table of those low genes versus the low genes for the mean,

you can see that most of the time they're both true at the same time.

But sometimes, when the mean is still high,

the median is low, and so we still filter some more of those out.

And so then I can make a second filtered data set,

keeping the genes that pass the median filter by saying,

not the low genes on the median scale.

Then if I look at the dimensions of that,

I can see that it's now a little bit smaller than the one I

did when I filtered on the basis of the mean.
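
The comparison of the two filters can be sketched like this. It is a Python illustration on a hypothetical matrix, and the `Counter` of paired flags plays the role of the two-way table described above. All names are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical expression matrix: each row is a gene, each column a sample.
expression = [
    [0, 0, 0, 0],          # low by mean and by median
    [0, 0, 0, 8],          # low by mean and by median
    [0, 0, 1, 100],        # mean 25.25 but median 0.5: low by median only
    [12, 0, 30, 25],       # mean 16.75, median 18.5
    [450, 390, 510, 600],  # high by both
]
threshold = 5

low_by_mean = [mean(row) < threshold for row in expression]
low_by_median = [median(row) < threshold for row in expression]

# Cross-tabulate the two flags; keys are (low_by_mean, low_by_median) pairs.
cross_table = Counter(zip(low_by_mean, low_by_median))

# Keep the genes that are not low under each rule.
filtered_mean = [r for r, low in zip(expression, low_by_mean) if not low]
filtered_median = [r for r, low in zip(expression, low_by_median) if not low]

print(cross_table)
print(len(filtered_mean), len(filtered_median))
```

In this toy example the median filter removes one more gene than the mean filter: the gene whose mean is inflated by a single large count.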

So, once I've done that sort of filtering,

then the transform plots also are a little bit easier to see,

because you've basically removed all those genes that had really low values.

And so the histogram will still have some zeroes here, but

you've filtered out a lot of the ones that are exactly zero and so

the histogram makes it a little bit easier to see.

So that's a little bit about transforms and filtering low count observations.