An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 约翰霍普金斯大学 的课程

Statistics for Genomic Data Science

116 评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

>> This lecture is how you calculate p-values and correct for

multiple testing in R.

So the first thing I'm going to do is I'm going to set up my plotting parameters,

just like I do in all the other lectures.

And then I'm going to load the packages that we need here.

The main ones we're going to be looking at are the limma, edge, gene filter,

and q-value packages.

So the data set I'm going to be using is the bottomly expression data set.

And so when I do that, I'm going to take out the expression data,

the phenotype data and the feature data so that I can model them individually.

And then I'm going to filter out the lowly express genes.

I'm going to perform a log transform.

So now what I can do is I can calculate p-values in a couple of different ways.

One is If you're using the gene filter package and

I just calculate f stats, using the real f stats function.

It automatically calculates parametric p-values for you.

So I can just look at the p-values directly.

So here we see a set of p-values.

One thing that you should expect to see is a spike of p-values near zero and

then flat out to the right.

So this isn't quite that, so that leads us to suspect that our model is wrong

somehow, maybe we missed an adjustment variable or something like that.

But overall, that's a very quick way to calculate p-values.

Another way that you can do that, is you

can do use the edge package to calculate the p-values for the not moderated model.

And so here, what I have to do is I build this study.

Using the express passing of the expression data,

telling you what group variable we want, strain in this case.

And then in this case let's say, we also want to adjust for the lane number.

So it builds the study then we can perform the differential expression analysis

using the likelihood ratio test on that study and calculate q values.

So then, we can look at the p-values that come out of that analysis by

looking in the q value object at the p-value argument.

And so here we see even more sort of strongly conservative p-value.

So you got after adjusting, so

that we still haven't probably gotten the model quite right.

They suggest that very little signal in the data set and that the confounding

due to the lane is not necessarily being enough to take that into account.

If we want the moderated p-values from the moderated statistics, the way that we

can do that is we can build the model matrix with the strain the lane number.

And then we can adjust here or fit the model

from limma, using the lmfit command.

And then we can shrink the model using the eBayes command.

To get the p-values again, you're going to use the top table function.

So to get the limma p-values, we're going to use the top table and

we're going to tell it to apply the top table to that eBayes limma object.

We're going to tell we want every single one of the p-values out,

there's numbers it's like the number of things through top things to report.

We're going to say the same is the dimension of data set and

then the element that we want out of there is the p-value.

So we're going to extract that part out and

then we can look at a histogram of the limma p-values as well.

And you can see they also have this sort of strange distribution.

To this case it seems to be regardless of the method that we use,

we see to get a little bit of a strange p-value.

So this suggests that like we might be missing a variable or

our modeling strategy might not quite be right.

Another way that you can calculate p-values is empirically.

So again if we're going to do some randomization we need to set a seed.

And then let's say we're going to do 1000 random simulations.

You can calculate the t stats using the gene filter,

rowttests function by passing at the data matrix and the strain.

And then you can create a matrix here, null t statistics.

So here, that matrix is NA to start, so these are the permuted statistics and

we haven't gotten yet.

So we're going to just say, let's make it full of missing values.

And it's going to have the number of rows equal to the number of rows of the data

set and the number of columns equal to the number of times you're going to permute.

So now we have our t stat, and our permuted t stats and

just to make it easier, I'm going to define.

The strain variable to be the strain from the p data.

So then what I'm going to do is I'm going to take this code which basically

loops over these 1,000 loops.

Every time on every one of the 1000 loops,

it assigns a strain zero variable to be a sample of the strain variable.

So what does that mean?

Every time it loops through.

it resamples, it rescrambles, or repermutes the strained data.

Then it recalculates the t statistics by applying the strained zero, so

it's basically using the permuted labels when it's calculating the null statistics.

So that's going to take a second cause the answer is round to a 1000.

Then you can use the Mp valve function from the q value package to calculate

the imperial p-values, which are basically the p-values from permutation.

So I pass it the original observed statistics and

I pass it the matrix of the statistics, and you can do one of two things.

You can either have each gene be compared to only the permuted statistics for

that same gene or you can pull the statistics for all the genes

together when you're calculating the sort of the permutation distribution.

So once I have the empirical p-values, I can make their distribution as well.

So it's the same sort of thing,

it's very conservative that we've been seeing over and over again.

So, then the next thing that I want to do is after I calculate those p-values.

I want to adjust for multiple testing.

So, that'll be the next video.