An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

172 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

So after fitting many regression models,

the next thing that you might want to do is

calculate the statistics that we're

going to use to quantify

the relationships that we care about.

So I'm going to do that with a couple of examples here,

and also I'm going to set everything up

like I usually do for the plotting,

and then I'm going to load the packages that we need.

So the key ones that we care

about for this analysis here are the limma package,

the edge package, and the gene filter package.

So the first thing that I'm

going to do is I'm going to load this

bottomly data set that

we've used a couple of different times,

and then I'm going to see that I've

got this bottomly dataset

loaded and I've extracted

the expression and phenotype data.

So again I'm going to do a transformation

on the data and then I'm going to

remove the rows that aren't above a particular threshold.

So in other words, the lowly expressed genes,

so then it makes everything a

little bit easier as we go along.

So the first thing I might want to do is

just compare two groups.

So you can do that by calculating the t-statistic

for comparing those two groups for every single gene.

So one way to do that is to the slow way,

the fast way is to compute them

all at once using the gene filter package.

So you can do that using the rowttest function.

So I'm going to calculate

the rowttests I'm going to pass it

the data matrix of the gene expression data,

and then a factor variable telling it

I'm going to compare the two levels of the strain.

So the result of that is

the t-stats object which has a statistic and the p value.

Here we're going to look at the statistics,

we'll talk about p values later.

So if I do that tstats_object and look at this statistic,

I can see this is the distribution of statistics of

one statistic per gene that's been calculated.

It turns out that that's not going to work,

this rowttests only works

if there's a two group comparison.

If you want to do a multi group comparison,

you can still do use the gene filter package,

you can use the rowFtests function.

So if I do rowFtests,

I pass it again,

the expression data object and

then I have to pass it here,

any variable that has two or multiple levels.

So here I'm going to use lane number

which has multiple levels,

so I calculate the F-stats.

So just to be clear,

the lane number variable it can

have one of eight different levels.

So now it's going to compute,

it's looking for any difference at all

between any of those possible differences.

So then the result is

again an object that has stats and p values in it,

and so I can look at the statistics from this one.

This is now one of those F-statistics

that we talked about that it's calculating.

So these are all positive compared to

the t-statistic which had a symmetric distribution.

So that's where you can do it If you

don't want to do any adjustment,

you can also do no adjustment version

of the moderated statistics using limma.

So here we have to create the model matrix

for the comparison we care about,

here is the strain,

and then we can fit the limma model

using the lmFit command.

So I pass it the data matrix,

the gene expression data matrix,

and the model that I want to fit, fits the model,

and then I can shrink

those statistics using the eBay's command.

Then I can look at the statistics that come out,

you see that it actually

calculates a statistic both

for the intercept and for the strain.

So you have to select out the one that you care about,

in this case we care about the statistics

for showing the second one.

So here I can plot that statistic,

the second column of the t-statistics versus

the t-stats from the gene filter package.

So on the x-axis I have the moderated t-stat,

and on the y-axis,

I have the t-stat.

So you can see that they're not exactly the same,

but they're very highly correlated with each other,

and you can see for example that

the moderated t-statistic has

a different distribution than the regular t statistic.

So then I can add

a color to that or aligned to that, a 45 degree line.

You see they're very similar.

Basically, mostly during the edges

are different from each other.

A key thing to note here

is that I had to multiply by

a minus here in front of the t-stats,

and that's because basically, we don't know,

there's not a natural direction

for this t-statistic to go.

You have to make sure that the effects

are going in the same direction.

So you can't do

an adjusted statistic using the gene filter package,

but you can using the limma package.

So here I create

a model matrix that has the string variable,

it also has the lane number variable as well.

So then I can just fit that

using the lmFit command just like I did before,

but with the adjusted model matrix.

Then I can calculate the shrinkage estimates.

Now I get t-statistics that are shrunk for the intercept,

the strain, as well as the all the different levels

of this late lane variable.

So now I'm going to get very different results.

So if I plot here the results that I get out from

the adjusted model versus

the t-stats that I got from gene filter,

they'll be very different because I'm no longer in

one case I'm actually doing an adjustment,

and in one case I'm not doing an adjustment.

So here you get

different statistics all the way through

including in the middle of the distribution.

So I can add

the 45 degree line and you see that

there's still highly correlated with each other,

but they're no longer the exact same.

In fact, you see a little bit off the 45 degree,

the coefficients from the adjusted statistic.

Okay. So another option that you can do is you can

actually fit factor model

for multiple level factor using limma.

So here I'm going to create the model matrix for that,

where I have the lane number factor level,

and then I can fit the model just like I did before,

where I'm just passing it the new model matrix

for lane and the expression data.

Then I can do the same empirical Bayes estimation

of the statistics.

So now you can see I have

statistics just for the lane variable.

So suppose I wanted to do a comparison of,

is there any difference between any

of these lanes whatsoever,

I can use the top table function to do that.

So if I want to find

the top statistically associated genes with the lane,

I can pass it moderated fit

from the limma analysis for lane,

and I can tell which coefficients I want to look

at to calculate the statistic for.

So in this case,

I see that it's the second coefficient

starts the lane and

the last one is the seventh coefficient.

So then I say, how many do you want me to report?

So it's a top table so it'll report then

the n most statistically associated genes.

In this case, I want to see all of them.

So I'm going to say, report every single gene.

Then I don't want it to be sorted so

that it's in the same order as the gene that I have.

So I have this top lane variable

which gives me the F-statistic, the p value,

and the adjusted p value for

these calculation where you're looking

at the association with

any of the levels of the lane variable.

So I can plot that versus the F-stat

from calculating the same thing

with the gene filter package,

and it's going to be a little bit different here because

again we're doing a moderation of the statistics,

so it's not quite the same thing.

I can do the same thing in edge,

and again this might be a way or work

if you don't have as much knowledge and model matrices.

So you can again build this study

using the expression data,

you can tell what group variable you want to do here.

I'm going to say, use the lane number

as the group variable,

and then you can calculate the differential expression

using the LRT command

that stands for likelihood ratio tests.

So once I've done that,

I can actually also calculate

the Q values that go with that,

but one of the components of that Q value

is the stat variable.

So I can use that stat variable to

calculate a comparison with

the F-stats from the gene filter package.

So in this case, they're exactly

the same because there's no moderation.

So the statistics you get from gene filter and

the statistics you get from edge are exactly the same,

but with edge you can do an adjustment variable.

So here I'm going to again build the same study.

I'm going to use a build study,

I'm going to pass it the gene expression.

I'm going to tell it the group is the lame number,

but here I want to adjust for a strain.

So I'm passing it an adjust variable with adj.var.

So now I have this second edge object.

I'm going to be the second differential

expression object,

calculate this second set of key values.

So now the F-stats might be

different because I've adjusted for variables

and now you see that the F-stats are same

once you adjust for the strain.

So those are three different ways that you can calculate

statistics for many regression models in R.