An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

So permutation is one of the most widely used tools in genomics, and so

I'm going to show you a little bit about how to do permutations in R.

But there's actually a whole literature devoted to how to do this, because depending on

which exact problem you're working on,

the permutation scheme might be a little bit more complicated.

So here again I'm just going to load up the usual settings for

the plots, and I'm going to load the library that I'm going to be using here.

And then I'm going to load the data, in this case again from the Bottomly

experiment, so I'm pulling in the Bottomly data set.

And then extracting out the p data, the e data, and the f data, so

the phenotype data, the expression data, and the feature data.

So again I'm going to do some transformation of the data and

some filtering.

So I'm going to transform the data onto the log2 scale and

I'm going to remove the lowly expressed genes.
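The transform-and-filter step can be sketched in base R as follows. This is only an illustration on simulated counts, not the Bottomly data; the `+ 1` pseudocount, the cutoff of 3, and the names `edata_log` and `edata_filt` are assumptions made for the example.

```r
set.seed(1)  # only so the simulated counts are reproducible

# Simulated stand-in for a count matrix: 200 genes x 10 samples.
# The first 100 genes are lowly expressed, the last 100 highly expressed.
gene_mean <- rep(c(1, 100), each = 100)
edata <- matrix(rnbinom(200 * 10, mu = gene_mean, size = 2), nrow = 200)

# Move to the log2 scale; the +1 pseudocount (an assumption) avoids log2(0)
edata_log <- log2(edata + 1)

# Keep genes whose average log2 expression exceeds an assumed cutoff
keep <- rowMeans(edata_log) > 3
edata_filt <- edata_log[keep, , drop = FALSE]

dim(edata_log)
dim(edata_filt)
```

The right cutoff depends on the data set, so treat the value here as a placeholder.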

And so the first thing you could do, which we talked about when calculating

statistics, is calculate the statistics using the rowttests function.

So here I'm going to do that for the strain variable, and so

I can look at the observed statistics calculated from

doing that by making a histogram.

And so there's the distribution of the observed statistics.
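A rough base R sketch of computing a t-statistic for every gene and summarizing the observed distribution. The lecture uses the rowttests function; `row_t_stats` below is a hypothetical stand-in written for this example so it runs without extra packages, and the data and strain labels are simulated.

```r
# Hypothetical helper (not the rowttests function itself): equal-variance
# two-sample t-statistic for each row of a matrix.
row_t_stats <- function(mat, group) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2, length(group) == ncol(mat))
  g1 <- mat[, group == levels(group)[1], drop = FALSE]
  g2 <- mat[, group == levels(group)[2], drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  # Row-wise sample variances within each group
  v1 <- rowSums((g1 - rowMeans(g1))^2) / (n1 - 1)
  v2 <- rowSums((g2 - rowMeans(g2))^2) / (n2 - 1)
  pooled <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
  (rowMeans(g1) - rowMeans(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))
}

set.seed(1)  # only to make the simulated data reproducible
# Simulated stand-in for log2 expression: 500 genes x 20 samples
edata <- matrix(rnorm(500 * 20), nrow = 500)
strain <- factor(rep(c("A", "B"), each = 10))  # made-up strain labels

tstats_obs <- row_t_stats(edata, strain)

# Histogram of the observed statistics; drop plot = FALSE to draw it
h <- hist(tstats_obs, breaks = 30, plot = FALSE)
```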

And then if I want to do permutation,

the first thing I might need to do is set the seed.

So the reason we set the seed is to make sure that the pseudo-random number

generator produces the same numbers every time.

So whenever you're using permutation or any other randomization based algorithms

you need to set the seed when you're doing a reproducible document so

that it's always the same numbers that come out.

So you need to pass it some number, here I'm using 135,

it can be any positive integer.
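A small sketch of what setting the seed buys you: resetting the same seed before the same random call reproduces the result. The names `first` and `second` are made up for the example.

```r
set.seed(135)
first <- sample(1:10)

set.seed(135)          # reset to the same seed
second <- sample(1:10)

# Same seed, same call, so the two permutations match
identical(first, second)
```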

And so then I'm going to define the strain variable to be the p

data strain variable, and then to permute it I just use the sample command.

So what that does is it just scrambles up the order.

So if I look at the strain variable, it's in this order.

If I look at strain0, the permuted version, I get a slightly different order for the strains.

So for example, you can see here that the third element

is not the same between the original and the permuted version.

So it does that at random.
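A minimal sketch of permuting labels with `sample`, using made-up strain labels rather than the real phenotype data. Called on a vector with no other arguments, `sample` returns the same values in a random order, so the group sizes are unchanged.

```r
# Made-up labels standing in for the strain column of the phenotype data
strain <- factor(rep(c("A", "B"), times = c(10, 11)))

set.seed(135)
strain0 <- sample(strain)  # same labels, scrambled order

# Positions where the permuted label differs from the original
which(strain != strain0)

# Group sizes are the same before and after permuting
table(strain)
table(strain0)
```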

And so then I can basically recalculate the statistics, but

now using the permuted sample labels.

So when I do that I get something that you would expect.

So you've broken the relationship between the strain variable, ideally, and

the expression data and so

hopefully you would get statistics that are a little bit smaller.

Because you've basically broken that relationship.
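To illustrate why permuting should shrink the statistics, here is a sketch on simulated data where the first 100 genes carry a real strain effect. Everything in it is an assumption made for the example: the data, the effect size, and the hypothetical `row_t_stats` helper standing in for rowttests.

```r
# Hypothetical helper: equal-variance two-sample t-statistic per row
row_t_stats <- function(mat, group) {
  group <- factor(group)
  g1 <- mat[, group == levels(group)[1], drop = FALSE]
  g2 <- mat[, group == levels(group)[2], drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  v1 <- rowSums((g1 - rowMeans(g1))^2) / (n1 - 1)
  v2 <- rowSums((g2 - rowMeans(g2))^2) / (n2 - 1)
  pooled <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
  (rowMeans(g1) - rowMeans(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))
}

set.seed(1)  # only to make the simulated data reproducible
edata <- matrix(rnorm(500 * 20), nrow = 500)
strain <- factor(rep(c("A", "B"), each = 10))
# Add a true strain effect to the first 100 genes
edata[1:100, strain == "B"] <- edata[1:100, strain == "B"] + 2

tstats_obs <- row_t_stats(edata, strain)

set.seed(135)
strain0 <- sample(strain)                    # permuted labels
tstats_perm <- row_t_stats(edata, strain0)   # same data, scrambled labels

# Average size of the statistics for the genes with a real effect
c(observed = mean(abs(tstats_obs[1:100])),
  permuted = mean(abs(tstats_perm[1:100])))
```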

And so the one thing that happens that's interesting here is that at least for

this one permutation, you see the statistics are mostly positive.

When that happens, it's often because there's a covariate

that you haven't modeled.

And so you might want to repeat the calculation over and

over to see if you get something different on different permutations.
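Repeating the permutation can be sketched with `replicate`. The example below records the median statistic for each of 200 permutations of simulated data; the number of permutations, the choice of the median as a summary, and the `row_t_stats` helper are all assumptions made for illustration.

```r
# Hypothetical helper: equal-variance two-sample t-statistic per row
row_t_stats <- function(mat, group) {
  group <- factor(group)
  g1 <- mat[, group == levels(group)[1], drop = FALSE]
  g2 <- mat[, group == levels(group)[2], drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  v1 <- rowSums((g1 - rowMeans(g1))^2) / (n1 - 1)
  v2 <- rowSums((g2 - rowMeans(g2))^2) / (n2 - 1)
  pooled <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
  (rowMeans(g1) - rowMeans(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))
}

set.seed(1)  # only to make the simulated data reproducible
edata <- matrix(rnorm(300 * 20), nrow = 300)
strain <- factor(rep(c("A", "B"), each = 10))

set.seed(135)
n_perm <- 200
# One median statistic per permutation of the labels
perm_medians <- replicate(n_perm, {
  strain0 <- sample(strain)
  median(row_t_stats(edata, strain0))
})

# If single permutations often land far from zero, something is unmodeled
summary(perm_medians)
```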

And so another way to see that is you can look at the quantiles of these statistics.

So if I look at the quantiles of the permuted statistics,

I can see that they tend to be positive.

And then if I look at the quantiles of the observed statistics they're sort of

more symmetric.

And so that could be a sign that there's a batch effect or something, so that when

I permute, the batch ends up being kind of correlated with this permuted strain.

And it looks like there's an association signal when there isn't necessarily one.
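The batch explanation can be illustrated on simulated data. In this sketch every gene gets a shift in one batch and there is no strain effect at all. The labels are then rearranged by hand, rather than drawn at random, so that they line up with batch; this is one of the rearrangements a random permutation could produce. All names and numbers are assumptions made for the example.

```r
# Hypothetical helper: equal-variance two-sample t-statistic per row
row_t_stats <- function(mat, group) {
  group <- factor(group)
  g1 <- mat[, group == levels(group)[1], drop = FALSE]
  g2 <- mat[, group == levels(group)[2], drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  v1 <- rowSums((g1 - rowMeans(g1))^2) / (n1 - 1)
  v2 <- rowSums((g2 - rowMeans(g2))^2) / (n2 - 1)
  pooled <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
  (rowMeans(g1) - rowMeans(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))
}

set.seed(1)  # only to make the simulated data reproducible
n_genes <- 1000
batch <- rep(c(0, 1), each = 10)                 # unmodeled covariate
strain <- factor(rep(c("A", "B"), times = 10))   # balanced within each batch

# Noise plus a batch shift on every gene; no true strain effect
edata <- matrix(rnorm(n_genes * 20), nrow = n_genes)
edata[, batch == 1] <- edata[, batch == 1] + 1

# Hand-picked relabeling that lines up with batch:
# "A" gets 2 samples from batch 0 and 8 samples from batch 1
strain_imb <- factor(c(rep(c("A", "B"), times = c(2, 8)),
                       rep(c("A", "B"), times = c(8, 2))))

t_bal <- row_t_stats(edata, strain)       # labels balanced across batch
t_imb <- row_t_stats(edata, strain_imb)   # labels correlated with batch

quantile(t_bal)
quantile(t_imb)
```

With the balanced labels the quantiles should sit roughly symmetrically around zero, while the batch-correlated labels push most statistics in one direction even though strain has no effect in the simulation.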

So that's how you permute in R.