Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Bioconductor for Genomic Data Science

124 个评分

Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

从本节课中

Week Four

In this week, we will cover Getting data in Bioconductor, Rsamtools, oligo, limma, and minfi

- Kasper Daniel Hansen, PhDAssistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this session we will discuss a popular limma packet and

a little bit about some general concepts from statistical inference.

The limma packet says, the limma stands for linear models for micro rays.

It's a package that was developed for analysis of gene expression microrays, but

it fits a classic model

that we want to fit a lot of different types of genomic data.

So while a lot of the terminology limma has to do with microrays,

it really being used for a lot of different types of data.

I'm using it myself for my own research on DNA methylation.

Linear models is a general class of models that we use,

roughly speaking, for continuous data.

There's other types of,

there's a couple of other analysis packages that are very popular.

Bioconductor, that deals with count data.

Specifically, negative binomial models for count data.

These packages have names such as, PE Seek, PE Seek Two, and Estar.

Estar is from the same authors as the limma packets.

A lot of these models basically treat each feature.

We have a set of features, in this case here, let's think of them as genes,

and we want to find genes that are differentially expressed.

A thing that's very important in genomics is that we have learned that

borrowing information across genes is extremely beneficial.

So in most gene expression data sets where many genes,

say 20,000 or 30,000 different genes, we have relatively few samples.

Sometimes you have six samples, sometimes you have on the order of a hundred

samples, but we have far fewer samples than we have genes.

We are essentially asking the same question for all the genes.

We can get a better analysis by borrowing information across genes.

That's usually done using a statistical class of statistical methods known

as empirical base methods, and it typically has to do with modeling

the relationship between gene expression and variance of the gene.

This all gets a little bit technical, but the common theme here is that you

gain a lot of power and performance by using these empirical based techniques.

That is something that is a little bit special to genomics.

In all these cases, we have some data and

we have something we will call the design of the experiment.

The design of the experiment tells us something about what samples were run

at what times, which rules do they come from?

Are the samples paired?

Do we have additional comparative information at the sample wants you

to take into consideration such as sex sample or the age of the sample,

or the age of the person where you take the sample from?

And these are all encapsulated into something we in

statistical terms call a design matrix.

We will discuss a design matrix a little bit later in this session when

we can see an example.

So let's start off by loading in the limma packets and

load an example data set called leukemiasEset.

So we see here that leukemia used as an expression set, it has 60 samples.

And the interesting variable in this data set is

a variable called leukemia type Leukemia types, so

this data set here, profiled four common subsets of Leukemia.

We have ALL, AML, CLL and CML, and

then finally we have the reference group, that's NoL, it stands for Not Leukemia.

So these are patients that didn't have Leukemia.

We, in this session we will avoid discussing complicated designs and

we will just go through a very simple two group comparisons where we imagine we

are profiling examples from two different experimental groups and we want to find

genes that are different in express We got to select the LL group and the AOL group.

So let's subset the data.

So now that we subset the data, we can have a look at our variable of interest.

This is where the variable that encapsulates everything about the sign we

are interested in, in this toy example.

Note that leukemia type is a factor.

Oh, I didn't take it from our data.

It's been a factor.

It has five levels, and these levels represent

types of leukemia that are not present in our subset of the data.

That's going to confuse some of the model fitting functions.

So we're going to clean up the data a little bit,

we're going to recreate the leukemia type factor, and the effect of that is we

are going to get rid of the three levels for which we don't have any samples.

Look, we have the same variable list but

now we only have two levels instead of five levels.

The first level in a factor is always important.

It's the first level that's being used as a reference level.

This case here we have ALL as the reference level.

I'm not going to fix that.

Normally you would do that.

The implication of this is that the differential expression we're going to see

is genes That are differentially expressed

in patients that don't have leukemia relative to those who have ALL.

That's a little confusing.

A more natural thing is to talk about genes that are differentially expressed

in ALL relative to not having leukemia.

The first thing we're going to do when we fill a liver more or

less we set our design matrix, which encapsulates the design of the experiment.

Setting up a design matrix is a general statistical task and it takes quite

a while to become fully comfortable with it There's a great introduction to

this in the limma user's guide and if you need more help which is very likely,

you can ask at the support forum or you can consult a friendly local statistician.

I'm going to emphasize that setting up a design matrix for the new experiment

experiment usually does not require very much insight into genomics.

It's a statistical task that doesn't have much to do with genomics,

although we use it a lot.

So, let's have a little look at the design matrix so we can see what we have created.

It's a matrix where the number of rows is equal to the number of samples, and

we have two columns which describe two parameters in the model.

We have an intercept parameter.

In this case here, the interpretation, the intercept parameter is that this is

the average gene expression for a given gene across all our samples.

Sorry I crossed the ALL samples.

The second column, which is really the parameter we're interested in,

is the difference in gene expression between NOL and ALL.

If this parameter is equal to zero, it means that the gene is

not differently expressed because there's no difference between NOL and ALL.

So setting up model matrices and using this model matrix thing and

being sure that you interpret the coefficient correctly is, as I said,

a task that's completely beyond the scope of the session here.

So we'll continue this.

We see this example a lot.

This is probably the most common set up a two group comparison and

it's important to feel somewhat comfortable with it.

I'm going to note that what we're going to do here,

find the different expressed genes between two groups,

is we're going to do this with something that is very similar to doing it teaches.

Well let's wait on that for a moment.

Now we have the design, and

the first think we do in limma is we fit the basic model for the design.

So, in statistics, we separate the model from the hypothesis.

The model tells us in this case here, something about that we have

24 samples and they have been, there are two groups of samples.

The hypothesis is that there's no difference

between the samples in the ALL group and the samples in the NOL group.

So we round LM fit which fits a linear model to all the genes separately.

So at this stage here, at the output of LM fit,

we've done nothing where we borrow information across genes.

That happens in the next step where he

does an empirical base estimation of the variability.

Now this is again, completely outside the scope to explain what this really does but

essentially it shows that the variability of a gene

is a mixture of a gene specific variability and

a variability where you assume or global variability where the global

variability assumes that all genes have the same variability.

Assuming that all genes have the same variability is clearly

not right biologically, but assuming, but

on the other hand, estimating the variables for each gene separately

means we are estimating a variability with only 12 samples in each group.

That's not a whole lot of data and what happens in the empirical base

step here is that we improve our variance estimation by forming this mixture.

Okay, so let's just do it and now we filled it and

we can look at the top different expressed genes.

So let's step back a little bit here.

Why is there only one list of different express genes?

That's because we have a design, where we have two group comparisons.

When you only have two groups and

you have no covariance there's only one comparison that's worth making.

That is asking whether or not the two groups have the same expression level.

So when I run this top table thing here, I don't have to explain what coefficient,

or what chest I'm interested in.

We're going to expand a little bit on this below.

So this here we have to get some columns out.

We get a lock fold change, which is a fold change between ALL and NOL.

We get a T statistic, we get a P value.

We get just an adjusted P value, which is adjusted for multiple testing.

That's usually what we're interested in.

We're also interested in the full chase estimate.

Now remember that our factor busted off so

that reference level is ALL which means that the fold

chains we have here is the fold chains from NOL to ALL.

So we can actually compute this by hand just to make sure the we have

the right interpretation.

So let's just get the first gene, and we're going to get the gene name so

we can look it up in our expression list,

this is really just like a feature name.

And now we're going to compute the expression for each,

the mean expression inside each group.

I do that with this little C apply command here which says I take my expression level

for only one gene, and I do a mean inside each level of the leukemia type thing.

So what could I get out of that?

This is the expression level in the two groups, ALL and NoL, and my

log fold change is NoL minus ALL, because of the reference level in the factor.

So again, this is a little bit where it's positive

because a gene is really actually, NoL and ALL is and isn't an OL.

Now, let's try to do this again.

So, what we've done here is really similar to a cheap statistic.

It's something known as a moderator gene statistic, because we have estimated

the variability of the data in a different way from a normal t statistic, where we

have utilized that we have many thousands of genes, and we utilize that information.

Now let's go back and refit this model here,

where we very specifically explain what contrast we are interested in.

So we set out a new model matrix we call design two that is very similar

to design one, but notice this -1 I have here.

Again, we won't go into details but

essentially it means that I get a new matrix where

the column names are slightly different.

And the true parameters in my model really what I get out of this

model matrix is that the first parameter is the expression level in ALL, and

the other parameter is the expression level in NoL.

Remember before, the design matrix had two parameters, and

it was expression in ALL and then the difference in expression from ALL to NoL.

I'm going to change the name of these, the columns of this thing here,

just because this is a little bit much to type.

I have ALL and Nol.

And now, I do the same thing as before, I fit it, but now,

I have to make a contrast.

So a contrast is a hypothesis, and

we're going to test, we're going to call this function called makeContrasts,

and I hope you can get a little feel for how it's run.

So I'm saying I want to have a contrast that's ALL minus NOL.

This is the opposite of what we did before,

we allow using NOL as the reference level.

I say I take my levels from this design matrix, and

I get a little this contrast matrix is really, quite simple, and

what this contrast test is whether or

not these two parameters with this linear combination is equal to zero.

So it's going to test whether ALL-Nol is equal to zero.

So now that I have a contrast I have to do something called contrast.fit and

then I've got to run my eBayes a module born from racing across genes,

and now I can look at my top table.

Now my top table here is relative to this specific

contrast I have inside my That I used in the contrast upfit.

So now note that I changed the sign of the log fold change, and

this here is essentially exactly the same as it did in the beginning it's just using

a slightly different setup.

It uses contrast instead of having the contrast implicit in the design matrix.

So I hope this is going to be a good beginning for

you to learn about design matrices and statistics, but

we're really not going to cover more on this in this class.