An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

116 ratings

From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The most common kind of statistical inference that's performed in genomics, and

indeed across a large range of genomic and biological sciences, is hypothesis testing.

Hypothesis testing gets a bad rap amongst statisticians because sometimes it can be

a little bit difficult to interpret.

And in particular, it's important that you get your null and

alternative hypotheses right when you're performing an analysis.

So what do I mean by a null and alternative hypothesis?

So, first, you're going to set up a hypothesis that you'd like to reject.

So in this case, you would say, maybe if you're doing a gene expression experiment,

and you're looking for genes associated with age,

the null hypothesis might be that there's no relationship at all between age and

the expression for a particular gene.

The alternative hypothesis would be that there is a relationship between age and

gene expression so that the relationship is non zero.

So when you set up these hypotheses, basically you're setting up

the thing that you're trying to reject if you collect enough data, and

the thing that you're trying to see might be true.

So often you actually define the null hypothesis alone, and

then you don't define the alternative hypothesis.

Some people don't like that particular setup, but that's often how it's done.

In other words, almost always the hypothesis that's being

tested in genomic experiments is that the relationship is exactly zero

versus the alternative that it's not equal to zero.

So that's not really defining a specific alternative. You're just

trying to reject the null hypothesis, which says that there's exactly

a zero relationship.

So if we look back at our linear regression model,

suppose we're modeling gene expression as a linear function of age.

Then the way that we define these hypotheses in terms of mathematical

notation is that the null hypothesis is that there's no relationship

between age and expression.

In other words, the b1 coefficient is exactly zero.

And the alternative is that it's anything other than zero, so

it could be positive or negative.

We don't have a specific alternative in this case.
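
Written out, the two hypotheses just described look roughly like this. Only b1 is named in the lecture; the other symbols (y_i for the expression of the gene in sample i, age_i, the intercept b_0, and the error term e_i) are assumed notation for the regression model.

```latex
\begin{align*}
y_i &= b_0 + b_1\,\mathrm{age}_i + e_i \\
H_0 &: b_1 = 0 \\
H_1 &: b_1 \neq 0
\end{align*}
```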

So the goal is to basically set up a scenario where we can ask,

is b1 equal to zero, and then try to reject that hypothesis.

To do that, we say something

about what the distribution of the data looks like under that hypothesis,

and whether we're seeing something that's very extreme with respect to that distribution.
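
As a rough illustration of "how extreme is what we saw under the null", here is a minimal Python sketch. It is not the method used in the course: it assumes a permutation approach, where shuffling the expression values breaks any link with age and so mimics the null that b1 is zero. The data are simulated, and the names `slope_t_statistic`, `permutation_p_value`, `age`, `expr_null`, and `expr_alt` are made up for this example.

```python
import random


def slope_t_statistic(x, y):
    """t-statistic for the slope b1 in the least-squares fit y = b0 + b1*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares and the standard error of the slope
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = (rss / (n - 2) / sxx) ** 0.5
    return b1 / se_b1


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often a shuffled data set is at least as extreme."""
    rng = random.Random(seed)
    observed = abs(slope_t_statistic(x, y))
    y_perm = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # shuffling removes any age/expression relationship
        if abs(slope_t_statistic(x, y_perm)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero
    return (extreme + 1) / (n_perm + 1)


# Simulated (hypothetical) data: 40 samples with ages between 20 and 80
rng = random.Random(1)
age = [rng.uniform(20, 80) for _ in range(40)]
expr_null = [5.0 + rng.gauss(0, 1) for _ in age]            # true b1 = 0
expr_alt = [5.0 + 0.05 * a + rng.gauss(0, 1) for a in age]  # true b1 = 0.05

p_null = permutation_p_value(age, expr_null)
p_alt = permutation_p_value(age, expr_alt)
print("p-value when b1 is truly zero:", p_null)
print("p-value when b1 is 0.05:", p_alt)
```

A small p-value means the observed statistic sits far out in the tail of the null distribution, which is grounds to reject the null; a large one is not evidence that the null is true.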

So in general it's not possible to accept the null.

You're going to try to reject it, but it's not possible to accept it.

And it's definitely not possible to accept the alternative.

You can often reject the null, and you're going to say,

we're claiming in favor of the alternative.

But it doesn't mean that you've necessarily identified exactly what

the alternative is.

Ideally, when you're building your statistic to measure this relationship,

it's set up so that it's monotone.

In other words, as that statistic gets really big, or it gets really small,

then it's sort of less likely to be null.

In other words, it's more of an extreme observation from the null distribution.

If you change the variables that you're adjusting for

in your regression model, the null hypothesis actually changes,

because the null hypothesis depends on all the regression coefficients you're fitting.

In the previous model we only had one coefficient that we were fitting, and so

the null and the alternative hypotheses were very easy to define.

But if you have a large number of adjustment variables,

then you're saying that the effect is equal to zero

once you adjust for all of the other variables in your model.
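
To make the point about adjustment variables concrete, here is a small Python sketch on made-up numbers. It assumes a single hypothetical adjustment variable called `batch`, and it gets the adjusted age coefficient by regressing both age and expression on `batch` and then relating the residuals (the Frisch-Waugh-Lovell result), which is a convenience chosen here to avoid a matrix library, not something from the lecture.

```python
def simple_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def residualize(v, z):
    """What is left of v after regressing it on z (with an intercept)."""
    n = len(v)
    mean_v = sum(v) / n
    mean_z = sum(z) / n
    slope = simple_slope(z, v)
    return [vi - (mean_v + slope * (zi - mean_z)) for vi, zi in zip(v, z)]


def adjusted_slope(x, y, z):
    """Coefficient on x in the model y = b0 + b1*x + b2*z + e."""
    return simple_slope(residualize(x, z), residualize(y, z))


# Toy data (entirely hypothetical): older samples all come from batch 1,
# and expression depends on batch only, not on age.
batch = [0, 0, 0, 0, 1, 1, 1, 1]
age = [30, 40, 30, 40, 60, 70, 60, 70]
wiggle = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
expr = [2.0 + 3.0 * b + w for b, w in zip(batch, wiggle)]

unadjusted = simple_slope(age, expr)        # null: no age effect at all
adjusted = adjusted_slope(age, expr, batch)  # null: no age effect given batch
print("age slope without adjustment:", round(unadjusted, 4))
print("age slope adjusting for batch:", round(adjusted, 4))
```

In this toy data the unadjusted slope is clearly non-zero while the adjusted one is essentially zero, so "b1 equals zero" is a different statement, with a different answer, depending on what else is in the model.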

In general,

you have to be very careful to make sure that the null makes intuitive sense.

And you can really twist yourself into knots if you're not careful about

defining, very clearly in advance, what would be the no-effect scenario

in this particular modeling strategy.

And it's very important to get this step right.

This is the reason why hypothesis testing is often very highly

criticized by a number of statisticians: it's very easy

to get the null hypothesis wrong,

or at least to argue about what the null hypothesis is.

This is a case that was actually discussed in very great depth

on Lior Pachter's blog.

He talked about a particular null hypothesis from a particular paper,

where he disagreed with the way they had defined the null hypothesis, and

then discussed how you would define the null hypothesis in that case.

I'm pointing you to this article because there was this very long discussion

in the comments section on this blog post about how to define the null hypothesis.

And I think it gives you some insight into looking at

how difficult it is to get the null hypothesis right,

particularly when dealing with complex and high dimensional genomic data.

But this is a point that's worth paying a lot of attention to.

Again I'd point you to this inference class if you want to learn a lot

more about hypothesis testing.

And the Statistics and R for the Life Sciences course also has a lot more about

inference and hypothesis testing in particular if you care about that.

But again the key point is to remember to get your null and

alternative hypotheses right.

And the null hypothesis in general, usually for genomics,

is defining what this model would look like if there were no effect.