An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

124 ratings

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most fundamental concepts in statistics for genomics,

or really statistics for anything, is experimental design.

So here, we're going to talk about three of the key ideas, variability,

replication, and power.

So, the first thing to keep in mind is the central dogma of statistics.

So, the central dogma of statistics states that we have some population that

we'd like to measure things about.

So, this is the population that we have over here.

And we have some, in this case,

shapes, and we want to measure the fraction that are a particular color.

The way that we do that is that we use probability and

sampling to get a smaller subset of these objects, and measure something on them.

So, in genomics this might be the population of all people that have

a particular cancer.

We might sample a subset of them, ideally randomly, but

almost never is that the case.

And then measure something about them, say, the gene expression profiles or

measurements of their genetic variants, or something else.

We then want to use that information to make inference about the population.

So we want to take statistics and basically,

summarize these data that we've collected on this small sample, and

see if we can say what the proportion is in the big population up here.

So, what happens is there's a couple of different kinds of variability that

get introduced.

But the main one that we talk about a lot in statistics is sampling variability.

We use probability here to select which samples we want to look at.

Now, it could be that set of samples, or in a different case,

you might get a whole different set of samples.

And so you might get variable estimates of what we think is going to happen in

the population.
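
As a rough illustration of sampling variability, here is a minimal R sketch. The population of 10,000 shapes, the true fraction of 30%, and the sample size of 20 are made-up numbers chosen for illustration, not values from the lecture.

```r
# Hypothetical population: 10,000 shapes, 30% of which have the color of interest.
set.seed(1)
population <- rep(c(TRUE, FALSE), times = c(3000, 7000))

# Draw 1,000 independent random samples of 20 shapes each and
# record the fraction with the target color in every sample.
estimates <- replicate(1000, mean(sample(population, size = 20)))

mean(estimates)   # close to the true fraction of 0.3
sd(estimates)     # spread of the estimates across samples: the sampling variability
```

Each sample gives a somewhat different estimate of the same population fraction, and that spread is what the statistics later in the class try to quantify.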

So, it turns out there's three major sources of variability in

most genomic measurements.

So, the first is phenotype variability.

So, almost always in a genomic experiment,

at least if you're talking about a genomic experiment in humans.

But even in other organisms as well, you want to measure variability, say,

between cancers and normals, or between two different levels of output of a crop.

So, there's variability in the measurements due to that phenotype.

There's also variability due to measurement error, and this is a big one.

So, the measurement error can either be sort of random measurement error,

that sort of happens to every measurement as we go along.

It can also be measurement error that's correlated or biased.

There are all sorts of reasons why you might measure things

differently between different samples that aren't necessarily down to biology, or

the phenotype that you care about.

And so that often comes in the form of batch effects or

other things like that which we'll talk about later.

Finally, there's natural biological variation.

This is one of the reasons why genomics is so exciting, but also so hard.

Any two individuals will have variability in their genomic measurements

just because they're different people.

Not necessarily due to any specific biological difference between them, or

due to any measurement error.

In fact, you can take the exact same person and

measure the same kind of tissue repeatedly over time

in the same exact way, and you'll get variation from time to time.

Because there's natural variability in, say, the amount that each

gene is expressed, or what variants are present in which cells in the body.

So that natural variation is something that you have to account for

when modeling things with statistics.

So, the other thing that you need to keep in mind when you're doing

experiments is how are you going to measure those types of variability.

So, there are two big types of replicates that people often do in experiments.

So, replicates are when you do an experiment, and you want to do it more

than one time because there's so much variability in the sample.

You want to be able to measure that variability, and make an estimate of how

uncertain you are about certain statements that you're making.

So, the first kind of replication is technical replication.

So, this is where you have some sample, say, it's a sample that you've already

collected from somebody, and you're going to run genomic measurements on it.

So, what you do is you do some sort of processing of the samples, and

you do that processing two separate times.

You do the exact same processing, but you just repeat it twice.

Then you get what's called technical replicates.

So, these are replicates that just replicate the technical part of

the process.

Now, there are different kinds of technical replicates,

because there's obviously multi-step processes that generate these replicates.

So, you can have different kinds of technical replicates, but

in this case it's always the same biological sample that you're considering.

The other kind of replicate, and one that's very important to

collect in genomic measurements, is biological replicates.

So, this is where you take different people, or different organisms,

or different samples from different tissues and you prep them.

You, again, prep them the exact same way.

But here you're prepping two different samples, and so because you're

prepping two different samples, you get two biological replicates.

So, biological replicates measure something about biological variability,

not just how variable the machine was in making those measurements.

So, it turns out that natural biological variation in biological replicates

is sort of a natural quantity regardless of the technology used.

This is a figure from a paper that shows two genes, so

here's gene one in column one and gene two in column two.

And they're measured with two different technologies.

One is with RNA sequencing, and one is with RNA microarray.

And so you see that gene one is not very variable regardless of whether

you measure it on sequencing, or whether you measure it on microarray.

Each of these dots represents one individual, and there are 60 individuals.

They're the same individuals that were measured with the two technologies.

And you can see that the variability of this gene is small regardless of what you

use to measure it.

Similarly, if you look at this gene, it's highly variable across people

regardless of what technology you use to measure it.

And so you could see that variability, natural biological variability, is not

impacted by the choice of technology that you use to measure it with.

So you need to separate those two things when you're designing your experiments.

You need to account for how are you going to measure both the biological

variability, and the technical variability.
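
To make the distinction concrete, here is a small simulated sketch in R. The standard deviations (2.0 for biological variation, 0.5 for technical noise), the mean expression level of 10, and the 1,000 replicates are assumptions invented for this example; real experiments have far fewer replicates.

```r
set.seed(2)
bio_sd  <- 2.0    # assumed person-to-person (biological) standard deviation
tech_sd <- 0.5    # assumed measurement (technical) standard deviation
n <- 1000         # unrealistically many replicates, just to make the pattern clear

# Technical replicates: one biological sample, processed and measured n times.
true_level <- 10
technical <- true_level + rnorm(n, sd = tech_sd)

# Biological replicates: n different individuals, each measured once.
individual_levels <- rnorm(n, mean = 10, sd = bio_sd)
biological <- individual_levels + rnorm(n, sd = tech_sd)

sd(technical)    # reflects technical noise only
sd(biological)   # reflects biological plus technical variation
```

In this simulation, technical replicates alone say nothing about how much individuals differ, which is why both kinds of replicates need to be planned for.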

So, another thing to keep in mind is how many of these replicates you might

want to get.

So, if N is the number of measurements,

there's a very famous sample size formula which is the number of dollars that

you have divided by how much it costs to make each measurement.

Now, obviously that's a little bit silly, but

it is a formula that gets used [LAUGH] a surprising amount of the time.

But there are better ways to calculate how many samples you need.

And so, there are typical sample sizes that come in various different kinds of

genomics experiments.

So, if you're looking at a rare Mendelian disorder,

so this is a disorder that will be passed on.

It's highly penetrant, so

everybody that has a genetic variant in a family might get that same disorder.

Then, they typically have relatively small sample sizes.

This is largely because they're rare.

So, it's probably hard to find more people than that who have the disorder.

Ideally, you'd have a bigger sample size, but that might be all you have.

A typical RNA sequencing study, or study that measures gene

expression can vary somewhere between 10 biological replicates,

and 1,000 biological replicates.

1,000 would be a very big study.

10 would be on the small side, but

10 is just enough where you can really start to measure biological variability.

Similarly for most epigenetic studies, like a DNA methylation study, originally

they would be relatively small samples sizes, say, 10 or even fewer samples.

But now with epigenome-wide association studies, or EWAS, you're seeing

up to 1,000 samples of DNA methylation being taken.

Finally, there are genome-wide association studies for common diseases.

So, these are diseases where there's a genetic variant, but

we don't expect it to be perfectly linked with any particular trait.

It might be a relatively subtle effect size, so we want to measure lots and

lots of people.

So, there the smallest sample size you might even be able to get away with is

10,000.

And now, there are even studies that are being planned or being run with millions

of individuals that are being measured for their genetic variability.

So, the sample size depends on a couple of different things.

It depends on the availability of those samples.

It depends on how the variability manifests itself in the genomic data,

and it depends on how big a signal you're going to see.

And the key thing to keep in mind is that sequencing technology does not

eliminate biological variability.

So, there's sort of a rush almost every time that a new genomics technology comes

out to run an experiment with even a very small number of samples, and

you can get a really high profile paper.

But you won't necessarily be able to get reproducible results because you won't

be able to measure how uncertain you are about

those quantities that you're measuring.

So again, keep in mind that you get natural biological variability

that corresponds to specific genes that have specific variability regardless of

how you measure them.

This is actually often ignored.

So, here's an example.

This is a table that we published in this paper, and

this is a table of a large number of early RNA sequencing experiments.

And these RNA sequencing experiments were published in very good journals.

And they were very interesting from a technological perspective, but as you can

see, the number of technical replicates, in this column, and the number

of biological replicates, in this column, is often very small, with sometimes one,

two, or three replicates in each of these cases.

So, why is having such a small number of replicates a problem?

Well, the reason is because you won't have enough power to discover what you're

looking for.

So, what's power?

Power is a statistical term that means the probability

that you will discover the real signal if it is there.

So, power is a function of a large number of things.

It's a function of the signal size.

It's a function of how variable your measurements are.

It's a function of the number of samples that you've taken.

So typically, in a clinical study, for example, power is set to be 80%.

So, you want to get enough samples so that there's an 80% probability that,

if there's a real effect of the size you believe will be there,

you'll actually be able to detect it.

But it turns out that all power calculations are a little bit made up,

in the sense that you have to guess in advance,

how big is the signal going to be?

How variable is the data going to be?

Often you know a little bit about that from previous studies, but

you never know exactly what that's going to be.

The real point, though, is that if you haven't really considered how strong your

chances of making a discovery are, you won't be able to discover reproducible results.

So, higher power is in general better, and

it's a good idea to do a power calculation before running your experiment.

Because the worst thing in the world is to have run an entire experiment, and

then not have enough power to detect anything and to have to cancel

the experiment, or to actually have to run it and then you lose the money.

Lower powered studies tend not to replicate.

In other words, even the things that you discover turn out to often be just due

to noise or due to some sort of bias, like the winner's curse where you'd identify

things that are the largest effect sizes even if they're not real effects.

And so in general, you want to design your study to have high

enough power to detect what you're looking for.
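
The winner's curse can be seen in a small simulation. This is only a sketch under assumed numbers: a true difference of 5, a standard deviation of 10, and 10 samples per group (a low-powered design), repeated for 2,000 hypothetical studies.

```r
set.seed(3)
true_diff   <- 5     # assumed real difference in means
sd_each     <- 10    # assumed standard deviation within each group
n_per_group <- 10    # small sample size, so the study is low powered

# Simulate one study: draw both groups, run a two-sample t-test,
# and return the estimated difference together with the p-value.
one_study <- function() {
  x <- rnorm(n_per_group, mean = true_diff, sd = sd_each)
  y <- rnorm(n_per_group, mean = 0, sd = sd_each)
  test <- t.test(x, y, var.equal = TRUE)
  c(estimate = mean(x) - mean(y), p_value = test$p.value)
}

results <- t(replicate(2000, one_study()))
significant <- results[, "p_value"] < 0.05

mean(significant)                        # fraction of studies that "discover" the effect
mean(results[significant, "estimate"])   # average estimate among those discoveries
```

Only a minority of the simulated studies reach significance, and the ones that do overestimate the true difference on average, so a follow-up study would tend to see a smaller effect.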

So, here are a couple of things to consider.

Imagine you're just trying to compare two sets of measurements.

You're trying to compare the measurements for X and the measurements for Y.

Here, we're going to ignore what they are for a minute, and

suppose that there is a real difference in the two means.

And so the mean of the measurements for X is shown at this blue line, and

the mean of the measurements for Y is shown at this red line.

And so the goal is to find the difference between these two.

So, you can imagine that it'll be easier to find an effect if the blue line and

the red line are farther apart from each other.

Another thing: it'll be easier to find an effect if the red measurements,

the measurements for Y, are actually closer together.

If they were all very clumped very closely around this red line, and

all the blue measurements were clumped very closely around this blue line,

they'd also be easier to see.

The other thing to keep in mind is imagine you only got two measurements.

And imagine for the Y it was just this measurement and that measurement.

And for the X it might be say, this measurement and this measurement.

Then, this mean would stay the same, but

this mean would shift all the way over here.

So, it would look like they were closer together than they were.

So, again, a larger number of samples will tell you a lot more about

what the actual average is, and whether they're different from each other.

So, here's a little quick example.

I'm showing you some R code, but you don't need to worry about it too much.

It's just to show you that the power is a function of the sample size,

that is, the number of samples in each group; the difference between the two groups,

the effect size; and something about the variance.

In this case, it's the standard deviation.

We'll talk about what all these concepts are later in the class.

But just for now, if you think about it just conceptually, if you do

the power calculation for a sample size of 10, with a difference, or effect, of 5,

and a standard deviation of 10, you get a power of about 18%.

So, you can make that calculation just by plugging these numbers into

the power.t.test function.
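
The slide's code is not part of this transcript, but the calculation described can be sketched as follows. The argument values are the ones quoted in the lecture; everything else is left at the defaults of power.t.test (from R's built-in stats package), which assume a two-sample, two-sided test at a 5% significance level.

```r
# Power of a two-sample, two-sided t-test with 10 samples per group,
# a true difference in means of 5, and a standard deviation of 10.
result <- power.t.test(n = 10, delta = 5, sd = 10)

result$power   # about 0.18, matching the roughly 18% quoted in the lecture
```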

You can also do this in the reverse way.

So, suppose you want to get to 80% power,

which is the power that people typically use in clinical trials.

And then you know that the effect size is going to be something like 5.

And you know the variance, or the standard deviation, of those measurements.

Then you can use the power.t.test function

to get the sample size that you need in each group.
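
A sketch of the reverse calculation, assuming the same difference of 5 and standard deviation of 10 used in the lecture's other examples: leave n out and supply the target power instead, and power.t.test solves for the sample size per group.

```r
# Solve for the per-group sample size that gives 80% power
# for a difference of 5 with a standard deviation of 10.
needed <- power.t.test(power = 0.8, delta = 5, sd = 10)

needed$n            # required samples per group; generally not a whole number
ceiling(needed$n)   # round up when planning a real experiment
```

Exactly one of n, delta, sd, power, and sig.level should be left unspecified (NULL); the function solves for whichever one is missing.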

Similarly, you can change different parameters in this.

So, for example, if you know that the effect is going to be a positive

effect, that it's always going to be in the direction of the X

measurements, then you can make it just a one-sided calculation.

This is often not the case in genomic measurements.

We often don't know that one particular gene or

variant will go in one direction or the other.

But if you do, you can get a slightly higher power.

In other words, with the two-sided calculation, to get 80% power for

a difference of 5 with a standard deviation of 10, you needed 63.8 samples in each group.

With the one-sided calculation, you only need 50.2 samples in each group.
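
The comparison between the two calculations can be sketched like this, using the alternative argument of power.t.test. The variable names are just for illustration.

```r
# Same target (80% power, difference of 5, standard deviation of 10),
# computed for a two-sided and a one-sided alternative.
two_sided <- power.t.test(power = 0.8, delta = 5, sd = 10)
one_sided <- power.t.test(power = 0.8, delta = 5, sd = 10,
                          alternative = "one.sided")

# Required samples per group under each alternative.
c(two_sided = two_sided$n, one_sided = one_sided$n)
```

The one-sided version needs fewer samples only because it assumes the direction of the effect is known in advance, which, as noted above, is often not true for genomic measurements.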

So again, this is a function of a few different key quantities.

So, what you can often do, rather than making specific assumptions,

is make these sort of power curves.

So, here I plotted on the x-axis the difference in the means,

the real difference in the means between the two samples.

And here I plotted power on the y-axis.

And so as the difference in the means becomes either very far apart negative, or

very far apart positive,

you see an increase in power in each of these two directions.

So, it's easier and easier to detect: the probability of detecting the effect gets

higher and higher as the difference between the two groups gets larger.

The power also gets higher with the sample size. Here, the colors represent the sample size.

So, black is n=5, blue is n=10, red is n=20.

And you can see that there's an increase in power as the sample size increases.

It also increases if you know the real standard deviations versus just guessing

what they are from previous studies.

So in general, you can make these curves and

get an idea about what the power is for your study.

And you can use that to plan out how many biological and

technical replicates you need to get in order to make sure you have a good

chance of actually discovering something.
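
Power curves like the ones described can be drawn with a few lines of R. This is a sketch rather than the code behind the lecture's figure: the sample sizes and colors follow the lecture (black for n=5, blue for n=10, red for n=20), while the standard deviation of 10 and the range of differences from -20 to 20 are assumptions.

```r
deltas <- seq(-20, 20, by = 0.5)       # assumed range of true differences in means
sizes  <- c(5, 10, 20)                 # samples per group, as in the lecture
colors <- c("black", "blue", "red")

# Power of a two-sided, two-sample t-test for one difference and one sample size.
# strict = TRUE counts rejections in both tails, so the curve is symmetric
# around zero and equals the significance level when there is no difference.
power_at <- function(delta, n) {
  power.t.test(n = n, delta = abs(delta), sd = 10, strict = TRUE)$power
}

# One column of power values per sample size.
power_curves <- sapply(sizes, function(n) {
  sapply(deltas, function(d) power_at(d, n))
})

matplot(deltas, power_curves, type = "l", lty = 1, col = colors,
        xlab = "Difference in means", ylab = "Power")
legend("top", legend = paste("n =", sizes), col = colors, lty = 1)
```

Reading across the plot shows power rising as the difference grows in either direction, and reading between the curves shows it rising with the sample size.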