An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


A course from Johns Hopkins University

Statistics for Genomic Data Science

116 ratings


From this lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Once you fit a statistical model,

say a linear regression model,

the next thing that you might want to do is perform some sort of inference.

So, remember that that model is a fit to the data that you collected.

But again, according to the central dogma of statistics, we actually want to take

the sample that we collected, fit some regression model, and then make some

statement about the relationship between the variables and the population.

So again, if we have some variable that we care about in the population, say

the fraction of pink symbols in the population, we

actually get an estimate of that fraction based on the sample that we've taken.

Similarly we would get an estimate of the regression coefficient

between the variables in the [INAUDIBLE] population.

So what we want to do next is quantify how strong that relationship is and

how much uncertainty we have in our estimate from the sample.

So, for example, you could get a totally different estimate if you took a different

sample, and you want to know how much variability we actually expect.

And this matters a lot, because the variability differs from one measurement to another.

So again look at this example where you have three different genes and

you can imagine that you had two groups.

In gene one you can see that there is a difference in expression between the two

groups, but there's also a relatively large amount of variability.

For gene two there's a difference in expression and

there's very little variability so there's definitely a very clear difference

that we would likely expect to see again if we repeated the experiment.

And for gene three there's a small difference, but there's also very little variability.

So this might actually be a replicable but small difference between the groups.
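
To make the three-gene comparison concrete, here is a minimal sketch that reports, for each gene, the difference in mean expression between two groups next to a measure of how variable that difference is. The expression values are hypothetical numbers invented for illustration (the slide's data are not in the transcript), and `difference_and_spread` is a helper name assumed here, not something from the course materials.

```python
import math
import statistics

def difference_and_spread(group_a, group_b):
    """Return (difference in means, standard error of that difference)."""
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    # statistics.variance uses the n - 1 denominator.
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return diff, se

# Hypothetical expression values for two groups, one pair per gene.
genes = {
    "gene one": ([4.0, 6.5, 5.0, 7.5, 3.5], [7.0, 9.5, 6.0, 10.0, 8.0]),
    "gene two": ([5.0, 5.2, 4.9, 5.1, 4.8], [8.0, 8.1, 7.9, 8.2, 7.8]),
    "gene three": ([5.00, 5.02, 4.98, 5.01, 4.99], [5.20, 5.22, 5.18, 5.21, 5.19]),
}

for name, (group_a, group_b) in genes.items():
    diff, se = difference_and_spread(group_a, group_b)
    print(f"{name}: difference = {diff:.2f}, standard error = {se:.3f}")
```

With numbers like these, gene three has a small difference but an even smaller standard error, which is the "small but replicable" situation described above.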

So the variability is a quantity that matters a lot.

So if we go back to the galton$child height data,

imagine that we're trying to estimate the mean height in the population.

So what we could do is we could estimate the mean height in the sample by

taking the average.

Then we could estimate something about the variability.

So one way to measure variability is to take the standard deviation,

or the variance, and in this case we can estimate the variance by

taking the sample mean and then calculating the distance from every data point

to the sample mean and squaring it.

So again, this is that distance calculation just like we did with

clustering, but

now we've applied it to how far each sample is from our estimate.

So we kind of average those, and we use n - 1 here,

because we're actually trying to get an unbiased estimate of the variance,

although that doesn't necessarily matter once you have a large enough sample,

that -1 isn't a very important quantity.
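
The mean and variance calculation described above can be sketched in a few lines. This is only an illustration: the heights below are hypothetical values standing in for the `galton$child` data, and the function names are assumed here rather than taken from the course.

```python
import math

def sample_mean(values):
    """Average of the observations."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sum of squared distances to the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical child heights in inches (not the real Galton data).
heights = [61.7, 64.2, 66.5, 67.0, 68.3, 69.1, 70.5, 72.2]

mean_height = sample_mean(heights)
var_height = sample_variance(heights)
sd_height = math.sqrt(var_height)  # standard deviation

print(f"mean = {mean_height:.2f}, variance = {var_height:.2f}, sd = {sd_height:.2f}")
```

Dividing by the number of observations instead of that number minus one gives a slightly smaller value; the two get closer as the sample grows.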

So then we have an estimate of our parameter, and we have an estimate of

the variability, and then we can define something called a confidence interval.

So, the confidence interval is basically our estimate minus some fraction of

the variability to the lower side, and then the same thing,

our estimate, plus some fraction of the variability to the upper side.

So this tells you a little bit about how much variability we have in the sample.

That's the s_x.

That's the square root of this variance estimate up here.

And then the square root of n tells us that as the number of

samples grows bigger and bigger, we have less and

less variability in our estimate of the mean.

And then we also have some constant here that says how wide do we

want this confidence interval to be?

How much do we want to trust it?

And so a confidence interval is defined by the probability that

the real parameter is covered by the confidence interval.

We can set that to be some function of this constant here.
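
The slide being described is not part of the transcript. Assuming the standard forms, with x_i the observations, n the number of samples, \bar{x} the sample mean, s_x the sample standard deviation and c the constant, the variance estimate and the confidence interval would read:

```latex
s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
\left[\; \bar{x} - c\,\frac{s_x}{\sqrt{n}},\;\; \bar{x} + c\,\frac{s_x}{\sqrt{n}} \;\right]
```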

So if the data are normal and we set the constant to

be 1.96, then the probability that the true parameter is covered by

the confidence interval if we recalculated the confidence interval over and

over again for new data sets would be 95%.
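
As a rough sketch of that calculation, the function below takes the estimate, subtracts and adds the constant times s_x divided by the square root of the sample size, and returns the two ends of the interval. The heights are hypothetical, and `mean_confidence_interval` is a name assumed for this example.

```python
import math

def mean_confidence_interval(values, c=1.96):
    """Interval: mean -/+ c * s_x / sqrt(n), with s_x using the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    half_width = c * math.sqrt(variance) / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical child heights in inches (not the real Galton data).
heights = [61.7, 64.2, 66.5, 67.0, 68.3, 69.1, 70.5, 72.2]

lower, upper = mean_confidence_interval(heights)
print(f"approximate 95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```

The constant 1.96 relies on the normal approximation mentioned above; with a sample this small, a quantile from the t distribution would usually be preferred.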

So in general, confidence intervals give you some idea of

the expected range of values that you're somewhat confident will cover

the true value if you repeated the study over and over and over again.
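
The "repeat the study over and over" idea can be checked with a small simulation. This sketch assumes normally distributed data with a true mean and standard deviation chosen arbitrarily for illustration; it draws many samples, builds the interval each time, and counts how often the true mean is covered.

```python
import math
import random

def mean_confidence_interval(values, c=1.96):
    """Interval: mean -/+ c * s_x / sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    half_width = c * math.sqrt(variance) / math.sqrt(n)
    return mean - half_width, mean + half_width

def estimate_coverage(true_mean=68.0, true_sd=2.5, n=100,
                      n_studies=2000, c=1.96, seed=1):
    """Fraction of simulated studies whose interval contains true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = mean_confidence_interval(sample, c)
        if lower <= true_mean <= upper:
            covered += 1
    return covered / n_studies

coverage = estimate_coverage()
print(f"fraction of intervals covering the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95, though not exactly on it, since it is itself an estimate from a finite number of simulated studies.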

So inference, again, is a whole different class.

I've linked here to a really good one.

But you could also keep in mind that the basic thing that you need

to do is you need to estimate the quantity that you care about in

this case the regression coefficient, and then do something like

quantify the uncertainty in that regression coefficient.

And then use that uncertainty quantification to say something about

what the likely values are for the real parameter in the population.
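
The same recipe can be sketched for a regression coefficient: estimate the slope, estimate its standard error, and report a range of likely values. This example assumes simple linear regression with one predictor and uses the large-sample constant 1.96; the parent and child heights are hypothetical, and `slope_with_interval` is a name made up for this sketch.

```python
import math

def slope_with_interval(x, y, c=1.96):
    """Least-squares slope, its standard error, and slope -/+ c * standard error."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance uses n - 2 because two coefficients were estimated.
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, std_error, (slope - c * std_error, slope + c * std_error)

# Hypothetical parent and child heights in inches (not the real Galton data).
parent = [64.0, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 72.0]
child = [65.1, 66.0, 66.2, 67.9, 67.4, 68.8, 69.0, 70.6]

slope, std_error, (lower, upper) = slope_with_interval(parent, child)
print(f"slope = {slope:.3f}, standard error = {std_error:.3f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

For a sample as small as this one, a t quantile with n - 2 degrees of freedom would be the more careful choice than 1.96.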