An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

124 ratings

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

This lecture is about representing data.

Specifically,

we're going to be talking about how you represent data with mathematical notation

in a compact way that makes it easy to communicate what you've done when

you've applied a statistical method to a set of genomic data.

And so, if you remember from a previous lecture, the central dogma of

statistics is basically that you have a large population of samples or

a large population of people that you would like to understand.

And they have some characteristics you'd like to measure,

but maybe it's expensive to measure it on all of them.

Or maybe it's just very hard to do that.

In genomics, that's very common.

It can cost thousands of dollars to collect a single sample and so

you might not want to sample hundreds of thousands of people.

Or if you have a lot of money,

you might be able to, but in general you won't be able to.

And so what you do is use probability to take a smaller sub-sample of the population;

usually a random sample is best, depending on the situation.

And then once you have that sub-population that you've sampled from the big

population, you make measurements on them.

Because now you have a much smaller set of people or samples to measure and so you

can apply even quite expensive techniques or even time consuming techniques to them.

And then you use statistical inference to say something about the population.

So how are we going to represent all these different quantities that

are involved in statistical inference?

The first one that we're going to look at is a parameter.

A parameter is a characteristic of the population.

It's something that traditionally in frequentist statistics is treated as

a fixed value.

So in this case, we're going to look at how many of the symbols are pink

versus how many of the symbols are grey.

In the population, 10 out of 14 symbols are pink.

And so the fraction of pink symbols represents the quantity that

we would like to estimate in this simple, sort of idealized experiment,

and so we usually represent that with a Greek letter.

In this case I've used theta, and

so theta is the parameter that we're trying to estimate.

And then the next thing that you would want to represent is

a value of a particular data point for a particular sample.

So for example, we might use the letter C to represent the color

of one of the symbols in our data set,

so in this case, this upper-right pink symbol is denoted with a capital C.

If we have more than one value that we need to denote,

we usually do that with subscripts.

If you want to label the values for each of the different samples, in this case,

you would look at the three different symbols that we have in our sample.

Each one gets a subscript, so you get C1, C2 and C3.

That's how you represent the values we've measured for those data.

And then the next thing that you would want to do is calculate

an estimate that you can use to infer back to the population.

So we don't know the whole population, but we can get an estimate of the parameter

in the population by calculating a similar function on our sample.

So in this case, say we want to estimate the fraction of pink symbols,

what we would do is just count the fraction of pink symbols in our sample.

And so when we do that, we get two thirds of the sample values are pink.

And so that population parameter, theta,

remember, represents the fraction in the real population.

Our estimate of that is theta hat.

And we almost always use hats over Greek symbols to represent the estimate

that we calculate from the sample that we've taken.

And so we use that to infer back to the bigger population.
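
To make the difference between the parameter and its estimate concrete, here is a minimal Python sketch of the pink and grey symbols. The population counts (10 pink out of 14) and the sample fraction (two of three pink) come from the example above; the order of the colors in the sample and the variable names are assumptions made only for illustration.

```python
from fractions import Fraction

# Population from the example: 10 pink symbols out of 14.
population = ["pink"] * 10 + ["grey"] * 4

# theta: the parameter, a fixed characteristic of the whole population.
theta = Fraction(population.count("pink"), len(population))

# The sample of three symbols, C1, C2, C3. Two of the three are pink;
# which positions are pink is an assumption made for this sketch.
c1, c2, c3 = "pink", "pink", "grey"
sample = [c1, c2, c3]

# theta_hat: the same calculation, applied to the sample instead.
theta_hat = Fraction(sample.count("pink"), len(sample))

print("theta     =", theta)      # 5/7, i.e. 10 out of 14
print("theta_hat =", theta_hat)  # 2/3
```

Notice that theta hat (2/3) is close to, but not equal to, theta (10/14). That gap is what statistical inference has to account for.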

We'll talk more about that later. So, just to summarize:

Data points are usually represented by letters.

When we're talking about hypothetical values of the data,

they're usually capital letters, and when we're talking about

concrete values of specific data points, they're usually lowercase.

If we have more than one value of a particular variable, we use subscripts.

So there's C1, C2, C3.

And sometimes we write X for more than one variable.

We do this to make the math notation easier, although

it can be very confusing and frustrating.

Sometimes X sub 1, X with a subscript 1, will represent one variable and

X sub 2, X with a subscript 2, will represent a second variable.

In that case, what we need to do is add another subscript to be able to indicate

which person we measured those different values on, so X11 might be the count for

gene 1 on person 1.

And then you would get X21, which is the count for gene 2 on person 1,

and so forth.

And so sometimes you have multiple subscripts to annotate

the different variables as well as the different samples in your sampled set.
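
The double-subscript convention can be sketched in Python as a small table of counts, with one row per gene and one column per person. The count values and the helper function `x` are hypothetical, made up only to show how the first subscript picks the variable (gene) and the second picks the sample (person).

```python
# Hypothetical counts, made up for illustration:
# rows are genes, columns are people.
counts = [
    [12, 7, 30],  # gene 1 measured on persons 1, 2, 3
    [0, 5, 2],    # gene 2 measured on persons 1, 2, 3
]


def x(gene, person):
    """Return X_{gene, person} using 1-based subscripts, as in the notation."""
    # Python lists are 0-based, so shift both subscripts down by one.
    return counts[gene - 1][person - 1]


print("X11 =", x(1, 1))  # count for gene 1 on person 1
print("X21 =", x(2, 1))  # count for gene 2 on person 1
```

The helper exists only to bridge the 1-based subscripts used in the math and the 0-based indexing used by Python lists.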

And so, when we want to look at quantities in the global population,

we use Greek letters, so usually here we use

theta to represent the proportion of pink symbols in the population.

For a more concrete example,

suppose you wanted to measure the heights of everybody in the US, and

you wanted to look at the average height of a person in the United States.

You could call that value theta.

Obviously it's expensive to measure the heights of everyone in the US.

So suppose you took a random sample of 1,000 people, and

you took their average height.

That would be an estimate of the population parameter, and

you would denote that with a hat, so

theta hat would be the estimated average height in the population.
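
Here is a minimal Python sketch of the height example. Nothing here is real data: the population is simulated, and its size (100,000), mean (170 cm), and spread (10 cm) are assumptions chosen only so the sketch runs quickly. Only the sample size of 1,000 comes from the example above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Stand-in population: 100,000 simulated heights in cm (assumed values).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# theta: the population average, which is usually unknown in practice.
theta = statistics.mean(population)

# A random sample of 1,000 people, drawn without replacement.
sample = rng.sample(population, 1000)

# theta_hat: the sample average, our estimate of theta.
theta_hat = statistics.mean(sample)

print(f"theta     = {theta:.2f}")
print(f"theta_hat = {theta_hat:.2f}")
```

In a real study you would only ever see `sample` and `theta_hat`; the sketch computes `theta` as well just to show how close the estimate lands.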

In regression models, which we'll talk about a lot later, we usually treat

the variable Y as the outcome and the X variables as the covariates,

or the variables that you're trying to predict the outcome from.

So the two most common letters used to represent variables in statistics are Y

for the outcome and X for the covariate variables.
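
As a sketch of how the Y and X notation is typically written down, one common form of a simple regression model is shown below. The subscript i indexes the samples, following the subscript convention from earlier. The coefficients beta 0 and beta 1 and the error term e are assumptions introduced here for illustration; they are not defined in this lecture.

```latex
Y_i = \beta_0 + \beta_1 X_i + e_i, \qquad i = 1, \ldots, n
```

Note that the coefficients are Greek letters because they describe the population, so their estimates from a sample would be written with hats.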

And so that's how you represent data with mathematical notation.