A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

180 ratings

From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings and welcome back to Section D of lecture set six.

And in this section, we're going to talk about estimating this theoretical

sampling distribution, which we've shown how to do with simulations.

But now we're going to talk about doing it based on what we'll get in

reality: the results from a single sample

of data from any population that we're studying.

So, upon completion of this lecture section, you should be able to

explain what we'll define as the central limit theorem,

which from now on we'll refer to by its initials, the CLT,

with regard to the properties of theoretical sampling distributions.

It's something that unifies

some of the specific examples we saw in lecture Sections B and C.

And then we'll see how to estimate the variability, in the sampling distribution

for sample means and proportions, using the results from a single random sample.

And then we'll hopefully give you some window to begin to appreciate

how this estimated sampling distribution can ultimately allow for

the incorporation of sampling variability into our estimated sample statistic

when going from the information in our sample to the larger population that we can't

fully observe.

So, real-life research: let's talk a little bit about that.

In Sections B and C, we showed the results of

computer simulations to illustrate some

general properties of sampling distributions.

I had the population-level information, or at least pretended I did,

stored on the computer.

And then I took multiple random samples of

the same size from that population and computed multiple

sample statistics.

And then we looked at the observed distribution of those sample statistics.

In real-life research, if we had access to the whole population, we'd be done.

But in reality,

we generally won't have all the population-level information.

And so we can only take one sample from each population under study.

So the question is, with a single sample, how can we begin

to understand how the estimators we're

interested in would vary across multiple samples,

to get some sense of how stable or noisy they are?

So the question we want to answer is, how can we use the results of a

single sample to estimate the behind-the-scenes

theoretical sampling distribution of a sample statistic?

And ultimately, and we'll get into this in lecture seven and beyond,

how can we use this result

to help us relate the information in our sample to the larger population?

So let's just summarize some common themes

that we saw in Sections B and C.

Regardless of the type of data

we were summarizing in the simulations (continuous data,

binary data, or time-to-event data), using the appropriate sample

statistics (means for continuous data, proportions for

binary data, and incidence rates for time-to-event data),

the empirical sampling distribution of

these statistics that we showed was generally symmetric,

in other words, approximately normal, regardless of the

size of the sample each statistic was based upon.

There was variability in the

estimated sample statistics, hence that distribution.

But this distribution was centered at, in other words, on average

it took on, the true value of the

underlying population-level quantity being estimated by the statistic.

And then, the variability in this distribution across the

different summary statistics from multiple random samples, systematically decreased

the larger the sample each estimate was based upon.

So, we saw some specific cases demonstrated

through simulation, but there's a mathematical theorem that

unifies what we saw in those examples

and generalizes it to pretty much every situation.

And it's called the Central Limit Theorem.

And again, we will now refer to this by its cooler name, the CLT,

its initials.

And basically this CLT states that the

theoretical sampling distribution of a sample statistic will

[INAUDIBLE]

to take every.

All we did was take a certain number

of random samples, but for the theoretical sampling distribution,

which shows the distribution of a sample statistic

across all random samples from a larger population,

the central limit theorem says this distribution will be approximately

normal and have, on average, the true population-level value being estimated.

And the variability in the estimates

in this distribution will be a function

of the variability of individual measurements in the

population, the standard deviation, and the size

of the sample the statistic is based upon.

So this variation in the sampling

distribution doesn't measure variation in individual values

for any one sample; it measures the

variation in the sample statistics across multiple

random samples.

And this variability in statistics across multiple random samples is sort of like

a standard deviation of sample statistics.

But in order to distinguish this variation

in sample estimates across multiple random samples

from variation in individual values in any one sample,

we call this variation in sample statistics

across multiple random samples of the same size the standard error.
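To make the standard error idea concrete, here is a minimal Python sketch of the kind of simulation described in Sections B and C. It is not the code used in the lecture. The population (an exponential distribution with mean 4, so its standard deviation is also 4), the function name `simulate_sample_means`, the sample sizes, and the seed are all assumptions chosen purely for illustration.

```python
import random
import statistics

POP_MEAN = 4.0  # hypothetical population mean; for an exponential, sigma equals the mean


def simulate_sample_means(n, num_samples=2000, seed=1):
    """Draw num_samples random samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1 / POP_MEAN) for _ in range(n))
        for _ in range(num_samples)
    ]


for n in (25, 100, 400):
    means = simulate_sample_means(n)
    center = statistics.fmean(means)        # should sit near the true mean
    empirical_se = statistics.stdev(means)  # spread of the sample means
    theoretical_se = POP_MEAN / n ** 0.5    # sigma / sqrt(n)
    print(f"n={n:>3}  center={center:.2f}  "
          f"empirical SE={empirical_se:.3f}  sigma/sqrt(n)={theoretical_se:.3f}")
```

Under these assumptions, the spread of the simulated sample means should track sigma divided by the square root of n and shrink as n grows, even though the individual values are heavily skewed.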

So let's just look at this pictorially and then give some specific examples.

The CLT basically backs up the results we saw in our simulations and

takes them to another level, saying this will

pretty much happen in any situation you can

throw at it.

If we were to take multiple random samples of the same size from the same population

and look at the distribution

of the sample estimates across these multiple

samples, it would be approximately normal, as you

can see here. If we did a histogram of all these

sample statistic estimates and then fit a curve to it,

it would be approximately normal.

There would be variation in these statistics.

The variability under here, what we call the standard error,

is a function of several things, including

the size of the sample that each statistic is based upon.

And on average,

these estimates would equal

the true value of the thing they were estimating whether it be a mean

for continuous data, a proportion for binary

data, or a true incidence rate

for time-to-event data.

So let's put some specifics on this and give an example here:

sample means based on samples of size n.

So if we are summarizing a sample of continuous data and we compute the sample

mean, suppose we did this for many samples.

Well, the distribution of our sample mean estimates, if we were to actually

do a histogram and plot it,

would be reasonably described by a normal curve.

Say we did this over and over again and plotted our sample means.

There'd of course be this variation in the values,

but on average these would work out to be the true mean.

Some estimates

would come below it.

Some would come above it, but on average, it

would be, these estimates would average out to the truth.

And then this variation, the standard error, is actually

a function of two things. It's equal to the

variability of individual values in the population from which we're sampling, sigma,

divided by the square root of the sample size, n.

So you can see explicitly in this formula that n is in the denominator.

So the standard error

will decrease as a function

of sample size. There's only one problem.

We don't know the true mean, of course. That's why we're doing all this.

And

hence we don't know the true standard deviation of individual values

in the population of continuous data that we've taken the sample from.

But what we're going to do in real

life is estimate the standard error

by replacing sigma in that formula, which we

don't know, with the standard deviation of the values in the sample we've taken.

So you see, between the central limit theorem and

this replacement, we've pretty much

characterized the behind-the-scenes sampling distribution

of the sample mean from samples of the same size as the sample we've taken.

So let's look at a real example here.

Okay?

So suppose we had that sample of

113 adult men taken from a clinical population.

Now there's some

theoretical larger population. We've only got 113 observations from it.

Our observed sample mean is 123.6 millimeters of mercury.

That's our best guess for the true mean amongst all men.

That's all we've got.

And the true variation in

blood pressures among all men in this population

we can estimate

with our sample standard deviation, which is 12.9 millimeters of mercury.

This is all the information that we have.

So, can we characterize, as best we can, the theoretical sampling distribution

of means of samples of size 113 from this clinical population?

Well, let's piece this together with what we know.

The central limit theorem says, I know you've

only got one mean here, but if you were to take multiple

samples

of the same size, compute their means, and do a histogram of

them across all the random samples of the same size,

the distribution would be approximately normal.

Furthermore, it would be centered at the true

mean blood pressure of all men in this population.

And finally, the standard error,

and this is sometimes how we write the standard error

of a sample mean, SE with X bar in parentheses,

the true standard error, is equal to sigma over the square root of 113.

But of course we don't know sigma.

So the best we can do is give an

estimated standard error, sometimes written as SE hat X bar.

And here we replace that unknown sigma with the known estimate of sigma,

12.9 millimeters of mercury, divided by the square root of 113.

And if we do the math for this, it

turns out the estimated standard error of sample mean blood

pressure measurements across multiple random samples, each based

on 113 men, is 1.2 millimeters of mercury.
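The arithmetic for this blood pressure example can be checked with a short Python sketch. The sample standard deviation (12.9 mmHg) and sample size (113) come from the lecture; the function name `estimated_se_mean` is a hypothetical helper introduced here for illustration.

```python
import math


def estimated_se_mean(sample_sd, n):
    """Estimated standard error of a sample mean: s / sqrt(n)."""
    return sample_sd / math.sqrt(n)


# Values quoted in the lecture: s = 12.9 mmHg, n = 113 men.
se_bp = estimated_se_mean(12.9, 113)
print(f"Estimated SE of the sample mean: {se_bp:.1f} mmHg")  # prints 1.2 mmHg
```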

So we've pretty much characterized

the theoretical sampling distribution using the

results from the central limit theorem

and this estimate of the standard error based on our sample results.

You may say, why are we doing this?

Well, we'll come back to address that question in great detail in lecture seven,

and we'll also shed some light on this at the end of this lecture section.

But let's practice another example

of estimating

the sampling distribution based on the results of a single sample of data.

So how about if we looked at the length of stay claims at the

Heritage Health plan with an in-patient stay of at least one day in 2011.

So this is a much larger sample than the 113 we had before.

There are 12,928 claims in these data.

Well, our sample mean length of stay is 4.3 days.

And our sample standard deviation in those values was 4.9 days.

There was a lot of variation in individual values.

And it was a heavily skewed distribution in our sample.

So what does this central limit theorem tell

us about the sampling distribution of sample means

from random samples of 12,928 claims from this Heritage Health population?

Well, it tells us this.

We've already got this, by the way, that if we did a histogram of

the sample means across the multiple random

samples, it would be approximately normally distributed.

It would be centered on average at the true mean length of stay.

And we can't specify the standard error exactly, because we don't know the sigma

component of the formula, but we can estimate it based on our sample results

by taking our observed standard deviation of the values in our

sample and dividing by the square root of our sample size.

This sample is

very large, and if you do the math on this,

what you get is a standard error of about 0.043 days.

So sample means based on samples of 12,928 claims

get very close to the underlying truth.

And they vary around it such that, on average, the distance

an estimate falls from the truth is about 0.043 days.
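A quick Python sketch can show how much the large sample size matters in this length of stay example. The standard deviation (4.9 days) and the sample size of 12,928 claims come from the lecture. The smaller sample sizes (113 and 1,000) are hypothetical what-if values added here only for comparison, and they assume the same sample standard deviation.

```python
import math

sample_sd = 4.9     # days, sample standard deviation quoted in the lecture
lecture_n = 12928   # number of claims in the lecture's sample

# Estimated standard error s / sqrt(n) for the lecture's n and for
# hypothetical smaller samples with the same standard deviation.
se_by_n = {n: sample_sd / math.sqrt(n) for n in (113, 1000, lecture_n)}

for n, se in se_by_n.items():
    print(f"n = {n:>6}: estimated SE = {se:.3f} days")
```

Because n sits under a square root in the denominator, going from 113 to 12,928 observations cuts the estimated standard error by a factor of roughly ten, not a hundred.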

So we've described this sampling distribution.

Let's look at another example,

this time dealing with binary data.

The central limit theorem, well let's look at what it tells us.

If we took a sample of binary data of size n,

and did this over and over again from the same population,

and computed multiple p hats, each based on samples of

the same size, and did a histogram of those p hats,

sorry, having a little trouble drawing the histogram part here,

[INAUDIBLE]

that would be approximately normally distributed.

It would be centered; you know, there would be some variation in

these estimates, but on average they'd equal the unknown truth p,

the true proportion. And the standard error

of these proportions is equal to the square

root of the proportion itself times one minus that proportion

over the size of each sample the estimated proportions are based upon.

So notice this.

We mentioned briefly the standard deviation of

binary data and said it's not very useful as

a single summary measure for binary outcomes.

However, we can see in this formula that

it plays a role and essentially this formula

is the standard deviation of binary data divided

by the square root of the sample size.

So it's pretty much the same formula as we have for means.

It's just expressed slightly differently because the standard deviation of binary

data is a function of the proportion of ones in that population.

Now here's the rub. We don't know p, right?

I don't know p, you don't know p. And if we take a single sample, we

can only estimate it. So, based on a single sample, we can still

estimate this standard error by substituting our best estimate of p

into the formula.

So we use our sample proportion in the formula

to estimate the uncertainty in our estimated sample proportion.

So, let's just give an example of this. Remember the maternal infant HIV study?

And remember we had a total of 363 HIV-positive pregnant mothers in the study.

Some were given AZT.

Some were given the placebo, and then their children were followed

for 18 months after birth to see if they actually contracted HIV.

And after 18 months, 53 of the 363 births were

HIV positive, for a proportion or probability of 15%

of contracting HIV within 18 months after birth.

So what does the central limit theorem tell us more generally about the

distribution of estimated proportions based on samples of size 363?

Well, it says,

again, if you were to take multiple samples from this population

of HIV-positive pregnant women and then look at the birth outcomes,

if you did this

[INAUDIBLE]

363 women at a time and computed the proportion

of children who contracted HIV for each sample of 363 mothers,

and then did a histogram of these different proportions,

it would be approximately normally shaped when you plotted the values.

It would be centered at the true underlying population proportion

of infants who would ultimately contract HIV within 18 months.

And then the standard error, or the variability in these sample estimates,

would be a function of the true proportion, over the square root of 363.

Now again, we can't observe that standard error directly because we

don't know the true proportion but we can estimate it, and substitute

in our sample proportion to understand how variable, on average,

sample proportions from samples of 363 persons

are. And so in our case, it would be the square

root of 0.15 times 0.85, which is one

minus 0.15, divided by 363. If you

do the math on this, it comes out

to be 0.019 or 1.9%.

So you might be saying, this is great John. Wow, we get a single sample.

We use the results from the central limit theorem

and our sample data to estimate the characteristics of the behind-the-scenes

sampling distribution for a statistic

whose uncertainty we're interested in understanding.

Well, how is this going to help us?

And it seems like sort of a theoretical exercise.
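Before moving on, the standard error from the maternal-infant HIV example can be reproduced with a small Python sketch. The counts (53 infections among 363 births) and the rounded proportion of 0.15 come from the lecture; the function name `estimated_se_proportion` is a hypothetical helper used here for illustration.

```python
import math


def estimated_se_proportion(p_hat, n):
    """Estimated standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


n_births = 363
p_hat_exact = 53 / n_births  # unrounded sample proportion, about 0.146
se_rounded = estimated_se_proportion(0.15, n_births)       # using the rounded 15% from the lecture
se_exact = estimated_se_proportion(p_hat_exact, n_births)  # using the unrounded proportion

print(f"p_hat = {p_hat_exact:.3f}")
print(f"SE with p_hat = 0.15:   {se_rounded:.3f}")
print(f"SE with p_hat = 53/363: {se_exact:.3f}")
```

Both versions round to about 0.019, or 1.9%, so rounding the proportion to 15% before plugging it in does not change the result at this precision.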

Well, we're going to get into this in detail in lecture seven, but

let's just try and piece together some of the logic we'll be working with.

Regardless of the statistic we're looking

at, the CLT says, okay, were you to take multiple

samples of the same size, and plot the

distribution of your statistics across these multiple samples,

the distribution would be roughly normal and centered at the truth.

You may say, that's great, all right, but I don't know where that center is.

Furthermore,

the central limit theorem gives us a formula for the standard

error of these values, which we could estimate from our sample.

So we could estimate a standard error whether it be for a mean or

a proportion. And we haven't shown in this lecture set,

but we'll do so in lecture seven, how to do this for an incidence rate.

So, we can estimate the standard error of our statistic from a single sample.

So how is this going to

help us?

Well, think about the properties of the normal curve for a minute.

What do we know about the

normal curve? We do know a couple of things.

We know that, if we were standing at the center of a

normal curve, and we went within plus or minus two variability units,

standard units under the curve,

remember, standard error is just a form of standard deviation,

it just applies to

sample statistics instead of individual observations, so if we went plus or minus two

standard errors from that truth, we'd get 95% of the estimates under the curve.

So in other words, about 95% of the estimates of the truth,

our sample-based estimates,

fall within

two standard errors of the truth.

And we can estimate the standard error, at least.
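The "plus or minus two" rule for the normal curve can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative. It also shows the multiplier 1.96, which is not mentioned in this lecture section but is the value that gives 95% exactly, so "two" is a convenient rounding.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Proportion of the curve within +/- 2 standard units of the center.
within_two = z.cdf(2) - z.cdf(-2)

# Proportion within +/- 1.96 standard units of the center.
within_196 = z.cdf(1.96) - z.cdf(-1.96)

print(f"Within 2 standard units:    {within_two:.4f}")
print(f"Within 1.96 standard units: {within_196:.4f}")
```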

Well, you might say, that's great. But if we knew the truth, we'd be done.

Well, what happens in statistics and in

research, we take a single sample and we're

going to get an estimate, we are going to get

one estimate of the statistic, from one sample.

And it's going to fall somewhere under this curve.

Right? Because this curve describes the

behavior of all such statistics. It may be

here, way out here.

Maybe over here, maybe here, or it may be right by

the truth, it may be over here, but we'll never know.

But think about this: for most, for 95% of the estimates we

get, they'll fall within this two-standard-error range around the truth.

So how does that potentially help us? Well, if we start with our estimate

and put a two-standard-error bound

around it, we'll get an interval.

For 95% of the samples we could take and compute this interval from,

this interval will include the unknown truth between its endpoints.

And we'll go through this in more detail in lecture seven,

but ultimately this interval helps us to quantify the uncertainty

in our estimate and it's what's called a confidence interval.

And it relies on us using these estimated properties

of the sampling distribution to put this all together.
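The coverage logic just described can be explored with a simulation. The Python sketch below is only an illustration under stated assumptions, not the lecture's own code: the population is a hypothetical exponential distribution with true mean 4, and the function name `coverage`, the sample size, the number of samples, and the seed are arbitrary choices. Each simulated sample builds its interval as the sample mean plus or minus two estimated standard errors, using only that sample's own data.

```python
import math
import random
import statistics


def coverage(n=100, num_samples=2000, true_mean=4.0, seed=7):
    """Fraction of simulated samples whose interval (mean +/- 2 SE) contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # One random sample of size n from the hypothetical skewed population.
        sample = [rng.expovariate(1 / true_mean) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        se_hat = statistics.stdev(sample) / math.sqrt(n)  # estimated from this sample alone
        if x_bar - 2 * se_hat <= true_mean <= x_bar + 2 * se_hat:
            hits += 1
    return hits / num_samples


print(f"Share of intervals containing the truth: {coverage():.3f}")
```

Under these assumptions the share should come out close to 95%, though not exactly, since the standard error is estimated and the normal approximation is only approximate for skewed data.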

So just to summarize, in real life research, generally only

one sample will be taken from each population being studied.

The sampling

distribution for the sample summary measure of interest,

like a mean, proportion, or incidence rate, can be estimated by coupling the results

of the central limit theorem with information

from this single sample from a population.

Ultimately, this process will enable the creation of an interval that gives a range

of possibilities for the unknown

population-level value of the quantity being estimated.

So,

not only do we know the shape and where the center is, we

can estimate the standard error of our estimate

based on the results from a single sample.

So the standard error of a sample mean we can estimate with the sample

standard deviation of individual values divided by

the square root of the sample size.

The standard error of a sample proportion

based on n observations can be estimated by

taking the square root of the estimated proportion

times one minus itself, divided by the sample size.

And coming up in lecture seven, we'll also

show how to do this with the incidence rate.

So, onward and upward to the next set of lectures where we

actually create these confidence intervals and take this to the next level.
