A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

207 ratings

From this lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section, I will show the results for some computer simulations.

And these will help us understand the idea of the sampling distribution.

These demonstrations will show the resulting

distributions of sample means across multiple

random samples of the same size taken from the same theoretical population.

These simulations are a tool to empirically

demonstrate the difficult concept of a theoretical sampling

distribution of a sample statistic. And this will get us started on that idea.

Okay now let's build on some of what we

did before, and we're going to look at some examples

of a sampling distribution of a sample mean, using computer

simulations and I'll explain what I mean by that, shortly.

So upon completion of this lecture section, you should be able to

describe the sampling distribution of a sample mean in terms of its composition.

We've already defined a sampling distribution, but hopefully this

will reinforce what it means with regards to sample means.

And then also be able to comment

on some characteristics, or list some characteristics, of

the sampling distributions of sample means that we've

demonstrated empirically by the simulations in this lecture,

including the general shape of the distribution of sample means,

where these things are centered, the average of the

sample means in a sampling distribution,

and then the variability of the distributions and its relationship to

the size of the samples each mean in the distribution is based upon.

So let's look at an example here.

We have a theoretical population we want to sample from.

I created this with the computer.

It's height measurements for adults aged 18 years

or older. Pretend, you know,

we're doing research and we can only take

a sample to try and understand what's going on.

Well, I know the truth here, so for simulation

purposes I took two samples, one of size 50 and

one of size 100.

So let's look at the observations in these.

So in the sample of size 50 here, this

is the distribution of individual heights amongst these 50 people.

I mean, you know, it's only 50 points,

but we get some evidence that the population values

that we're sampling from are perhaps somewhat symmetric

and maybe a little bit bell shaped.

That might be a stretch with this.

And we have a sample mean of 166.9 centimeters.

So that, that would be our best guess for the true mean height of all

adults greater than or equal to 18 years based simply on this sample of 50.

But since I have the population behind the scenes we can take another random sample.

This time of 100 people, and here is the distribution of

100 heights, and it's a little more fleshed out than the one with 50.

We get a little more empirical evidence of maybe

a roughly symmetric, perhaps somewhat bell shaped distribution of heights.

The mean of this sample is 161.1 centimeters, and it

differs slightly from the estimate we had from the other sample.

So with these two samples, which we never have the luxury of having in

real life research, we get some sense that the population distribution of heights is

perhaps somewhat symmetric, and centered around the mean

of somewhere on the order of 160 something.

That's all we've got.

So now, I use the computer to repeatedly

draw samples from this population of adults, compute

the mean for each sample and then plot

them in a histogram to estimate the sampling distribution.

And, I did this for samples of different sizes.

So, let's look.

With this first simulation here, what I did was I drew 1000 samples.

1000 samples, each with 20 observations, from this behind-the-scenes

population, and what I did is I computed the mean for each of these 1000 samples.

And in this histogram here, these are

not individual people measurements, each point in this

histogram is a sample mean from a sample of size 20.

So this histogram here has 1000 sample means

estimated from 1000 random samples of size 20.

This is an estimate of the theoretical sampling distribution of sample

means from samples of size 20 from this population of adults.
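The procedure just described can be sketched in a few lines of Python. This is only an illustration, not the code used in the lecture: the function name `simulate_sample_means` is hypothetical, and the population is assumed to be normal with mean 167 cm and standard deviation 2.5 cm, which are the values the lecture reveals later for the height population.

```python
import random
import statistics


def simulate_sample_means(n, reps, mu=167.0, sigma=2.5, seed=0):
    """Draw `reps` random samples of size `n` and return the mean of each.

    Assumption: heights are drawn from a normal population (mu, sigma).
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]  # one sample of n heights
        means.append(statistics.fmean(sample))             # keep only its mean
    return means


# 1000 samples, each with 20 observations
means_20 = simulate_sample_means(n=20, reps=1000)
print(len(means_20))  # 1000 sample means, not 1000 individual heights
print(round(min(means_20), 1), round(max(means_20), 1))
```

A histogram of `means_20` would then play the role of the estimated sampling distribution for samples of size 20.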

On this next slide, I've done the same thing, but I've increased

the size, the number of people in each sample that I've taken.

So I've taken 1,000 samples.

Each sample contains 50 persons.

And for each sample, I computed the sample mean.

So, sample one had 50 people, and I computed its mean.

And then I plotted this mean in the histogram.

For sample two, I had 50 people.

But I didn't plot anything to do with the 50 measurements in there.

I just plotted the sample mean for

that sample and put that in this histogram.

And, so in this histogram, this histogram has 1000 x bars.

Essentially x bars of height each based on 50 observations.

Finally, I did this one more time, but

each sample I took now had 150 people in it.

So I took sample one,

had 150 people in it and the only information I'm presenting

about it in this graphic is, is in fact the mean.

So, sorry I can't seem to write that well.

There we go, n equals 150. sample one.

But I just, I, I'm not showing you

the individual heights of the people in the sample.

I summarized it with the mean.

And the only information about this single sample that appears in the

histogram here, is, in fact, its mean. And I did that 1000 times.

So we've got 1000 sample means in this histogram,

each based on 150 people.

So now, let's look at, you probably noticed something

going on and now I want to put these distributions

of sample means side by side in box plots

to sort of look at what the patterns here are.

So, so what do you notice in this picture?

Well, you probably got a sense of it before, watching those histograms go by.

But here's the 1000 sample means, where each mean is based on 20 observations.

Here's the box plot of the distribution of 1000

sample means when each is based on 50.

And here's a box plot when each is based on 150.

So what do we notice here?

Well the first thing you may notice is what?

That the variation

in the sample mean estimates, the variation decreases

the more information each mean is based upon.

And that probably makes some sense to you, common sense wise.

If I ask you, would you prefer a mean estimated from a random

sample of 20 people or from 150 people, your intuition would probably say 150.

And why do you think that is?

Well think about it carefully but we've talked about the influence of individual

points on a sample mean value and what happens when the sample size increases.

Each individual point has less of an influence, and that

tends to make the mean more stable across different samples.
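A small arithmetic sketch makes the point about individual influence concrete. The numbers are made up for illustration (a "typical" height of 167 cm and one unusual value of 187 cm), and the function name is hypothetical.

```python
import statistics


def mean_shift_from_one_outlier(n, typical=167.0, outlier=187.0):
    """Return how far the sample mean moves when one of n values is an outlier."""
    baseline = [typical] * n
    with_outlier = [outlier] + [typical] * (n - 1)
    return statistics.fmean(with_outlier) - statistics.fmean(baseline)


for n in (20, 50, 150):
    print(n, round(mean_shift_from_one_outlier(n), 3))
# 20 1.0
# 50 0.4
# 150 0.133
```

The same unusual person moves the mean by a full centimeter in a sample of 20, but by only about a tenth of a centimeter in a sample of 150.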

What else do you notice here?

Well, look at where the centers of these distributions are, at least the medians.

So the medians seem to be pretty much lined up, so I can't draw

a straight line here, but the medians, in fact, of these distributions are lined up.

So these distributions have the same or very similar centers as measured by the

median and the distributions look roughly symmetric

so the median is close to the mean.

They have similar centers, but the variation in

the estimated means is decreasing the larger the

sample size.

But the average value, the mean value,

is the same across the different distributions.
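The side-by-side box plot comparison can be mimicked numerically by computing the median and interquartile range of each set of sample means. This is a sketch under the same assumptions as any stand-in simulation: a normal height population with mean 167 cm and standard deviation 2.5 cm (the values the lecture gives for this population), and hypothetical helper names.

```python
import random
import statistics


def sample_means(n, reps=1000, mu=167.0, sigma=2.5, seed=1):
    """Means of `reps` random samples of size `n` from an assumed normal population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]


def box_summary(values):
    """Median and interquartile range, the two features read off a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"median": median, "iqr": q3 - q1}


summaries = {n: box_summary(sample_means(n)) for n in (20, 50, 150)}
for n, s in summaries.items():
    print(n, round(s["median"], 2), round(s["iqr"], 2))
```

The medians should land at nearly the same value for all three sample sizes, while the interquartile range shrinks as the sample size grows.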

So, let me tell you now the punch line. I actually simulated these data.

These samples of data were taken from a population

distribution of heights where the true mean was 167 centimeters.

And the standard deviation of the individual

height measurements was 2.5 centimeters.

So let's look at some numerical summary of those pictures we just saw.

If you took the mean of the 1000 sample means

based on samples of size 20 each,

the mean of those 1000 sample means is 167, which is actually equal to the true mean.

The mean of those

sample means based on 50, where we took 1000 samples each

based on 50 people, the mean of those estimates is 167.

And the mean of the sample means each based on 150 people is 167.

So what do I mean by mean of means?

Right, well we saw there was a

distribution, there was variability in those sample

mean estimates, but on average, those sample

means came in at 167 which happens to

be the true mean.

If we look at the variation of the sample means, we can see in all three

scenarios it's less than the variability in the

individual measurements, individual height values from our population.

And as we saw visually, it decreases the more information is in each sample.

So what are we, sort of tying this up, showing empirically?

We're showing that the sample means, on average,

equal the true mean of the population from which the samples are taken.

But there's some variation in the estimates around that truth.

And that variation decreases, the larger the sample each mean is based upon.
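These numerical summaries can be reproduced approximately with a short simulation. The sketch below assumes a normal population with the stated mean (167 cm) and standard deviation (2.5 cm); the helper name is hypothetical. The last printed column, sigma / sqrt(n), is a standard result that this section has not introduced and is included only as a reference point.

```python
import math
import random
import statistics

MU, SIGMA = 167.0, 2.5  # population values stated in the lecture


def sample_means(n, reps=1000, seed=2):
    """Means of `reps` random samples of size `n` (assumed normal population)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(reps)
    ]


results = {}
for n in (20, 50, 150):
    means = sample_means(n)
    # (mean of the sample means, standard deviation of the sample means)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(n, round(results[n][0], 2), round(results[n][1], 3),
          round(SIGMA / math.sqrt(n), 3))
```

In each row the mean of the sample means should sit very close to 167, and the standard deviation of the sample means should be well below 2.5 and smaller for larger n.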

Just FYI, this simulation is a great way to illustrate a

principle, and help us understand this definition of a sampling distribution.

But it's not something that we can do in real life.

In real life

we're only going to be able to take one sample

from each of the populations we're interested in studying.

That's generally the case.

The variation in the sample means that I've shown you depends on the size

of each sample and not the number of samples that I've taken in the simulation.

So just to illustrate this, I could have done the

same thing and taken 5000 samples, each of size 20.

And 5000 samples, each of size 50. And 5000 samples of 150.

Instead of doing 1000 each time.

And if you look at the distribution of the sample means across these 5000 samples

with each sample size here, the distributions

are very similar to those that we saw

with 1000 means.

So the size of the simulation, the number of times I actually

sample, does not systematically affect

these distributions. What's fueling the differences

in variability that we're seeing is the size of each sample

that each mean is based upon in the graphics we see.
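One way to check this claim is to rerun the simulation with 1000 and with 5000 repetitions and compare the spread of the sample means. This is a sketch, assuming a normal height population (mean 167 cm, standard deviation 2.5 cm) and a hypothetical helper name.

```python
import random
import statistics


def sd_of_sample_means(n, reps, mu=167.0, sigma=2.5, seed=3):
    """Standard deviation of `reps` sample means, each from a sample of size `n`."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(means)


spread = {}
for n in (20, 50, 150):
    for reps in (1000, 5000):
        spread[(n, reps)] = sd_of_sample_means(n, reps)
    # same n, different number of simulated samples
    print(n, round(spread[(n, 1000)], 3), round(spread[(n, 5000)], 3))
```

Within a row the two values should be nearly the same; the change that matters is from row to row, as the size of each sample increases.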

So this is important to note.

In real life research, researchers will only be taking one

sample from each population under study.

As such, if it was the number

of samples that determined the variability in

sample means, this would make research impossible.

So let's look at another example just to try, try and flesh this idea out more.

Here's another population: hospitals in the US in 2011,

and their discharges for kidney and urinary infections.

So this is actually based on a database.

And I'm actually using it behind the scenes, a large database to be

my population, and I'm taking some samples from it, to illustrate this principle.

So let's just say I was a researcher and I could only afford to study 50 hospitals.

And so I got a random sample from CMS, the Centers for Medicare and Medicaid

Services in the US, and this is what I got.

Here are my sample discharge

counts for the 50 hospitals randomly sampled.

You can see this distribution seems to be somewhat right skewed, and the mean in this

sample is 69.1: on average, a hospital among these 50 discharged

69.1 persons for kidney and urinary infections in 2011.

If you look at sample B, which is based on

250 hospitals, I suppose another researcher could do a bigger study.

It has the same characteristics but is more

fleshed out than the distribution in sample A, and again,

in these graphics here, this is the distribution of

the counts for the 250 individual hospitals in my sample.

Each point in here represents the number of patients

discharged from one hospital for urinary and kidney infections.

And the mean amongst these 250 that I've sampled is 71.7 discharges, so

now we have some sense from looking at these two samples, again, a luxury

we wouldn't normally have, that the true distribution that we're sampling from is

right-skewed and has an average somewhere on the order of high 60s low 70s.

That's all

we can ascertain right now.

Remember, the distribution of individual values in

any single sample from a population should

imperfectly mimic the distribution of individual values

in the population regardless of the sample size.

So now I'm going to repeat the exercise of sampling repeatedly for samples

of different sizes and looking at

that distribution of the resulting sample means.

So this graphic

here shows the estimated sampling distribution of sample means for

random samples of size 50 from this hospital discharge population.

So, again, now, this histogram no longer contains

individual hospital measurements, but it contains the mean.

Each point in here is a mean, is a mean from a sample of 50 hospitals.

So, in this case,

I got a little more adventurous and

decided to repeat the simulation 2000 times.

So, we have 2000 x bars, each from a sample of 50 hospitals.

And this shows the variation in those

sample mean estimates across the 2000 samples.

Here we're going to do this again, but we're going to

now do this where our samples contain 250 hospitals each.

And so

again, we've got 2000 x bars

in this histogram, and each x bar is based on 250 hospitals.

So we have 2000 summary measures, each

summarizing the distribution of 250 hospital

discharge counts. Finally, we do this one more time.

And here what we have is the estimated sampling distribution of

sample means where the random samples now each contain 400 hospitals.

So, again, in this distribution, we have 2000 x bars.

Each one is the mean number of discharges for a sample of 400 hospitals.

So now let's, let's put these all on one graphic and

summarize the results. So what do we see here?

Now this looks very similar to what we saw before.

If you look carefully at this picture you can see what we saw before.

What do you see here?

So, these are the means

based on samples of size 50, and there are 2000 of them.

These are the means based on samples of size 250, and these are the means based on

samples of 400 hospitals each. So, what do we see here?

Well, we again see that the

variation in our sample mean estimates decreases, the

more information your sample mean is based upon.

We also see that these distributions of sample mean

estimates, well, we have some outliers, but on the whole they look pretty symmetric.

They seemed that way in the histogram presentation as well.

And finally, we see that the, the centers

of these distributions,

and I'm not, again, I'm not doing a good job of straight lining here, but they tend

to line up. So, the results show us what?

That the distribution of the sample means,

regardless of the sample size, looked somewhat,

I'll say somewhat normal.

Even though, for the individual values in any one sample,

the distribution of the individual values was right skewed.
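The contrast between skewed individual values and roughly symmetric sample means can be illustrated with a stand-in population. The real data come from a hospital database that is not reproduced here, so the sketch below assumes a gamma distribution whose mean (69.2) and standard deviation (58.4) match the population values the lecture reports later. The gamma choice and the helper names are assumptions, and real discharge counts are whole numbers while gamma draws are not.

```python
import random
import statistics

MEAN, SD = 69.2, 58.4        # population values reported in the lecture
SHAPE = (MEAN / SD) ** 2     # gamma shape giving that mean and SD (assumed model)
SCALE = SD ** 2 / MEAN       # gamma scale


def skewness(values):
    """Average cubed z-score: about 0 for symmetric data, positive for right skew."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(4)
individuals = [rng.gammavariate(SHAPE, SCALE) for _ in range(20000)]
means_50 = [
    statistics.fmean([rng.gammavariate(SHAPE, SCALE) for _ in range(50)])
    for _ in range(2000)
]

skew_individuals = skewness(individuals)
skew_means = skewness(means_50)
print(round(skew_individuals, 2), round(skew_means, 2))
```

The first number should be clearly positive (right skewed individual values), while the second should be much closer to zero (roughly symmetric sample means).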

What else did we see? That the average, and roughly

the median, because these are roughly

symmetric distributions, of the 2000

sample mean values was consistent

across the three sample size scenarios,

50, 250, and 400. And then finally, we saw what we

saw before with the height data: the variability

in the 2,000 sample means

decreased, went down,

when the size of the sample that each

sample mean was based on increased.

So now, we'll come clean about what the

data looked like, the population that this came from.

The true mean, number of discharges in

this population, the true mean was 69.2 discharges.

And the standard deviation of these discharge counts was 58.4.

So there was a lot of variation and the population distribution was right skewed.

But let's look at the results.

Some numerical summaries of the pictures we just looked at.

Regardless of the sampling distribution estimate we were looking at, whether

it's based on 2,000 means from 50 hospitals at a time,

250 hospitals or 400, notice that the mean of our sample

means was consistently very close or equal to that underlying population truth.

Further notice

that the variation in these sample means, the

2000 means we had in each estimated distribution, in

all three cases, was substantially smaller than the

variation in, in the individual hospital to hospital counts.

Variation in the means was less than the variation in the individual values.

And it decreases as the sample size increases, which we already noted.
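A comparable set of numerical summaries can be generated for a skewed population. This is a sketch only: the hospital database itself is not available here, so a gamma distribution with the reported population mean (69.2) and standard deviation (58.4) stands in for it, and the helper name is hypothetical.

```python
import random
import statistics

MEAN, SD = 69.2, 58.4        # population values reported in the lecture
SHAPE = (MEAN / SD) ** 2     # assumed gamma stand-in for the real database
SCALE = SD ** 2 / MEAN


def sample_means(n, reps=2000, seed=5):
    """Means of `reps` random samples of `n` hospitals from the stand-in population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gammavariate(SHAPE, SCALE) for _ in range(n)])
        for _ in range(reps)
    ]


summary = {}
for n in (50, 250, 400):
    means = sample_means(n)
    # (mean of the 2000 sample means, standard deviation of the 2000 sample means)
    summary[n] = (statistics.fmean(means), statistics.stdev(means))
    print(n, round(summary[n][0], 1), round(summary[n][1], 2))
```

For every sample size the mean of the sample means should be close to 69.2, and the standard deviation of the sample means should be far below 58.4 and smaller for larger samples.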

So let's summarize what we've done.

Theoretical sampling distributions for sample means, across random samples

of the same size, from the same population, can be

estimated by a computer simulation, and that's what we've done

here and we'll do it in the next lecture set.

Simulation's a very useful tool for helping explore the

properties in the sampling distribution, and drawing them, basically.

If I tried to do this by hand, it would take forever.

Some properties observed with the two examples in this lecture, which will

be generalized to all such cases shortly, include the things we just noted.

Yeah.

The variation in sample means from

sample to sample, the variation in sample means across samples,

decreases as the amount of information each

mean is based on, the sample size, increases.

On average, regardless of sample size, the

means, so it's kind of a weird thing, but the overall mean of these

sample means (these are just numbers, so we can average them even though each one

represents an average of a sample) is close to,

oh, very close to really, the true

underlying population mean,

the thing that our sample means are estimating,

whether it be the mean height for everyone in the population

or the mean discharge count for all hospitals in the population.

And finally, we did see in both cases, with very

different shapes for the individual data in any one sample

(in the first case it was roughly symmetric for the heights,

and in the second case it was

skewed for the individual hospital discharge counts),

that the distribution of the averages across

multiple samples, the shape of the distribution of the x bars

across multiple samples,

is, and I'll put it in quotes, "normal,"

to mean approximately normal, roughly symmetric and bell-shaped.

So what we're going to see is ultimately, we can't do these simulations.

We can only take one sample in real life.

So ultimately, estimating the characteristics

of a sampling distribution will be

done using the results from a single random sample from a population.

In lecture section

D, these properties that we've been demonstrating empirically via

the simulations in this lecture set, will be generalized.

We'll see we don't have to take multiple samples either with the computer or, or by

hand to understand how our statistic would behave

across multiple random samples of the same size.

There's some machinery that will just

formalize the patterns we've seen thus far.

In the next section,

section C, we'll show that these

results are generalizable to proportion summaries

on binary data and to incidence rate summaries on time-to-event data as well.