0:01
So in this section, I will show the results for some computer simulations.
And these will help us understand the idea of the sampling distribution.
These demonstrations will show the resulting
distributions of sample means across multiple
random samples of the same size taken from the same theoretical population.
These simulations are a tool to empirically
demonstrate the difficult concept of a theoretical sampling
distribution of a sample statistic. And this will get us started on that idea.
0:44
So upon completion of this lecture section, you should be able to
describe the sampling distribution of a sample mean in terms of its composition.
We've already defined a sampling distribution, but hopefully this
will reinforce what it means with regards to sample means.
You should also be able to comment on, or list, some characteristics
of the sampling distributions of sample means that we've
demonstrated empirically with the simulations in this lecture,
including the general shape of the distribution of sample means,
where these distributions are centered (the average of the
sample means in a sampling distribution),
and the variability of the distributions and its relationship to
the size of the samples that each mean in the distribution is based upon.
1:31
So let's look at the example here.
We have a theoretical population we want to sample from.
I created this with the computer.
It's height measurements for adults greater than or
equal to 18 years old. Pretend
we're doing research and we can only take
a sample to try to understand what's going on.
Well, I know the truth here, so for simulation
purposes I took two samples, one of size 50 and
one of size 100.
So let's look at the observations in these.
So in the sample of size 50 here, this
is the distribution of individual heights among these 50 people.
It's only 50 points,
but we get some evidence that the population values
we're sampling from are perhaps somewhat symmetric
and maybe a little bit bell shaped.
That might be a stretch with this.
And we have a sample mean of 166.9 centimeters.
So that would be our best guess for the true mean height of all
adults greater than or equal to 18 years, based simply on this sample of 50.
But since I have the population behind the scenes, we can take another random sample,
this time of 100 people. Here is the distribution of
the 100 heights, and it's a little more fleshed out than the one with 50.
We get a little more empirical evidence of
a roughly symmetric, perhaps somewhat bell-shaped distribution of heights.
The mean of this sample is 161.1 centimeters, and it
differs slightly from the estimate we had from the other sample.
So with these two samples, which we never have the luxury of having in
real life research, we get some sense that the population distribution of heights is
perhaps somewhat symmetric, and centered around the mean
of somewhere on the order of 160 something.
That's all we've got.
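As a rough sketch of what drawing these two samples could look like in code: the block below assumes the behind-the-scenes population is normal with mean 167 cm and standard deviation 2.5 cm (the values revealed later in this lecture; the normal shape itself is an assumption). The seed and the helper name draw_heights are purely illustrative, so the printed means will not match the slide values.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed, chosen arbitrarily, so reruns match

def draw_heights(n, mu=167.0, sigma=2.5):
    """Draw n heights (cm) from the assumed Normal(mu, sigma) population."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

sample_50 = draw_heights(50)
sample_100 = draw_heights(100)

mean_50 = statistics.fmean(sample_50)
mean_100 = statistics.fmean(sample_100)

# Each sample mean is one estimate of the population mean; the two differ.
print(f"n=50:  mean = {mean_50:.1f} cm")
print(f"n=100: mean = {mean_100:.1f} cm")
```

Each run with a different seed would give a different pair of estimates, which is exactly the sample-to-sample variability the rest of the lecture explores.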
So now, I use the computer to repeatedly
draw samples from this population of adults, compute
the mean for each sample and then plot
them in a histogram to estimate the sampling distribution.
And, I did this for samples of different sizes.
3:39
So, let's look.
With this first simulation here, what I did was draw 1000 samples,
each with 20 observations, from this behind-the-scenes
population, and I computed the mean for each of these 1000 samples.
The points in this histogram are
not individual people's measurements; each point in this
histogram is a sample mean from a sample of size 20.
So this histogram has 1000 sample means
estimated from 1000 random samples of size 20.
This is an estimate of the theoretical sampling distribution of sample
means from samples of size 20 from this population of adults.
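The procedure just described can be sketched in a few lines. This is a minimal illustration, not the code used for the slides: it assumes a normal population with mean 167 cm and standard deviation 2.5 cm (the values given later in the lecture), and the seed, bin width, and function name are arbitrary choices. It prints a crude text histogram in which every entry is a sample mean, not an individual height.

```python
import math
import random
import statistics
from collections import Counter

rng = random.Random(20)  # arbitrary fixed seed

def sample_mean(n, mu=167.0, sigma=2.5):
    """Mean of one random sample of size n from the assumed normal population."""
    return statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))

# 1000 samples, each of size 20, reduced to 1000 sample means.
means_20 = [sample_mean(20) for _ in range(1000)]

# Group the means into bins 0.25 cm wide, keyed by each bin's left edge.
bin_width = 0.25
counts = Counter(math.floor(m / bin_width) * bin_width for m in means_20)

for left_edge in sorted(counts):
    bar = "#" * (counts[left_edge] // 5)  # one '#' per 5 sample means
    print(f"{left_edge:7.2f} | {bar}")
```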
4:28
On this next slide, I've done the same thing, but I've increased
the size, the number of people in each sample that I've taken.
So I've taken 1,000 samples.
Each sample contains 50 persons.
And for each sample, I computed the sample mean.
So sample one had 50 people, and I computed its mean.
And then I plotted this mean in the histogram.
For sample two, I had 50 people.
4:58
But I didn't plot anything to do with the 50 measurements in there.
I just plotted the sample mean for
that sample and put that in this histogram.
So this histogram has 1000 x bars:
x bars of height, each based on 50 observations.
5:43
So, sorry, I can't seem to write that well.
There we go: n equals 150, sample one.
But I'm not showing you
the individual heights of the people in the sample.
I summarized it with the mean.
And the only information about this single sample that appears in the
histogram here is, in fact, its mean. And I did that 1000 times.
So we've got 1000 sample means in this histogram,
each based on not 50 people but 150 people.
6:13
So we have 1000 sample means, each based on 150 people, in this histogram.
You probably noticed something
going on, and now I want to put these distributions
of sample means side by side in box plots
to look at what the patterns are.
So what do you notice in this picture?
Well, you probably got a sense before, from watching those histograms go by.
Here are the 1000 sample means where each mean is based on 20 observations.
7:54
So the medians seem to be pretty much lined up. I can't draw
a straight line here, but the medians of these distributions are, in fact, lined up.
So these distributions have the same or very similar centers as measured by the
median, and the distributions look roughly symmetric,
so the median is close to the mean.
They have similar centers, but the variation in
the estimated means decreases as the
sample size gets larger.
The average value, the mean value,
is the same across the different distributions.
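The numbers a box plot is built from (the median and the quartiles) can be computed directly to check this pattern. The sketch below is illustrative only: it assumes the same normal population (mean 167 cm, standard deviation 2.5 cm, as given later in the lecture), and the seed and names are arbitrary.

```python
import random
import statistics

rng = random.Random(3)  # arbitrary fixed seed

def sample_means(size, n_samples=1000, mu=167.0, sigma=2.5):
    """Means of n_samples random samples of the given size (assumed normal population)."""
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(size))
        for _ in range(n_samples)
    ]

summaries = {}
for size in (20, 50, 150):
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(sample_means(size), n=4)
    summaries[size] = {"median": median, "iqr": q3 - q1}
    print(f"size={size:3d}  median={median:7.2f}  IQR={q3 - q1:.2f}")
```

The medians should land close to one another, while the interquartile range (the height of each box) should shrink as the sample size grows.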
8:28
So, let me tell you now the punch line. I actually simulated these data.
These samples were taken from a
population of heights where the true mean was 167 centimeters
and the standard deviation of the individual
height measurements was 2.5 centimeters.
So let's look at some numerical summaries of those pictures we just saw.
If you take the mean of the 1000 sample means
based on samples of size 20,
the mean of those 1000 sample means is 167, which is equal to the true mean.
The mean of those
9:12
sample means based on 50 (we took 1000 samples, each
based on 50 people) is 167.
And the mean of the sample means each based on 150 people is 167.
So what do I mean by mean of means?
Right, well we saw there was a
distribution, there was variability in those sample
mean estimates, but on average, those sample
means came in at 167 which happens to
be the true mean.
9:40
If we look at the variation of the sample means, we can see in all three
scenarios it's less than the variability in the
individual measurements, individual height values from our population.
And as we saw visually, it decreases the more information there is in each sample.
So, to tie this up, what are we showing empirically?
We're showing that the sample means, on average,
equal the true mean of the population from which the samples are taken.
But there's some variation in the estimates around that truth,
and that variation decreases the larger the sample each mean is based upon.
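A short sketch can reproduce this kind of numerical summary. It uses the population mean (167 cm) and standard deviation (2.5 cm) stated above; the normal shape of the population, the seed, and the variable names are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(4)  # arbitrary fixed seed
POP_MEAN, POP_SD = 167.0, 2.5  # population values stated in the lecture

def sample_means(size, n_samples=1000):
    """Means of n_samples random samples of the given size (assumed normal population)."""
    return [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(size))
        for _ in range(n_samples)
    ]

results = {}
for size in (20, 50, 150):
    means = sample_means(size)
    # Center and spread of the estimated sampling distribution.
    results[size] = (statistics.fmean(means), statistics.stdev(means))
    print(f"size={size:3d}  mean of means={results[size][0]:7.2f}  "
          f"SD of means={results[size][1]:.3f}  (SD of individuals={POP_SD})")
```

In each row the mean of the sample means should sit near 167, and the standard deviation of the sample means should be well below 2.5 and smaller for larger samples.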
10:18
Just FYI, this simulation is a great way to illustrate a
principle, and help us understand this definition of a sampling distribution.
But it's not something that we can do in real life.
In real life
we're only going to be able to take one sample
from each of the populations we're interested in studying.
That's generally the case.
The variation in the sample means that I've shown you depends on the size
of each sample and not on the number of samples that I've drawn in the simulation.
Just to illustrate this, I could have done the
same thing taking 5000 samples, each of size 20,
5000 samples, each of size 50, and 5000 samples, each of size 150,
instead of doing 1000 each time.
And if you look at the distribution of the sample means across these 5000 samples
for each sample size here, the distributions
are very similar to those that we saw,
11:23
those that we saw with 1000 means.
So the size of the simulation, the number of times I actually
sample, does not systematically affect
these distributions. What's fueling the differences
in variability that we're seeing is the size of each sample
that each mean is based upon in the graphics we see.
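This claim is easy to check with a sketch that varies both things at once: the size of each sample and the number of samples drawn. As before, the normal population (mean 167 cm, standard deviation 2.5 cm), the seed, and the names are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(5)  # arbitrary fixed seed

def sample_means(size, n_samples, mu=167.0, sigma=2.5):
    """Means of n_samples random samples of the given size (assumed normal population)."""
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(size))
        for _ in range(n_samples)
    ]

# Spread of the sample means for every (sample size, number of samples) pair.
spread = {}
for reps in (1000, 5000):
    for size in (20, 50, 150):
        spread[(size, reps)] = statistics.stdev(sample_means(size, reps))

for size in (20, 50, 150):
    print(f"size={size:3d}  SD of means with 1000 samples={spread[(size, 1000)]:.3f}  "
          f"with 5000 samples={spread[(size, 5000)]:.3f}")
```

Within a row (same sample size) the two standard deviations should be close; moving down the rows (larger samples) they should shrink.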
11:43
So this is important to note.
In real life research, researchers will only be taking one
sample from each population under study.
As such, if it were the number
of samples that determined the variability in
sample means, this would make research impossible.
So let's look at another example to try to flesh this idea out more.
Here's another population: hospitals in the US in 2011,
and their discharges for kidney and urinary infections.
So this is actually based on a database.
I'm using a large database behind the scenes as
my population, and I'm taking some samples from it to illustrate this principle.
So let's just say I was a researcher and I could only afford to study 50 hospitals.
And so I got a random sample from CMS, the Centers for Medicare and Medicaid
Services in the US, and this is what I got.
Here are my sample discharge
counts for the 50 hospitals randomly sampled.
You can see this distribution seems to be somewhat right skewed, and the mean in this
sample is 69.1: the average hospital, at least among these 50, discharged
69.1 persons for kidney and urinary infections in 2011.
If you look at sample B, which is based on
250 hospitals (I suppose another researcher could do a bigger study),
it has the same characteristics but is more
fleshed out than the distribution in sample A. Again,
in these graphics, this is the distribution of
the counts for the 250 individual hospitals in my sample.
Each point represents the number of patients
discharged from one hospital for urinary and kidney infections.
And the mean among these 250 that I've sampled is 71.7 discharges. So
now we have some sense from looking at these two samples, again a luxury
we wouldn't normally have, that the true distribution we're sampling from is
right-skewed and has an average somewhere in the high 60s or low 70s.
That's all
we can ascertain right now.
14:11
So now I'm going to repeat the exercise of sampling repeatedly for samples
of different sizes and looking at
that distribution of the resulting sample means.
So this graphic
here shows the estimated sampling distribution of sample means for
random samples of size 50 from this hospital discharge population.
So, again, this histogram no longer contains
individual hospital measurements; it contains means.
Each point in here is a mean from a sample of 50 hospitals.
In this case,
I got a little more adventurous and
decided to repeat the simulation 2000 times.
So we have 2000 x bars, each from a sample of 50 hospitals.
15:11
And so
again, we've got 2000 x bars
in this histogram, and each x bar is based on 250 hospitals.
So we have 2000 summary measures, each
summarizing the distribution of 250 hospital
discharge counts. Finally, we do this one more time.
Here what we have is the estimated sampling distribution of
sample means where the random samples now each contain 400 hospitals.
16:09
If you look carefully at this picture you can see what we saw before.
What do you see here?
These are the means
based on samples of size 50, and there are 2000 of them.
These are the means based on samples of size 250, and these are the means based on
samples of 400 hospitals each. So, what do we see here?
Well, we again see that the
variation in our sample mean estimates decreases the
more information each sample mean is based upon.
We also see that these distributions of sample mean
estimates have some outliers, but on the whole look pretty symmetric.
They seemed that way in the histogram presentation as well.
And finally, we see that the centers
of these distributions,
17:27
I'll say somewhat normal.
Even though, for the individual values in any one sample,
the distribution of the individual values was right skewed.
What else did we see? That the average, and roughly
the median, because these are roughly
symmetric distributions, of the 2000
sample mean values was consistent
across the three sample size scenarios:
50, 250, and 400. And then finally, we saw what we
saw before with the height data: the variability
in the 2000 sample means
decreased
when the size of the sample that
each sample mean was based on increased.
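The point about shape can be illustrated with a skewness coefficient, which is near 0 for a symmetric distribution and positive for a right-skewed one. The real hospital database is not available here, so this sketch substitutes a hypothetical right-skewed population: a gamma distribution tuned to a mean of 69.2 and a standard deviation of 58.4 (the values the lecture reveals shortly). The gamma choice, the seed, and all names are assumptions, and the simulated values are continuous rather than whole counts.

```python
import random
import statistics

rng = random.Random(6)  # arbitrary fixed seed
POP_MEAN, POP_SD = 69.2, 58.4         # population values stated in the lecture
SHAPE = (POP_MEAN / POP_SD) ** 2      # gamma shape that yields this mean and SD
SCALE = POP_SD ** 2 / POP_MEAN        # gamma scale that yields this mean and SD

def draw_count():
    """One hospital's discharge value from the hypothetical gamma population."""
    return rng.gammavariate(SHAPE, SCALE)

def skewness(values):
    """Moment-based skewness: about 0 if symmetric, positive if right skewed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3

# Skewness of individual hospital values.
individual_skew = skewness([draw_count() for _ in range(20000)])
print(f"individual values: skewness = {individual_skew:.2f}")

# Skewness of 2000 sample means at each sample size.
means_skew = {}
for size in (50, 250, 400):
    means = [statistics.fmean(draw_count() for _ in range(size)) for _ in range(2000)]
    means_skew[size] = skewness(means)
    print(f"means of {size:3d} hospitals: skewness = {means_skew[size]:.2f}")
```

The individual values should show clearly positive skewness, while the sample means should be much closer to symmetric.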
18:57
So now, we'll come clean about
the population that these data came from.
The true mean number of discharges in
this population was 69.2 discharges,
and the standard deviation of these discharge counts was 58.4.
So there was a lot of variation, and the population distribution was right skewed.
But let's look at the results,
some numerical summaries of the pictures we just looked at.
Regardless of the sampling distribution estimate we were looking at, whether
it's based on 2,000 means from 50 hospitals at a time,
250 hospitals, or 400, notice that the mean of our sample
means was consistently very close or equal to that underlying population truth.
19:47
Further notice
that the variation in these sample means, the
2000 means we had in each estimated distribution, in
all three cases, was substantially smaller than the
variation in the individual hospital-to-hospital counts.
Variation in the means was less than the variation in the individual values,
and it decreases with increasing sample size, which we already noted.
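The same kind of numerical summary can be sketched for a skewed population. Since the hospital database itself is not available here, the block below uses a hypothetical gamma distribution matched to the stated population mean (69.2) and standard deviation (58.4); that stand-in, the seed, and the names are assumptions, and the simulated values are continuous rather than whole counts.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary fixed seed
POP_MEAN, POP_SD = 69.2, 58.4         # population values stated in the lecture
SHAPE = (POP_MEAN / POP_SD) ** 2      # gamma shape that yields this mean and SD
SCALE = POP_SD ** 2 / POP_MEAN        # gamma scale that yields this mean and SD

def sample_means(size, n_samples=2000):
    """Means of n_samples random samples of the given size (hypothetical gamma population)."""
    return [
        statistics.fmean(rng.gammavariate(SHAPE, SCALE) for _ in range(size))
        for _ in range(n_samples)
    ]

hospital_results = {}
for size in (50, 250, 400):
    means = sample_means(size)
    hospital_results[size] = (statistics.fmean(means), statistics.stdev(means))
    print(f"size={size:3d}  mean of means={hospital_results[size][0]:6.2f}  "
          f"SD of means={hospital_results[size][1]:5.2f}  (SD of individuals={POP_SD})")
```

Even with a right-skewed population, the mean of the sample means should stay near 69.2, and the spread of the means should be far below 58.4 and shrink as the sample size increases.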
20:18
Theoretical sampling distributions for sample means, across random samples
of the same size, from the same population, can be
estimated by a computer simulation, and that's what we've done
here and we'll do it in the next lecture set.
Simulation's a very useful tool for helping explore the
properties of the sampling distribution, and for drawing it, basically.
If I tried to do this by hand, it would take forever.
21:19
On average, regardless of sample size, the
overall mean of these sample means (it's kind of a weird thing, but
these are just numbers, so we can average them even though each one
represents the average of a sample) is close to,
very close to really, the true mean.
21:50
That is the thing that our sample means are estimating,
whether it be the mean height for everyone in the population
or the mean discharge count for all hospitals in the population.
And finally we did see in both cases with very
different shapes for the individual data in any one sample.
In the first case it was roughly symmetrical for the heights.
For the second case it was
skewed for the individual hospital discharge counts.
That the distribution of the averages from samples across
22:39
is, and I'll put this in quotes, "normal,"
meaning approximately normal: roughly symmetric and bell-shaped.
So ultimately, we can't do these simulations in real research.
We can only take one sample in real life.
So estimating the characteristics
of a sampling distribution will have to be
done using the results from a single random sample from a population.
In lecture section
D, the properties that we've been demonstrating empirically via
the simulations in this lecture set will be generalized.
We'll see we don't have to take multiple samples, either with the computer or by
hand, to understand how our statistic would behave
across multiple random samples of the same size.
There's some machinery that will
formalize the patterns we've seen thus far.