A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

209 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Actually I have a special lecture section dedicated to this topic on what we can expect to

happen to data values in a representative sample with increasing sample size.

And I do this to dispel some common myths that people

have about the role of sample size in statistics.

So if you don't have these myths after you hear them,

forget them because I don't want you to actually start to think they're real.

But there's two common myths that people have about the role

of sample size with regards to a sample distribution of data.

The first one is the larger our sample,

the more symmetric and bell shaped the distribution of

the sample values will be - the more normalesque.

Now, we haven't formally defined a normal distribution

and we will do so in the next lecture set,

but this is not the case.

Regardless of the size of the sample,

the values in our sample should mimic the distribution of values in

the population from which the sample was taken if the sample is truly representative.

The second myth that people tend to have is that

increased sample size results in lesser variation in the individual values in our sample.

And there's some reasons for this myth becoming popular;

it's because people are getting this confused with another quantity,

another measure of variation that we'll deal with later in the course.

But suffice to say,

increasing the sample size does not

systematically decrease the sample standard deviation.

Again, you have to think;

if we're taking a representative sample,

regardless of what the size is,

we hope it will reflect

imperfectly the underlying characteristics of the population at large.

And so regardless of the sample size we take,

the distribution of values in our samples should mimic that imperfectly and

the estimated standard deviation should be

an imperfect estimate of one underlying quantity,

the variation of individual values in the population from which the sample was taken.
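Both myths can be checked with a small simulation. The sketch below assumes a hypothetical right-skewed population (exponential with mean 4, so its standard deviation is also 4). These values are chosen only for illustration and are not taken from the lecture data. It draws samples of increasing size and reports the mean, median and sample standard deviation of each.

```python
import random
import statistics

# Hypothetical right-skewed population: exponential with mean 4 (so SD is also 4).
# These numbers are illustrative assumptions, not values from the lecture.
POP_MEAN = 4.0

def draw_sample(n, rng):
    """Draw n independent values from the assumed population."""
    return [rng.expovariate(1.0 / POP_MEAN) for _ in range(n)]

def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

rng = random.Random(2024)  # fixed seed so the run is repeatable
summaries = {}
for n in (200, 2_000, 20_000):
    summaries[n] = summarize(draw_sample(n, rng))
    mean, median, sd = summaries[n]
    print(f"n={n:>6}  mean={mean:5.2f}  median={median:5.2f}  SD={sd:5.2f}")
```

If the myths were true, the gap between mean and median would close (the sample would become symmetric) and the SD would shrink as n grows. In a run of this sketch, the mean should stay above the median and the SD should stay near the population value at every size.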

So in this section, we're going to actually talk about the influence that

sample size has on sample-based estimates.

And this is an important concept to cover and reiterate because this is the big source of

confusion when most people take statistics for the first time.

So upon completion of this lecture section,

I hope you'll be able to understand that a random sample taken from

some larger population will imperfectly mimic

the characteristics of the larger population to some degree,

regardless of the size of the sample.

Understand that the distribution of values in a random sample should reflect

the distribution of values in the population from which it is taken, albeit imperfectly.

Understand and explain that sample size does not systematically decrease or

increase sample summary statistic estimates.

And begin to understand that while sample size does not systematically

decrease or increase sample summary statistics,

the estimates become less variable from sample to sample with larger samples.

That's an idea we'll spend a fair amount of time on

this term, and we'll just start to get to it right here, right now.

So let's talk about what happens when we add data to a random sample,

take more data from the same population.

And I've relied on some simulation to do

this to get the point across for this first example.

Here is our familiar dataset,

the systolic blood pressure of 113 randomly selected men.

And what I am going to show you here is what I showed you in the last two sections.

Here's the sample mean and standard deviation.

Here's the histogram and I know I had said before that making the bars too

small is maybe too detailed for a given dataset,

but I'm going to do that here so I can show you

what happens to the distribution when we add more information.

So this is the distribution of these 113 sample measurements.

What I'm showing here is the original 113 measurements we

just had plus 100 more measurements taken randomly from the same population.

So now our total sample size is 213.

And look at this, the fundamental shape of

the distribution looks very similar - symmetric, reasonably bell-shaped.

Some of the peaks that we had before in terms of

higher percentages have attenuated a little bit as we've filled this out.

I've kept everything on the same scale,

but the fundamental shape of the distribution looks the same.

The sample mean of this larger sample is smaller than the sample mean we saw before,

but the sample standard deviation is slightly larger.

Now let's start with

those 113 measurements we originally had and by the magic of computer simulation,

I'll add 887 more blood pressure measurements so that

our total sample size is now 1,000 men and here's that distribution.

And you can see it looks more filled in than the previous two sets,

but the fundamental characteristics are the same.

It's symmetric - roughly symmetric and roughly bell-shaped.

So we get a better characterization of it,

but it's similar to what we saw before.

So it's not like the fundamental characteristics of

the distribution appear to be changing the more data we add,

it still takes on the same shape.

And if you look at the sample-based numerical summaries,

the mean here is 123.1,

which is similar to the previous two sample means,

but is not exactly the same in value, and

the sample standard deviation of these 1,000 values is 12.7 millimeters of mercury,

which is slightly smaller than the other two samples,

but on the same order of magnitude.

So let me first

clarify what I was trying to get at in

those previous slides and we'll look at some more examples.

But when we took larger samples from the same population, the histogram,

the picture we had filled out a bit

- got more detailed in terms of the concentration of bars,

but the fundamental characteristics of the shape did not change.

And in terms of the summary statistics - I mean,

I'll show you the results.

What I did here in the table is I randomly took

five samples from this theoretical population of men.

This was through the - again,

computer simulation to illustrate a point - I randomly took five samples of size 113,

randomly took another five samples with

213 observations and randomly took another five samples with 1,000.

And what I did here is I'm just showing

a table with the sample means from each of those sets.

So the first sample from each size,

I'll just call it sample one - there's no connection or meaning to this number,

this just means this is the first sample of 113 I

took and its mean was 124.4 millimeters of mercury.

For the sample of 213,

the mean was slightly lower, 123.5.

And for the sample of 1,000,

it was even lower at 122.6.

So you might look at this and go well, clearly,

the larger your sample becomes,

the systematically smaller the sample mean becomes.

But let's look at the next set of samples.

Here, the sample of 113,

its mean was 121.7 and the mean for the sample of

213 was 123.4 and the mean from the random sample of 1,000 is 123.6,

so we have the opposite phenomenon here;

as sample size increases,

the mean estimate seems to increase.

But what I'm hoping to convince you of in this lecture - and if you

look at these other sets here carefully,

you'll see it - is that there is no real pattern to that estimate.

They all sort of circulate around a similar center,

but there's variation across the samples of

different size and there's variation across samples of the same size.

If you look at these five estimates for samples of 113,

some are larger than others.

The point here is that the sample mean estimate is not

systematically affected - increased or decreased - by the sample size.

Sometimes larger samples have larger means than smaller samples,

sometimes they have smaller means depending on the sample we got.

What I want you to start thinking about as well, though,

is if we look at the variation of the estimates within samples from the same size,

we'll see that the variability in the estimated means tends to get lesser

the more information goes into each mean, and this

is just a start on something we'll spend a fair amount of time with in a little bit.

But it's not so much that

the values of the sample means go up or down based on the sample size,

but their precision will get better and there'll be less variation

between them across samples of similar size the larger the sample is.
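A table like the one just described can be rebuilt by simulation. The sketch below is not the lecture's actual simulation: it assumes a roughly normal population of blood pressures with mean 123 and standard deviation 13 mmHg (illustrative guesses) and draws five fresh samples at each of the three sizes.

```python
import random
import statistics

# Assumed population of systolic blood pressures (mmHg): roughly normal,
# mean 123 and SD 13. These parameters are illustrative guesses only.
POP_MEAN, POP_SD = 123.0, 13.0
SIZES = (113, 213, 1000)
ROUNDS = 5

rng = random.Random(7)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of one fresh random sample of size n from the assumed population."""
    return statistics.mean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))

# table[r][j] = mean of the r-th sample of size SIZES[j]
table = [[sample_mean(n) for n in SIZES] for _ in range(ROUNDS)]

print("sample  " + "  ".join(f"n={n:<5}" for n in SIZES))
for r, row in enumerate(table, start=1):
    print(f"{r:<6}  " + "  ".join(f"{m:7.1f}" for m in row))

# Range (max - min) of the five means at each sample size.
ranges = [max(col) - min(col) for col in zip(*table)]
print("range   " + "  ".join(f"{w:7.1f}" for w in ranges))
```

Reading across any row, the means should not move in a consistent direction as the sample size grows. The last row gives a rough sense of how much the five means disagree at each size. With only five samples per size, that range need not shrink in every run, although it tends to.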

Similarly, here are the sample standard deviations from five random samples of

men at each size: 113 men in each sample, then 213 and 1,000.

So let's look at the first.

So this - again, to reiterate,

I just randomly took a sample of 113 by simulation.

The sample standard deviation estimate for that was 12.9.

I then randomly took a sample of 213 men from

this theoretical population and

the sample standard deviation estimate was 13.1

millimeters of mercury on these 213 values.

And then I randomly took a sample of 1,000 men from this theoretical population and

a sample standard deviation of those 1,000 values was 12.4 millimeters of mercury.

So you see from the first sample of 113 to the sample of 213,

our sample standard deviation estimate went up a little bit and then it went down.

Well, that's just by chance.

If we look at the next set of values,

we'll see something slightly different.

The next sample of 113 I took,

the standard deviation of the 113 values is 14.4 millimeters of mercury.

But then I took another random sample of

213 values and its estimated sample standard deviation was smaller,

it came in at 11.6.

But then I took another random sample of 1,000 values and

its sample standard deviation went back up to 13.3.

So again, the point here is that we can't predict.

If I said I'm going to give you

the results from a sample of 113 men and a sample of 1,000 men,

which one will have the larger standard deviation estimate, you can't tell me.

There's no systematic connection between the value and the sample size.

However, and this is what we're getting to later on,

if you look at the variation in the estimates across

the different random samples of 113 men,

we get different estimated standard deviations.

They're all estimating the same unknown,

underlying population quantity, but on imperfect subsamples.

We'll see that if you look at the same estimates from samples of size 1,000,

those estimates based on the larger samples are closer to each other.

And we'll formalize that very shortly.
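Five samples per size is too few to see this reliably, so here is a sketch that repeats the exercise many times. It again assumes a hypothetical normal population (mean 123, SD 13 mmHg, not values confirmed by the lecture) and uses 300 simulated samples per size, an arbitrary choice.

```python
import random
import statistics

POP_MEAN, POP_SD = 123.0, 13.0   # assumed population values, for illustration only
REPS = 300                       # simulated samples per size (arbitrary choice)

rng = random.Random(11)  # fixed seed so the run is repeatable

def sample_sd(n):
    """Sample standard deviation of one fresh random sample of size n."""
    return statistics.stdev([rng.gauss(POP_MEAN, POP_SD) for _ in range(n)])

results = {}
for n in (113, 213, 1000):
    sds = [sample_sd(n) for _ in range(REPS)]
    # Typical value of the SD estimates, and how much they vary between samples.
    results[n] = (statistics.mean(sds), statistics.stdev(sds))
    typical, spread = results[n]
    print(f"n={n:>4}  average SD estimate={typical:5.2f}  spread of estimates={spread:4.2f}")
```

The first column should sit near the assumed population value at every size, while the second column should fall as the samples get larger: the estimates aim at the same target but agree with each other more closely.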

All right. Let's go back to another familiar example,

let's look at the length of stay claims with Heritage Health.

Again, this is the 12,928 claims

per person in the year 2011 who had at least one day of inpatient stay.

And you recall on this sample of almost 13,000,

the sample mean was 4.3 days;

sample standard deviation is 4.9 days and the median was two days.

So what I did here was sort of backwards from what I did before.

Here we had a large sample of real data to start with.

Now I just took a random subset of 200 of

these patients from the larger sample to mimic a sample of 200 from the population.

And here's what the histogram of these values looks like.

And you may recall from what we'd done before,

the histogram of the full sample of 13,000 patients was what we call

right-skewed and the majority of points were concentrated between one and two days.

And we see the exact same thing here;

it's an imperfect representation of that larger sample histogram.

It's not as detailed,

but it nevertheless captures that same essence.

The sample mean on these 200 patients is

4.1 days and the sample standard deviation is 4.7 days.

Neither is exactly equal to the sample estimates from the larger sample.

And now with 500 patients.

And this is a histogram of those 500 values randomly selected from the larger group.

We could see that if you superimpose this histogram on the previous one,

not everything will line up perfectly by any means,

but the essence of this histogram is the same.

We still get the same concentration of values on the lower numbers with that right tail.

And the sample mean for this particular random sample of

500 patients is 4.2 days and the standard deviation is 4.6.

Now I took a random sample of 1,000 patients from that larger group.

And again, the essence of the distribution of these points - if you compare this,

if you were to scroll back and look at the previous two slides,

you see this picture was more

detailed and a little more filled out than the previous ones,

but the essence is very similar.

That right-skew continues.

And this is important, right?

Remember, a sample, regardless of the size,

is our best representation of some process we can't directly observe.

So we want it to mimic that process and we don't

want it to change how it represents that process,

given the size of the sample we get.

So what we see here, at least empirically,

is that regardless of the sample size - if we only look at three examples -

the picture remains - or can retain the same characteristics.

So ostensibly, we would conclude off any one of

these samples that the underlying distribution

of the data points that we've sampled is right-skewed.
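The subsampling exercise can be sketched in code. The real claims data are not reproduced here, so the sketch builds a synthetic stand-in: 12,928 right-skewed stays in whole days, generated from an exponential distribution with an arbitrary scale. The bin edges and the helper names (`bin_label`, `shares`, `show`) are likewise made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(3)  # fixed seed so the run is repeatable

# Synthetic stand-in for the claims data: 12,928 right-skewed stays in whole
# days, each at least 1 day. The scale (2.8) is an arbitrary assumption.
full = [1 + int(rng.expovariate(1 / 2.8)) for _ in range(12_928)]

LABELS = ("1-2", "3-5", "6-10", "11+")

def bin_label(days):
    """Assign a length of stay (in days) to one of four coarse bins."""
    if days <= 2:
        return "1-2"
    if days <= 5:
        return "3-5"
    if days <= 10:
        return "6-10"
    return "11+"

def shares(values):
    """Proportion of stays falling in each bin."""
    counts = Counter(bin_label(d) for d in values)
    return {label: counts[label] / len(values) for label in LABELS}

def show(title, values):
    """Print a crude text histogram: one '#' per 2.5% of the sample."""
    print(title)
    for label, p in shares(values).items():
        print(f"  {label:>4} days | {'#' * round(p * 40)} {p:.0%}")

# Subsamples are drawn without replacement from the full synthetic dataset.
subsamples = {k: rng.sample(full, k) for k in (200, 500, 1000)}
for k, values in subsamples.items():
    show(f"random subsample of {k}", values)
show("all 12,928 stays", full)
```

The four text histograms should have the same overall outline, with most stays in the shortest bin and a thinning right tail, even though the individual percentages wobble more in the smaller subsamples.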

Now let's go back and look at a histogram of the entire original sample,

which itself is a sample from a larger population or process.

Now, this was all claims in a one-year period at Heritage Health,

but we could think of it as representing

the process at all such health plans for the year, or as

a one-year sample from multiple years' worth of such data as well.

But in any case, it's a large sample.

And what we had been sampling from for those smaller subsamples was this dataset.

And you can look at this original histogram we explored in the last section;

you can see that we continue - and you knew going

into this that the distribution was right-skewed, as we had discussed.

If you compare this histogram to

the last three histograms we looked at done on smaller groups of data,

you'd see that the characteristics across all four are similar.

The right-skew, et cetera.

This one will be more detailed than the previous

because there's more data points to put in it,

but the underlying characteristics are retained,

regardless of the sample size we've looked at.

So just a tabular form is not the best representation,

but I just wanted to look at a few special cases.

So what I did here was similar to what I'd done with the blood pressure data,

where I randomly took five samples from this Heritage Health data,

each with 200 patients and then I randomly took five samples,

each with 500 patients and I randomly took five samples,

each with 1,000 patients.

So this was done off the original sample of 13,000 using the computer.

And what I want to do is just show you what I've called the first set of samples,

the first one I took with each sample size.

Here is the mean length of stay for each of these samples.

So for the first sample I took of 200 persons,

the mean length of stay was 4.05 days.

For the sample of 500,

it was slightly larger at 4.19 days.

And for the sample of 1,000,

it struck a balance between the two at 4.15 days.

Let's go down to the fourth row, just for variety.

So this is, for each case,

the fourth sample I took randomly.

So the sample of 200,

the length of stay for - the mean length of stay was 4.03 days.

For that random sample of 500 observations,

it was 3.99 days.

And for the random sample of 1,000,

it was 4.14 days.

If you look at all these different scenarios - and this is

only five of the infinite possibilities,

you can see sometimes the mean goes down as a function of sample size;

sometimes it goes up and then comes back down;

sometimes it goes up.

There's no consistent pattern here.

So if I said - I've got two samples,

one with 200 patients and one with 1,000 patients,

both randomly selected, which one will have the larger mean? You can't tell me.

Again, though, if you were to focus your energies at

looking at the variation in the estimates across samples of the same size;

so the first sample of size 200 was randomly selected.

Then I selected another sample,

which ostensibly had different people in it, and the means vary.

If you look at the variation in the mean estimates across these five samples,

where each sample had 200 people,

it is actually greater than the variation in

the means where each mean was based on 500 people.

And when we look at the variation in the means based on 1,000 persons,

that's the least - or theoretically,

the least of the three.

But we can't predict how any individual value from

one sample of given size will compare to the random sample of another size.

There's no systematic link.

The same thing goes for standard deviations.

If I look at the first set of samples that I took - the standard deviation of

the length of stay measurements among the first 200 randomly selected was 4.68 days.

It increased - well, it didn't really increase,

but in the sample of 500 taken,

it was 4.78 days; slightly larger.

And then the sample of 1,000 - that first sample of 1,000 I

took randomly was the largest at 5.02.

But in the second round of samples,

the sample with 200 observations had the highest standard deviation,

followed by the sample with 500,

followed by the sample with 1,000.

So if you look at these five examples,

and you could look at many more had I done them,

there's no consistent pattern between the value we get and the sample size.

You can't get a smaller standard deviation by increasing your sample size.

And we wouldn't want to, because think about it.

All these numbers here are estimating the same underlying quantity,

just using different imperfect subsets of information.

They're all estimating the population standard deviation, sigma, which I can't observe directly.

So we'd hope - you know, we expect them to vary

because the samples contain different elements,

but we're hoping that regardless of the sample size,

they're estimating the same underlying thing.

Again, if you were to look at the variation of values across samples of the same size,

that variation will tend to decrease the larger the samples are.

But the values of the sample elements,

the variation of the sample elements,

the standard deviation is not systematically linked to the sample size.

So this is a really important point.

Notice that whether we're dealing with the means or the standard deviation estimates,

they don't systematically increase or decrease in value with increasing sample size.

However, I do want to reiterate - and this is something we're

going towards in the next series of lectures - that

the variation in these estimates does tend to decrease with increasing sample size.

But the individual values of the estimates do not tend to increase or

decrease systematically just because they're based on larger samples.

One last example is our Philadelphia temperature data.

Remember, the data were taken over 15 years and there was a fair amount of

variation because we had all seasons and the mean was 54.3 degrees Fahrenheit;

standard deviation was 17.8 degrees Fahrenheit.

So I took a random sample of 200 days across

this 15-year period and here's the histogram of those data points.

And this is small relative to the uber sample we had before,

which in itself was a sample from some larger population.

But you can see that the histogram sort of retains

those characteristics we saw of the entire sample in

that it tends to have

some lower temperatures that are less frequent than the bulk, which are larger.

The sample mean for these 200 data points

was 54.8 degrees and the sample standard deviation was 17.5 degrees.

So now, let's look at a random sample of

500 days from this Philadelphia temperature dataset,

larger than the 200 we

randomly selected before - or I randomly selected with the computer.

So you can see this histogram is a little more

detailed and filled out than the one of 200 observations,

but it still captures that similar left-skew.

Here are the sample characteristics with these 500.

And finally, this is the histogram that we saw before of

the entire sample of this 15 years worth of data, the 5,471 measurements.

You can see it's certainly more filled out than

those previous examples with 500 and 200 randomly selected values,

respectively; but this is still showing

that same tendency to having a left-skew with the less frequent lower temperatures.

And just to reiterate,

just to drive the point home one more time,

I did the same thing I had done with the previous examples where I

took multiple random samples of the same size and tracked their sample means.

I'll have you go through this,

but you'll see that there's

no particular pattern going from sample size 200 to 500 to 1,000.

But if you look across samples - mean estimates from random samples of the same size,

you will see the variation in those values tends to go down the larger the sample is.

That is, our estimates of the underlying true population mean get better or

more precise the more information the mean is based on.

But they don't systematically increase or decrease with larger samples.

So the same thing goes for

the sample standard deviation and since we've done this a couple of times now,

I'll let you look at this and convince yourself,

but you hopefully will see that if you look at,

for example - and remember,

this is just an index here;

the first sample I took was 200,

another sample was 500,

another random sample was of size 1,000.

And I did that five times.

And if you look at the values here,

there's no pattern to whether they increase or decrease given the respective sample size.

However - and you don't have to look at this

now, as we will certainly spend more time on it later -

if you were to look at the estimates across the five samples of

the same size - across the five samples based on

200 random observations or across the five samples based on 500,

et cetera, what you tend to see is that

the variability of the sample standard deviation estimates

across the samples goes down the more information each estimate is based on.

So the variation in the estimates across the five samples

tends to decrease the larger the samples are.

In other words,

the precision of the estimates gets

better and we'll talk about that as we move on in the course.

And just for fun, there's one more table here.

We hadn't done this with the medians before,

but I thought I'd just show you the median estimates from the same sets of

samples and you can see there's no consistent pattern of

increase or decrease there with sample size.

So let's wrap this up.

So a couple of key points here that I'm

going to drill into you over the rest of the term.

The distribution of sample values of continuous data should

imperfectly mimic the distribution of

values in the population from which the sample was taken.

So if I'm sampling from a population where the data values are right-skewed,

I'd expect my sample to have a right-skew distribution,

whether it was based on 40 observations,

100 observations or 10,000.

With regards to the distribution of sample values or summary statistics,

increased sample size will not systematically alter the shape of the sample distribution.

Like we said, right-skewed is right-skewed,

regardless of our sample size.

It will only result in a more filled out picture.

So we'll get more information in better detail,

but we're not going to see the characteristics

of the distribution change when we add more data.

With regards to the summary statistics,

increased sample size will not systematically alter the values of the sample statistics.

A sample statistic estimate, like a sample

mean, will vary from random sample to random sample,

but will not systematically get larger or smaller with increasing sample size.

So we can't get our data to be less

variable by taking a larger sample of individual measurements.

Their standard deviation will not depend on the size of the sample systematically.

However - and this is something you may have thought of before - if I said hey,

I've got information based on the sample of 100 and

information based on the sample of 1,000 and said which one do you want?

Assuming similar cost, you'd probably take the estimates based on 1,000.

Why? Well, more is better.

Well, what does that mean in terms of statistics?

Well, remember,

our mean is less influenced by extreme observations

the more data goes into it;

our mean becomes more stable,

and so does our standard deviation, the larger the sample size becomes.

So it doesn't necessarily get bigger or smaller as

compared to estimates from smaller samples,

but it becomes more precise and there's less variation in it.
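One way to put a number on "more is better" is to ask how often a sample mean lands close to the true mean. The sketch below assumes a hypothetical normal population (mean 123, SD 13) and an arbitrary closeness threshold of 1 unit. None of these values come from the lecture.

```python
import random
import statistics

POP_MEAN, POP_SD = 123.0, 13.0   # assumed population values, for illustration only
REPS = 1000                      # simulated samples per size (arbitrary choice)
TOLERANCE = 1.0                  # "close" means within 1 unit of the true mean

rng = random.Random(5)  # fixed seed so the run is repeatable

def hit_rate(n):
    """Share of simulated samples whose mean lands within TOLERANCE of POP_MEAN."""
    hits = 0
    for _ in range(REPS):
        m = statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))
        hits += abs(m - POP_MEAN) <= TOLERANCE
    return hits / REPS

rates = {n: hit_rate(n) for n in (100, 1000)}
for n, rate in rates.items():
    print(f"n={n:>4}: {rate:.0%} of sample means within {TOLERANCE} of the true mean")
```

Neither sample size pushes the mean up or down, but under these assumptions the larger samples should land near the true value far more often. That is the sense in which the estimate based on 1,000 is the better one to take.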

And we'll spend a fair amount of time on this in just a few lecture sets.

All right.

So in the next set of lectures or in the next lecture section, we're going to say well,

now that we've sort of fleshed out some characteristics of samples and

looked at how to quantify them and how to visually display them,

let's talk about how we can use these summary measures and

graphics to compare distributions of data from different subgroups.