A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

207 ratings


From this lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions, and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So let's ponder a little bit further.

What is a Confidence Interval, and how should it be interpreted?

We've gone over the interpretations of confidence intervals for means, proportions, and incidence rates, with the daily examples we've looked at in sections A and B, but let's take this a little bit further.

So in this lecture section, hopefully you'll build on the conceptual and practical understanding you already have of how to interpret a confidence interval for a single population parameter, like a population mean, proportion, or incidence rate.

Think critically about when a confidence interval is necessary versus not.

And gain some insight as to why the 95%

confidence interval became the standard, if you will, for research.

There are two important interpretations to consider: the scientific or substantive interpretation, that the confidence interval gives us a range of plausible or possible values for some unknown truth, like an unknown true mean or proportion.

And then the technical interpretation, which is that the method by which we create 95% confidence intervals from single samples works 95% of the time, in terms of the interval containing the unknown truth.

In other words, most of the samples we could take from a population will yield an interval that includes the true value being estimated by the sample, if we employ this approach.

So let's just talk about confidence intervals in general once again.

The confidence interval for a population-level parameter, a mean, a proportion, or an incidence rate, is an interval that factors in the uncertainty in the estimate we have for the parameter, because the estimate is based on data from an imperfect sample.

A confidence interval can be interpreted as a

range of possible values for the unknown truth.

The sample based estimate is our best estimate,

the single number estimate, but the interval recognizes

the potential uncertainty in that estimate given that

it's based on imperfect data from the larger population.

And as we've said before,

confidence intervals can allow for different

levels of uncertainty, 90%, 95%, 99% etc.

However the standard is 95%, and for most cases, unless otherwise specified,

that is what we will use as the norm in this class.

So, when a confidence interval is created from a

sample of data the resulting interval either includes the

value of the unknown, true parameter at the population

level or the interval does not include the truth.

And unfortunately, we will never know if our confidence interval contains

the unknown truth we want to learn about or if it doesn't.

So where does the 95% part come in? The 95% or

some other percent if another level of interval is used, refers to

how often the approach to creating a confidence interval works in general.

In other words, for 95% of the

samples randomly selected from a population, the 95%

confidence interval created from the sample will contain

the true value of the parameter of interest.

For 5% of such

samples, the 95% confidence interval will not contain the true value.

So let's just look at an empirical example.

Using the Medicare and Medicaid data, we looked at the distribution of discharge

counts from hospitals in the US in 2011 for urinary and kidney infection DRGs.

And the true average discharge count in the US hospital population in 2011 was 69.2 discharges.

So what I did here to simulate the sampling process was take 100 random samples from this population of all US hospitals in the year 2011, where each sample had 250 hospitals.

So for each of these hundred samples of 250 hospitals, I constructed a 95% confidence interval for the true average discharge count based on the resulting discharges in the sample. Now what this graphic shows here, for each of the samples, is the estimated mean number of discharges in the sample, and then it gives the confidence interval.

So if you look across this, there are 100 lines here, 100 estimates with 100 intervals placed around them.

If you look carefully, of these 100 intervals, 96 of them actually cover

the truth, include the truth, which is

represented by this horizontal blue line here.

And four of them, which are highlighted in red if

you're looking at this in color, do not include the truth.
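The coverage behavior shown in that graphic can be sketched with a small simulation. This is a minimal sketch only: the actual CMS discharge data are not reproduced here, so a synthetic right-skewed population stands in for the hospital discharge counts, and all specific numbers below are assumptions for illustration.

```python
import random
import statistics

# Synthetic stand-in for the population of hospital discharge counts
# (the real CMS data are not reproduced in the lecture).
random.seed(1)
population = [random.lognormvariate(4, 0.6) for _ in range(5000)]
true_mean = statistics.mean(population)

n, reps, covered = 250, 100, 0
for _ in range(reps):
    sample = random.sample(population, n)
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5       # estimated standard error
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se    # 95% confidence interval
    if lo <= true_mean <= hi:
        covered += 1

print(f"{covered} of {reps} intervals covered the true mean")
```

Run repeatedly with different seeds, roughly 95 of every 100 intervals should cover the true mean, mirroring the 96-of-100 result in the lecture's graphic.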

In real life, were I actually only privy to one sample of 250 hospitals from this larger population, I would likely get one of the samples here, or something close to it.

I don't know which one I would have; ergo, I don't know if I would be one of the lucky 96 whose resulting interval included the truth, or one of the four whose interval did not.

And that's where the chance comes in: once I've gotten my sample and created the interval, I don't know if it actually includes the truth I'm trying to make a statement about or doesn't.

But I know the method used yields an

interval that includes the truth most of the time.

95% of the time.

Just a reminder, in this example we know the population-level data.

It was from the CMS data of all hospitals in the US in 2011.

And simulation was used to repeatedly sample.

In real life, again, we only get one sample to create one interval.

And we don't know whether the interval actually includes

the truth or not but the method we use

is such that for most of the samples we

could get, 95%, the resulting interval will include the truth.

In real-life research, however, the truth is not directly observed.

Researchers can only estimate the truth about

a population from an imperfect data sample.

And again, many times a single-number summary for a population will help

us understand something about the distribution

of individual values in that population.

Whether it be a population mean or population proportion, or

a population incidence rate.

So the confidence interval provides a method for combining

our best sample based estimate of this population level summary

measure with an estimate of the uncertainty in this

quantity with regards to estimating that true population level measure.

Let's go back and look at some examples of creating confidence intervals and debrief on them a little bit more.

So remember the length-of-stay claims at Heritage Health, for patients with an inpatient stay of at least one day in 2011.

So the 95% confidence interval for the true mean length of

stay based on these data was 4.2 to 4.4 days.

So what's the bad news about this interval?

Well, we or the researcher or the reader of the

publication will never know, if the true mean length of stay

describes the length of stay experience for all such patients.

Well, it's included in this 95% confidence interval.

The good news is, however, the method used to create the interval works most of the time.

So even though we'll never know whether the interval includes the truth we're getting at, we can pretty much take it and assume it includes the truth. Or that's how we interpret it, because the chance of getting a sample that yields an interval including the truth is so high.
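As a reminder of the mechanics, an interval like this comes from the sample mean plus or minus roughly two estimated standard errors. Here is a minimal sketch; the lecture does not give the Heritage Health sample mean, standard deviation, or exact sample size, so the numbers plugged in below are illustrative assumptions only.

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """95% CI for a population mean: sample mean +/- z estimated standard errors."""
    se = s / math.sqrt(n)
    return xbar - z * se, xbar + z * se

# Illustrative numbers only; the actual Heritage Health summary statistics
# are not given in the lecture.
lo, hi = mean_ci(xbar=4.3, s=4.0, n=12000)
print(f"95% CI: {lo:.2f} to {hi:.2f} days")
```

With a sample this large, the standard error is tiny, which is why the reported interval (4.2 to 4.4 days) is so narrow.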

Here's another example that we've looked at, the response to therapy in

the random sample of 1,000 HIV positive patients from a citywide population.

Recall, 206 of the 1,000 persons, for a proportion of 0.206 or 20.6%, responded to the antiretroviral medication.

And a 95% confidence interval for the response proportion, and

we created this in lecture set B, was 18% to 23%, or 0.18 to 0.23.

So again, the bad news is, the researchers will never know if the true proportion who would respond to the therapy is in the given interval.

But again, the good news that outshines this is that the method used to create this interval works most of the time.

And we can work with this and treat it as our best statement about

the unknown truth given the data we have on hand in our single sample.
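This interval can be reproduced from the numbers given in the lecture: the sample proportion plus or minus 1.96 estimated standard errors, where the estimated standard error is the square root of p-hat times (1 minus p-hat) over n. A quick sketch:

```python
import math

def prop_ci(x, n, z=1.96):
    """95% CI for a population proportion: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = prop_ci(206, 1000)  # 206 responders out of 1,000 patients
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # prints "95% CI: 0.18 to 0.23"
```

The result matches the 0.18 to 0.23 interval created in lecture set B.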

So you might be asking are confidence intervals always necessary?

And in many cases, the answer is yes, but not always.

So you might say, well, John, in that Heritage Health data example, to take an instance, you essentially had data on all patients who had a length of stay of one day or more in the Heritage Health Plan for 2011. So why do you need to create a confidence interval if you essentially have the population of such patients in 2011?

And that's a great question.

So, it depends on what we're using these data for.

If we only wanted to describe the experience

of all patients in that one year then maybe

we don't need to deal with the uncertainty

factor because essentially we have the population level data.

But if we want to actually think of this one year of data as the manifestation of some larger random process by which patients in the health plan have lengths of stay, and we want to compare characteristics of the process which governs the distribution of lengths of stay, for example, this year's to previous years', then we would need to incorporate the uncertainty.

These 12,000-plus observations we saw are from a universe which includes an entire distribution of possible lengths of stay.

And if we want to compare characteristics of this universe or process in 2011 to other years, or to other health plans, which are also ostensibly sampling, if you will, patients from a larger process to fill out data for one year, we do need to incorporate the uncertainty.

Because what we're making a statement about is the random process, or population if you think about it that way, that governs the data we've seen in one year.

If we only want to talk about this health plan, and this

year of data in its own vacuum, then we don't need to

consider the uncertainty.

So in many situations where you might think that we have observed the entire population of data, depending on what we want to use this information for, we still may want to factor in uncertainty that comes from recognizing that these data are manifestations of a larger process of which we're seeing a snippet.

So for example,

if we wanted to look at characteristics of MPH students in the year 2013 at Johns Hopkins, well, ostensibly we have a certain amount of data on all such students.

But if we wanted to think of this year's class

as a manifestation or a sample from some larger process governing

the distribution of MPH students at Hopkins over time, then

we may want to consider the uncertainty in the estimates when comparing

our class to classes from other schools in this year, or to

previous years at Hopkins.

Finally, where does this 95% come from, and why is it the industry standard? Now, that's a great question.

Ostensibly, the reason 95% became popular is that, before computers, crunching numbers representing areas cut off under the normal curve was difficult to do, and tables had to be created by extensive integration by hand.

And so in the 1800s it was known and established that 95% of the values under a normal distribution fell within plus or minus two standard errors, well, really 1.96, but rounded up to two for hand computations.

And because that was well known, and 95% was settled on as a reasonable level of coverage, this is how this whole process started.

And ostensibly it's been carried on by tradition, and because 95% seems a reasonably high probability of capturing the truth.

We talked about in the previous sections how, if we wanted to extend, for example, our level of coverage or confidence to 99%, there's a hefty price to pay in terms of additional width of the interval versus 95%, just to gain 4% more coverage probability. (This is not drawn to scale, by the way.)

And we said we had to add an extra 0.5, almost 0.6 standard errors on either side

of the 95% confidence interval to get to the point where we have 99% confidence.

And that's a hefty price to pay, in terms of width, to gain those four percentage points.
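That extra width can be checked directly from the normal quantiles: the 95% interval uses about 1.96 standard errors on each side, while the 99% interval uses about 2.58, roughly 0.6 more. A quick check in Python:

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.975)  # half-width of a 95% interval, about 1.96 SEs
z99 = NormalDist().inv_cdf(0.995)  # half-width of a 99% interval, about 2.58 SEs
print(f"extra half-width: {z99 - z95:.2f} standard errors")  # prints "extra half-width: 0.62 standard errors"
```

So each side of the interval grows by about 0.6 standard errors to buy only four percentage points of extra coverage.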

So somehow or another, the culture, coupled with this artifact from the pre-computer age and the difficulty of computing things under the normal curve, has accepted 95% as the industry standard.

And it's the

[UNKNOWN]

confidence intervals in research and other venues.

And so that's what we'll see in publications

and that's what we'll use in this course.

So in summary, we're just starting to crack this idea of inference.

The idea of extrapolating from our sample to a larger population.

And we're working on this idea of incorporating

uncertainty into the process.

And for now, it may seem a little abstract in certain cases to talk about a single

population and a range of possible values for some summary measure

on that entire population, be it a mean, a proportion, or an incidence rate.

But keep thinking about what we're doing here.

We're trying to give a range of possibilities for an unknown truth.

This will prove, as it already has, to be useful in quantifying some characteristic of a population.

Whether it be the health of a population, through its cardiovascular health, based on average blood pressure.

Whether it be the anticipated response rate to a therapy, or quantifying the burden of disease in a population, etc., while accounting for the uncertainty in the estimates we have.

But this will also become important, and a little more transparent, when we start comparing populations between groups, and we need to account for the uncertainty in their estimates before we can make a clear distinction about whether the populations differ on some key summary measure, which would in turn result in a difference in the distribution of individual values or outcomes in the populations.