A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right.

So in this section, I'm just going to show

you some examples of what happens when we apply

these specific properties of the normal distribution, to data

that's not normally distributed or not approximately normally distributed.

This is sort of a warning, just because sometimes these rules that

seem so cut and dried are misapplied to situations where they're not appropriate.

So upon completion of this lecture you will be able

to describe situations in which using only the mean and standard deviation of

a distribution of values to characterize the

entire distribution will not work so well.

Realize that these things we've called z scores are nothing special.

They're just a standardized measure of distance.

Understand that the z scores do not necessarily

align with the corresponding percentiles of a normal

distribution for data that doesn't approximately follow a normal distribution.

And then finally, choose the right approach

to estimating ranges for individual values and

computing percentages greater or lesser than a

specific value, using non normal data distributions.

So just to remind you, from the last bit about the normal distribution:

the normal distribution is a theoretical

probability distribution.

No real data are perfectly described by a normal distribution.

For example, in a true normal distribution, the

tails go on to negative and positive infinity, respectively.

No real data has that kind of range.

Although, as we've seen several times

now, of course, under a normal distribution

most of the data falls close to the mean, within two or three standard deviations.

As we saw in Section B, the distributions of

some data will be well approximated by a normal

distribution, and as we showed, in such situations

we can use these properties of the normal curve

to characterize aspects of the data distribution,

like estimating percentiles and ranges.

But the distribution of much continuous data that you'll see in

medicine, public health, et cetera, will not be well approximated

by a normal distribution.

In such situations, using the properties of the normal curve to

characterize these aspects of the data distribution will give you funny results.

And let's just take a look at some of the situations like this.

So here's an example from length of stay claims from the Heritage Health Plan.

These are all participants who had at least

one inpatient stay of at least one day in

the year 2011.

So this sample's rather large, constitutes over 12,900 claims.

And let me just show you what we're dealing with in this histogram.

This histogram shows the total length of stay distribution.

And you can see it's certainly not symmetric and bell-shaped.

It's pretty heavily skewed to the right.

And the summary statistics on this sample include our

estimate of the true underlying mean length of stay,

the sample mean, which is 4.3 days, and our estimated standard

deviation, based on these 12,900-plus claims, which is 4.9 days.

And I'm just going to superimpose a normal curve with that same

mean of 4.3 days and standard deviation of 4.9 days over our

histogram, and you can see that it's not a particularly good fit.

We're missing a lot of the left portion

of the curve.

So this accentuates again the right skew in our data.

Here's a box plot of these inpatient records, just to drive this all home,

just to show you that the distribution is

right skewed, that all of the outliers are larger

than the majority of the data.

Suppose we wanted to use only the sample mean and standard

deviation and, even though we've seen evidence to the contrary,

let's assume this data is well-described by a normal curve.

It's approximately normal, and let's use the

properties of the normal distribution to estimate

the 2.5th and 97.5th percentiles of length of

stay in the population these data were taken from.

In other words, the middle 95% of individual

values will fall between the 2.5th and 97.5th percentiles.

So if we apply the properties of the normal curve, we know we can roughly estimate

the 2.5th percentile by taking our sample

mean and subtracting 1.96, or 2, standard deviations.

If we do that, look at our lower bound here.

It's negative 5.5 days.

If we hold off for a moment on the fact that that doesn't make much sense

and compute the 97.5th percentile by taking the

same sample mean and adding two standard deviations,

we get an upper bound of 14.1 days.

So if we went with this,

based on this sample data, we would estimate that most of

the persons making claims in this healthcare population had lengths

of stay between negative 5.5 and 14.1 days in 2011.
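The arithmetic behind that interval can be sketched in a few lines of Python. This is only an illustration using the mean and standard deviation quoted above; the variable names are made up for the example, and 2 stands in for 1.96 as in the hand computation.

```python
# Summary statistics quoted in the lecture for the length-of-stay sample.
mean_los = 4.3  # sample mean, in days
sd_los = 4.9    # sample standard deviation, in days

# Normal-theory estimate of the middle 95%: mean +/- 2 SD.
lower = mean_los - 2 * sd_los
upper = mean_los + 2 * sd_los

print(f"Estimated middle 95%: {lower:.1f} to {upper:.1f} days")
```

The negative lower bound is the signal that the normal approximation does not suit these data.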

What the heck?

[LAUGH]

If you go and report this at a

national or international

conference on healthcare utilization, obviously

you won't be asked back.

And clearly you all know that length of stay can't be negative.

That would essentially mean that the hospital sends you a check, right?

And that can't happen. So what's going on here?

Well, the problem is we've ascribed normality

to data that's not well approximated by

a normal distribution, and we get these goofy results.

Not every situation is this visceral, where it hits you over

the head and it's so easy to see that you've made a mistake.

Just to rectify this, if we went and used the computer to get the observed

2.5th and 97.5th percentiles of these 12,900-plus values,

we get one day and 20 days, respectively.

And that would be the range we'd want to actually report.

We estimate that most people in this population, the middle 95%,

have lengths of stay between one and 20 days.

In this example, using the properties of the

normal curve to estimate the interval containing the

middle 95% of length of stay gave us

values for the claims population that are basically useless.

And again,

just to reiterate: in this situation, when you have

evidence that the underlying distribution that your

sample comes from is not approximately normal, just go

ahead and use the observed percentiles of the data.
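As a rough sketch of what "use the observed percentiles" looks like in practice, here is a small Python example. The data are simulated, not the Heritage Health Plan claims: a lognormal sample is assumed purely as a stand-in for a right-skewed variable that cannot be negative, and the lecture's own computations were done in Stata rather than Python.

```python
import random
import statistics

# Hypothetical right-skewed sample (simulated; NOT the lecture's claims data).
rng = random.Random(0)
sample = [rng.lognormvariate(1.0, 0.8) for _ in range(10_000)]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Normal-theory interval: mean +/- 2 SD.
normal_lower, normal_upper = mean - 2 * sd, mean + 2 * sd

# Observed percentiles: n=40 gives cut points every 2.5%, so the first
# cut point is the 2.5th percentile and the last is the 97.5th.
cuts = statistics.quantiles(sample, n=40)
observed_lower, observed_upper = cuts[0], cuts[-1]

print(f"mean +/- 2 SD:        {normal_lower:.2f} to {normal_upper:.2f}")
print(f"observed percentiles: {observed_lower:.2f} to {observed_upper:.2f}")
```

With skewed data like this, the normal-theory lower bound drops below zero while the observed 2.5th percentile stays positive, which mirrors what happened with the length-of-stay example.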

Suppose we were sitting back and trying to estimate, for future claims data,

the proportion of the claims population in 2011 with total length of stay

greater than five days,

and we were using this maybe to make projections.

So, you know, you might say, well this screams out let's create a z-score

and figure out where five days is relative to the rest of the distribution.

So we can go ahead and do this. We can first find

this distance measure, the standardized distance measure

that we've sometimes called the z score.

And what we're going to do is we're going to take that five days measurement,

subtract off the mean of this sample distribution, 4.3 days, and

then standardize it by the number of days in the standard deviation.

And if we do the math here, the difference in day units is 0.7.

When we convert that to standard deviations, 0.7 divided by 4.9 is 0.14.

So we have a situation where the cut-off we

want to look at, five days, we want to figure

out what percentage of observations in our population would

have length of stays of greater than five days.

We've translated this to a z score of 0.14 SD, so it's akin to asking, under

a normal curve, what percentage of observations are

more than 0.14 standard deviations above the mean.

And we did enough of

these exercises in the last section, so I'll let you

verify that the probability of getting an observation that is more than 0.14

SDs above the mean of a normal curve is 0.44, or 44%.

So if we falsely made this approximate normality

assumption, we would claim that a little less than half, 44%,

would actually have lengths of stay greater than five days.
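For anyone who wants to check that 44% without a normal table, here is a minimal Python sketch. It uses `NormalDist` from the standard library's `statistics` module; the variable names are illustrative, and the result is what the normality assumption implies, not what the data show.

```python
from statistics import NormalDist

# Summary statistics quoted in the lecture, in days.
mean_los, sd_los = 4.3, 4.9
cutoff = 5.0

# z score: distance from the mean in units of standard deviation.
z = (cutoff - mean_los) / sd_los

# Proportion above the cutoff IF the data were normal (upper tail area).
p_above = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, assumed-normal proportion above 5 days = {p_above:.2f}")
```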

However, if we actually go into these

data and look at some of the observed percentiles, just using the computer

to pick off some of the key percentiles (and again, I'll have that Stata

piece that you can look at if you're interested in how to do this),

I've just listed some commonly used percentiles, and you can see

that the 75th percentile is five days, the number we were just looking at.

The 50th percentile

is two days, so there's actually a gap in between those two.

So I'd actually like to see if five days also corresponds to an earlier percentile value,

because of the semi-discrete nature of

this data that we're treating as continuous.

So, I am going to flesh this out a little bit

more and pull up some other percentiles to understand the story.

And we can see, well, the 70th percentile was four

days and the 75th was five, so we're going to go with

five being approximately equal to the

observed 75th percentile of this distribution.

So, based on our actual analysis of

our data set, we estimate that approximately 25% of

the claims had a total length of stay greater

than five days, because that was our 75th percentile.

This percentage is a lot smaller than

that estimated 44% we got by falsely assuming

approximate normality and using only the mean and standard deviation.

So there's nothing wrong with using the

mean and standard deviation to characterize percentiles when there's

some evidence that your sample data come from

a population that has an approximately normal distribution.

However, you can see that if we don't have

that situation, then using the mean and standard deviation

is not particularly useful for

characterizing key points of the distribution.

Let's just drive this home with another example.

So here's another example.

Here are CD4 counts, the distribution of CD4 counts

for a random sample of 1,000 HIV

positive patients from a city-wide clinical population.

And looking at these data, you can see

it's not as explicitly right-skewed as the length

of stay data, but you can still see there's

a pretty substantial tail in the positive direction here.

The actual mean of the sample of data is 280 cells per millimeter cubed,

and the sample standard deviation is 198 cells per millimeter cubed.

And here's this distribution with a normal curve superimposed that

has that mean of 280 cells per millimeter cubed and

that sample standard deviation of 198. And just to show you another visual

presentation that reinforces this as right-skewed data, here's a box plot display.

And you can see that all of the outliers are larger

than the majority of the data, so again, another right-skewed distribution.

And let's suppose we just had stuck in our head, hey, mean

plus or minus two standard deviations will give us the middle 95%

of values, or estimate it for the population from which our sample came.

So let's estimate the 2.5th and 97.5th percentiles of CD4 counts

in this population, based on the mean and standard deviation of our sample.

So for the 2.5th percentile, we take the mean of 280 and

subtract 1.96, or for hand computations 2, standard deviations,

where the standard deviation is 198 cells per millimeter cubed, and

for the 97.5th percentile, we take that same sample

mean and add two of the same sample standard deviations.

And if you do the math on this, you can see that if we came out

with a statement here, we'd say we estimate that most of the population of HIV

positive persons from which our sample is taken

had CD4 counts between negative 116 and 676 cells per millimeter cubed.

Well, that's actually biologically impossible, at

least to have negative CD4 counts.

So again, we're getting an interval that doesn't make a lot of sense.

And you might say well John, why don't

we just truncate that interval at zero, or one?

Rule out the negative

values, say they can't happen, and say that most, 95%, of

the population falls between, say, 1 cell per millimeter cubed and 676.

Well, let's compare either that approach or the nonsensical result we got to

the actual observed 2.5th and 97.5th percentiles of these data.

If we do that, we get a 2.5th percentile

of 11, so we wouldn't do so badly if we truncated that other interval at around 1.

But the upper bound of 722 is higher than we would have estimated

using the mean plus or minus two standard deviations.
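To put the three candidate intervals side by side, here is a short Python sketch. All the numbers come from the lecture: the mean and standard deviation, plus the observed 2.5th and 97.5th percentiles, which are simply typed in here rather than recomputed, since the raw CD4 data are not part of this transcript. The variable names are illustrative.

```python
# Sample summaries quoted in the lecture (cells per millimeter cubed).
mean_cd4, sd_cd4 = 280, 198

# Normal-theory interval: mean +/- 2 SD.
normal_interval = (mean_cd4 - 2 * sd_cd4, mean_cd4 + 2 * sd_cd4)

# Same interval with the impossible negative lower bound truncated at 1.
truncated_interval = (max(normal_interval[0], 1), normal_interval[1])

# Observed 2.5th and 97.5th percentiles, as reported in the lecture.
observed_interval = (11, 722)

# How far the normal-theory upper bound falls short of the observed one.
upper_shortfall = observed_interval[1] - normal_interval[1]

print("mean +/- 2 SD:", normal_interval)
print("truncated:    ", truncated_interval)
print("observed:     ", observed_interval)
print("upper bound shortfall:", upper_shortfall)
```

Truncation patches the lower end, but it does nothing for the upper end, where the normal-theory bound sits 46 cells per millimeter cubed below the observed 97.5th percentile.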

So again, when you use the properties of normality and apply them to data

that's not well-described by an approximate normal

distribution, you're going to get results that, A,

sometimes don't make sense, as we've seen with this example and the

previous one, and B, don't correspond to what's going on in the data.

So let's just use this to look at another question.

Historically, guidelines about when to start antiretroviral therapy have

changed over time, sometimes very, very quickly.

Clinical recommendations have changed, and a lot of times they've been

suggested as a function of CD4 count. And they vary by time and country.

So at one point in time, it was recommended that antiretroviral therapy

be initiated for persons with CD4 counts lower than 350 cells per millimeter cubed.

So based on this sample of 1,000, let's

estimate the proportion of HIV-positive subjects in the population

with CD4 counts less than 350 cells per millimeter cubed.

So let's first pretend that we didn't see that histogram, and

let's just assume that the distribution of CD4 counts is approximately normal.

If we translate 350 cells per millimeter cubed to units of

standard deviation, we can find out where this value is relative to

the sample mean of 280.

And we can do this in units of standard deviation.

So let's find the z-score or the standardized distance.

So, the actual distance in original units, 350 minus 280, is 70 cells

per millimeter cubed. 350 is 70 cells greater than the mean.

But if we standardize that to units of standard deviation, where there are

198 cells per millimeter cubed per standard deviation, this 350 cells per

millimeter cubed is approximately one third of a standard deviation above the mean,

0.35 standard deviations above the mean. So

again, I'll let you verify that the proportion of observations that are less

than 350, under this assumption, would be

the proportion of observations that are less than

0.35 standard deviations above the mean in a normal distribution.

And if you fill out that left part of the distribution, all the values to the left

of 0.35 standard deviations, this proportion is approximately 0.64, or 64%.

So what we would say based on this analysis,

assuming normality even if it's not appropriate, is

that we would estimate that roughly 64% of our population is eligible to start

antiretroviral therapy if our cutoff is being below 350 cells per millimeter cubed.

If we actually go and look at the empirical percentiles of

these data, 350 corresponds to the observed 70th percentile.

So actually,

if we used this approach, we would estimate that roughly

70% of our population was eligible to start antiretroviral therapy.

So if we had used the properties of the normal distribution

with these data like we did on the prior slide, we would

underestimate the proportion of persons qualifying for antiretroviral therapy by

roughly 6 percentage points, and that can cause problems in terms of allocating resources and such.
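The comparison between the two estimates can be sketched in Python as follows. The normal-theory proportion is computed from the quoted mean and standard deviation using the standard library's `NormalDist`; the observed proportion of 0.70 is taken directly from the lecture's statement that 350 is the observed 70th percentile, not recomputed from data. Variable names are illustrative.

```python
from statistics import NormalDist

# Normal curve with the sample mean and SD quoted in the lecture.
assumed = NormalDist(mu=280, sigma=198)

# Proportion below 350 cells per millimeter cubed IF CD4 counts were normal.
p_normal = assumed.cdf(350)

# Observed proportion below 350, as reported in the lecture (70th percentile).
p_observed = 0.70

# Size of the underestimate, in percentage points.
gap_points = (p_observed - p_normal) * 100

print(f"assumed-normal proportion below 350: {p_normal:.2f}")
print(f"observed proportion below 350:       {p_observed:.2f}")
print(f"underestimate: about {gap_points:.0f} percentage points")
```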

So again, if you are

interested in the mechanics of this, or how I got these percentiles,

check out the accompanying Stata video; if not, just take them on faith.

But I just wanted to show you these examples

because many people like rules, and they cement them in their brains

pretty quickly, and forget that they are only valid under certain conditions.

So a lot of people think instantly, a mean plus

or minus two standard deviations will always give me the middle 95% of

values in the distribution, but that's

conditional upon the distribution being approximately normal.

So I did actually ask you to verify some

of the computations with the normal curve in this section.

If you have trouble with that, in Section D I'll take you through in detail

the process of looking things

up in normal tables to answer these questions.

Remember, that's not the goal of the course,

looking things up in tables, but it will

give you perspective on how the curve is

shaped, and the distribution of values under the curve.
