A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

235 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Hello, and welcome back.

So, in this section I'm going to try and

take you through some practice exercises related to our understanding of how

the mean and standard deviation of approximately normal data can be used to

estimate percentiles and percentile ranges in the data.

As opposed to actually going into a computer and

computing those percentiles empirically based on the data.

And so we'll just run through some examples.

The way this is going to work, is I'm going to set the stage for these examples.

And then I'm going to suggest you pause the video,

do the exercises, and then resume, and I'll take you through the solutions.

So in some sense, this is your one stop shopping for

some exercises regarding the normal distribution.

Okay, so let's look at the first scenario.

So, I'm going to pull some data from the 2011 Youth Risk Behavior survey, and

that contains self-reported data on a bevy of things, but we're going to focus, for

the moment, on self-reported weight and height values for

a large sample of US youth, 12 to 18 years in age in the year 2011.

And these values can actually be used, you know, they've got weight and height,

to compute body mass index, which is a one point in time measure of body mass.

So for this exercise, this first set of questions,

we're going to focus on the data for the 1,860 16 year olds in this sample.

And so I'll just give you some summary statistics.

The mean BMI for this group is 23.6.

So our x bar for this sample is 23.6.

And the standard deviation estimate based on these 1860 values is 4.9.

So, in youth there is actually no singular cutoff for obesity with a BMI.

The standard approach for persons under 18 years old is to use

the 95th percentile for the BMI values for a sample of children of a given age.

So based on the information on the previous slide, I'd like you to

estimate the cutoff for obesity in the population of 16 year old males.

In other words, using the mean and standard deviation and

assuming normality of these data, estimate the 95th percentile for this group.

And then in adults.

In adult populations, BMIs between 18.5 and

24.9 are considered indicative of healthy weight.

So I'd like you to use these data on 16-year-old boys to actually estimate

the percentage of 16 year old boys, whose BMI falls between 18.5 and

24.9, again assuming the BMI data is approximately normally distributed.

Okay, so now I'm going to let you loose to do this and I'm going to suggest you pause

the video, work on these and then when you resume, we'll go through the solutions.

Originally I planned to bring a guitar and sing to you while this went on, but

the video crew wisely said they would not record me if I did that.

So I'm going to suggest you pause for

the moment and we'll start up again in a minute.

So getting in the question I first posed, the first part of exercise one,

I said in youth less than 18 years old there's no singular cutoff for obesity.

The standard approach is to use the 95th percentile for

the BMI values for children of a given age.

So I said based on the information given before, the mean and

standard deviation for this sample of 16 year old males, I wanted you to

estimate the 95th percentile or the cutoff for obesity in this population.

So I'm going to use a table, just for variety, and

again this course isn't about looking things up in tables,

you can certainly find computer applications that will do the same thing.

But just to give you a little practice with this logic and

the idea of using tables, I'm going to focus on that second table I showed you in

the first set of lectures regarding the normal curve.

And what this table gives us for any number of

standard deviations is the percentage of values that are less than that.

So it gives us that percentile.

But it only focuses on percentiles of the 50th and below.

And so we're going to have to use the logic of

the normal curve to translate this into what we want.

So I'm going to blow this up a little bit, and

focus on the line that's going to tell us what we need.

And if you actually go and look at this table it starts at 0 standard deviation,

and that's the 50th percentile, that's right at the median, and

then looks at numbers of standard deviation less than 0.

And it's going to be hard to see even on this blown-up slide,

but if you actually do a little bit of the work and look at this table or

another one like this, you can find the value that cuts off 5% below it.

See, here's my little drawing, 5% below it. In the table the entry is

0.0495, which is approximately equal to 0.05.

The number of standard deviations that cuts off that much to the left is negative 1.65.

Or, in other words,

the 5th percentile is defined by negative 1.65 standard deviations.

So.

If we want the 95th percentile we're going to use the symmetry of

this normal curve.

And know that, you know, if we went 1.65 standard deviations above

the mean we'd have cut off 5% in the upper tail, and

95% of the values under the curve would be less than that 1.65 standard deviation.

So, we can use that fact now to estimate the 95th percentile for these data.

Let's go to the next slide and let's write this out.

So, again we have the mean, so we can basically get the 95th percentile

using this logic that we've talked through: we could take the mean

plus, since we're going for the 95th percentile, the upper cutoff,

1.65 times the estimated sample standard deviation.

If we plug in our values, 23.6 is the mean BMI for

this sample plus 1.65 times the estimated standard

deviation of 4.9, we get a 95th percentile of 31.7.

So, this is what clinicians would use based on these data.

This is what they'd use as the cut off for obesity in defining that for

16 year old boys.

Okay?
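The arithmetic just described can be checked with a short sketch. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for the printed normal table; the variable names are illustrative and not from the lecture.

```python
from statistics import NormalDist

mean_bmi = 23.6  # sample mean BMI from the lecture
sd_bmi = 4.9     # sample standard deviation from the lecture

# Number of standard deviations that cuts off 95% below it on a standard normal curve.
z_95 = NormalDist().inv_cdf(0.95)  # about 1.645, which the table rounds to 1.65

# Estimated 95th percentile using the table value and the computed value.
p95_table = mean_bmi + 1.65 * sd_bmi
p95_exact = mean_bmi + z_95 * sd_bmi

print(round(z_95, 3), round(p95_table, 1), round(p95_exact, 1))  # both percentiles round to 31.7
```

Using 1.65 from the table or the slightly more precise 1.645 makes no practical difference at one decimal place.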

So in this next question I said, well in adults, BMIs between 18.5 and

24.9 are considered indicative of healthy weight.

This is not the applicable range to 16 year olds, but for,

just to understand what that would mean about the distribution for 16 year olds.

I asked you to estimate the percentage of 16 year old

males who have BMIs in this range.

So what I'm asking for, we assume normality of the BMI distribution, which is

the reason I'm not showing you the data directly in these slides, but, well,

you know, the histograms suggest that they roughly follow a normal distribution.

We've said, well, it has a standard deviation of about 4.9, and

the estimated mean, based on these 1,860 males, is 23.6.

Now, what I want you to do is actually go in and

try and estimate, using only the mean and standard deviation,

the proportion of such boys who have BMIs between 18.5 and 24.9.

This is not drawn to scale.

This proportion here.

So how are we going to go about doing this?

Well, we know, let's think about this.

We know that we can translate any value under an approximately normal

curve to an estimated percentile.

Given the mean and standard deviation.

So we want to figure out, you know, what percentile values of 18.5 and

24.9 correspond to and then we can

use that to answer what percentage of the curve falls between those two percentiles.

So to start, for the lower bound, 18.5,

let's translate this into a distance in terms of standard deviation,

how far does it fall from the mean of the sample in terms of standard deviation.

So, if we do that we can take, we can call it z if we want,

take the difference between that value and the mean, 23.6, and

convert that to units of standard deviation.

There's 4.9 units in the standard deviation.

So, if you trust me on this and you can verify this on your own.

If you do this, this turns out to be about negative 1.04.

So in this distribution the lower bound for healthy weight is

about 1.04 standard deviations below the observed mean in these boys.

For the upper bound,

let's see where that falls relative to the observed mean.

So we do the same kind of calculation:

24.9

minus 23.6, and we'll divide that by the number of units in the standard deviation, 4.9.

And if we do this,

you should get something close to about 0.265, or roughly 0.27.

So our result is that this upper bound is about 0.27, or 27

one-hundredths, of a standard deviation above the mean of these boys.

So let's translate these into percentiles.

I'm going to, I'm going to show you a schematic here, but

I'm going to leave it to you to verify with the tables I've given you or

in online, you can find plenty of online calculators that'll do this for you.

But if we do this, if you actually use the tables, I'm going to erase this here,

if you actually use the tables to translate these standard deviation distances, or

z-scores, into percentiles under a normal curve,

what you're going to find is that that lower value of negative 1.04,

roughly 15% of the measurements under a normal curve fall lower than that,

and the other remaining 85% are above that.

And that makes sense, because we've seen, you know,

that one standard deviation below the mean cuts off roughly 16%.

So this negative 1.04 corresponds to,

roughly, the 15th percentile of the distribution.

And if you do the same for the upper bound, the 0.27, with a little work,

with a little bit of work you can find that roughly 61% of the values

under a normal curve are less than 0.27 standard deviations above the mean.

So this 0.27 corresponds roughly to the 61st percentile of the distribution.

So the percentage of values that would fall between the 15th percentile and

the 61st percentile is really just the difference in those two percentages.

61% minus 15% would give us roughly 46%.

So almost half of the boys,

46% of them have what would be considered a healthy weight based on adult standards.
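The two z-scores and the resulting percentage can be reproduced in a few lines. This is a minimal sketch that assumes `statistics.NormalDist` in place of the tables or online calculators mentioned above; the names `z_low`, `z_high`, and `share` are illustrative.

```python
from statistics import NormalDist

mean_bmi, sd_bmi = 23.6, 4.9
low, high = 18.5, 24.9  # adult "healthy weight" BMI bounds from the lecture

# Distance of each bound from the mean, in units of standard deviation.
z_low = (low - mean_bmi) / sd_bmi    # about -1.04
z_high = (high - mean_bmi) / sd_bmi  # about 0.27

# Percentile of each bound under a normal curve with this mean and SD.
bmi = NormalDist(mu=mean_bmi, sigma=sd_bmi)
p_low = bmi.cdf(low)    # roughly 0.15
p_high = bmi.cdf(high)  # roughly 0.60 to 0.61

# Share of the curve between the two bounds.
share = p_high - p_low
print(round(z_low, 2), round(z_high, 2), round(share, 2))  # share is roughly 0.46
```

Working from unrounded z-scores shifts the individual percentiles slightly, but the difference still comes out at about 46%.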

Okay, now we are going to do another exercise based on

approximately normally distributed data, or the assumption thereof, and

this is going to be actually from a clinical sample of HIV patients

that we've looked at previously.

And we're going to look at the subjects' log base 10 viral load prior to

initiating antiretroviral therapy.

And we'll also have the information on whether the subject responded to

the therapy measured at 16 weeks,

and it's defined as a 100-fold decrease in HIV-1 viral load.

So what does the 100 fold decrease mean, and how does it relate to logarithms.

So let me just take a moment to, before I set up the questions that I

am going to have you do just to remind you of something about base 10 logarithms.

Okay, so just a reminder about base 10 logarithms, just a short review,

so for base 10, this may be the easiest base to work with in logarithms, but just

some key things, let me just give you some examples and then we will talk about this.

So, the log in base 10 of the number 10 is just

1 because the base of 10 raised to the first power gives us back 10.

A log of 1,000 for example, in base 10 is 3.

Because if we take the base of 10 to the third power we get back 1,000.

So a decrease, right?

A decrease, a 100 fold decrease in the HIV viral load would

be akin to something like, going from 1,000 down to 10.

Because the ratio of this new value, 10 to a 1,000 is 0.01 or 1 one hundredth.

That's what they mean by a 100 fold decrease.

What would that translate to on the log scale?

Well, on the log scale this would translate to a difference in logs.

The log of 1,000, their prior measurement, is 3, and the log of 10 is 1, so the difference is 2.

So it would be a decrease by two for the log scale.
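As a quick check of this logarithm review, here is a small sketch using Python's `math.log10`. The values 1,000 and 10 are the lecture's own illustration; the variable names are assumptions made for the example.

```python
import math

before = 1000  # illustrative viral load before therapy
after = 10     # illustrative viral load after therapy

# A 100-fold decrease means the new value is 1/100 of the old one.
ratio = after / before  # 0.01
fold = before / after   # 100.0

# On the base 10 log scale the same change is a difference of 2.
log_drop = math.log10(before) - math.log10(after)

print(ratio, fold, log_drop)
```

Any 100-fold decrease gives the same drop of 2 on the log scale, whatever the starting value, which is why the log scale suits relative comparisons.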

So this is just a, so why do scientists do this?

Why do they put viral loads in the log scale?

Well generally they're interested in relative comparisons,

the order of magnitude difference between before and

after measurements or comparing patients with HIV.

And so they're less interested in the absolute difference in the actual

viral load values and more interested in the relative comparison.

So, looking at the difference in the log scale tells us that.

Looking at these values, the log scale gives us a reference point for

the order of magnitude.

Those with log values between 2 and

3 have underlying viral load values between 100 and 1,000.

Those with log values on the order of 4 have viral load counts on

the order of 10,000.

So that's why we're looking at the log of these data for this exercise.

So what I want you to,

I'm going to tell you a little bit more about the distribution here.

Data were collected on 1,000 HIV positive subjects from this

clinical sample as we discussed before.

And the information includes each subject's log base 10

viral load prior to initiating anti-retroviral therapy.

And, we also know whether or not each subject responded to therapy measured at

16 weeks and defined as the 100-fold decrease we talked about before.

So let me just give you some summary statistics on the pre-therapy log base 10

viral load distributions for the responders and non-responders.

So the responders, their average,

their average log viral load in base 10 is on the order of 4.8.

Close to 5, which would correspond to a viral load measurement close to 100,000.

And this s of 0.57 measures the variation between log viral load values for

this group of responders.

For the non-responders among these 1,000 you

can see that their log viral load is lower on average, closer to 4, at 4.2.

And there's more variability in these values.

Slightly more variability as the sample standard deviation is on

the order of 0.7, 0.68.

What I'd like you to do in these exercises is I'm

going to ask you a couple questions.

And I want you to see if you can use only that mean and

standard deviation with the assumption of normality to answer these.

So, the first question I'm going to ask you is the log viral loads in the samples

of responders and non-responders are approximately normally distributed.

Based on these results,

I want you to estimate a range of log viral load values for the middle 95% in

the population of those with HIV who would respond to therapy and those who wouldn't.

So this is just that classic middle 95% of the distribution values.

And now I'd like you to estimate the 90th percentile of the pre-therapy log

viral load for those who didn't respond.

Using again only the mean and the standard deviation, how can you relate the mean and

standard deviation with the assumption of normality to estimate the 90th percentile.

The value that's greater than 90% of the observations but

less than the remaining 10%.

And then finally, I'd like you to suppose for a moment that the distributions of

the log viral loads for all 1,000 subjects in the sample are actually right skewed,

maybe for both the responder and non-responder groups.

Based on only the given values that I've given you, the mean and

standard deviation, can you estimate the range of log viral load values for

the middle 95% in the populations of those who responded to therapy, and

those who did not respond to therapy?

So, again, I'm going to fade out and I suggest you pause the video for

the moment, but I want you to think about what you can and can't do with

the information I've given you and where you can use it to answer these questions.

So that means using the mean and standard deviation coupled with the assumption of

normality to specify these percentiles.

So, welcome back.

Now let's look at some solutions to these exercises I've asked you to do.

So the first one was I asked you, I said the log viral loads in the samples of

responders and non-responders are approximately normally distributed.

So based on these results, I said, estimate a range of log viral load values for

the middle 95% of persons in the populations who responded to therapy and

those who didn't.

So let's first do the responders.

And so

this is the basic classic middle 95% of an approximately normal distribution.

Using only the mean and

standard deviation, we can estimate that by taking the mean and adding and

subtracting two standard deviations based on our sample data.

So for this group of responders, the mean

viral load on the log scale was 4.8, and we take that plus or minus 2 standard deviations of 0.57.

We could use 1.96 if we were being exact, but for

hand computations, it's easier to work with 2.

And this gives us an interval,

you can check my math, of 3.66 to 5.94.

So we'd say that most of the persons who respond to therapy had

starting log viral load values of between 3.66 and 5.94.

Somewhere on the order of less than 10,000, all the way up to,

close to a million.

If you actually looked at the original scale.

Now let's do the same thing for the non-responders.

We'll do the same approach, we'll take the observed sample mean of 4.2 on

the log scale, plus or minus 2 estimated standard deviations.

And if we do the math correctly, hopefully I did,

we'll get an interval of 2.84 to 5.56.

So most individuals who didn't respond had starting measurements on the log

scale between about 2.8 and 5.6.

That's just exploiting the classic property of the normal curve for

roughly normal data.

Most, about 95% of the distribution, the middle, will fall within plus or

minus two standard deviations of the mean.
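Both intervals follow from the same mean plus or minus 2 standard deviations rule, sketched below. The helper name `middle_95` is hypothetical, and the back-transformation to the original scale is included only to illustrate the "less than 10,000 up to close to a million" remark.

```python
def middle_95(mean, sd, k=2.0):
    """Estimate the middle 95% of roughly normal data as mean +/- k standard deviations."""
    return mean - k * sd, mean + k * sd

# Summary statistics on the log base 10 scale, as given in the lecture.
resp_low, resp_high = middle_95(4.8, 0.57)  # responders
non_low, non_high = middle_95(4.2, 0.68)    # non-responders

print(round(resp_low, 2), round(resp_high, 2))  # 3.66 5.94
print(round(non_low, 2), round(non_high, 2))    # 2.84 5.56

# Back on the original scale, the responders' range in viral load units.
print(round(10 ** resp_low), round(10 ** resp_high))
```

Passing `k=1.96` instead of 2 gives the more exact version mentioned in the lecture; the intervals narrow only slightly.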

Then I asked you to estimate the 90th percentile of

the pre-therapy log viral load for those who did not respond to the therapy.

So we want, you know, for that distribution of those who didn't respond,

you know, we know their mean is 4.2.

You have a standard deviation of about 0.68, and

we want to estimate the value that, you know, is greater than 90%

of the values in the distribution, and less than 10%.

So, in order to get this we'd actually have to go to another normal table or

online calculator to figure out how many standard deviations cuts off

90% below it and 10% above it in the normal curve.

And if you actually look this up, and

you can verify I've done this correctly, it's about 1.28 standard deviations.

So if we actually take the mean of our sample of the non-responders.

And add 1.28 sample standard deviations, we will get a reasonable estimate

under the assumption of normality of the 90th percentile for

the log viral load value amongst the non responders.

And so, if we do this, the mean, if you recall, is 4.2.

We're going to add 1.28 times the standard deviation that was 0.68.

Now, if you do out the math, this should be about 5.07 or roughly 5.1.

So this gives us an upper bound.

90% of those who did not respond to therapy had viral loads less than

an estimated 5.1 on the log scale.

The other 10% were greater than that.
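The 90th percentile estimate can be verified the same way. This sketch assumes `statistics.NormalDist` as the "online calculator"; the variable names are illustrative.

```python
from statistics import NormalDist

mean_log_vl = 4.2  # non-responders' mean log10 viral load
sd_log_vl = 0.68   # non-responders' standard deviation

# Number of standard deviations with 90% of a standard normal curve below it.
z_90 = NormalDist().inv_cdf(0.90)  # about 1.28

# Estimated 90th percentile under the assumption of normality.
p90 = mean_log_vl + z_90 * sd_log_vl
print(round(z_90, 2), round(p90, 2))  # about 1.28 and 5.07
```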

Finally I said, well suppose for a moment, that the distribution of

the log viral loads for all 1,000 subjects in the sample are right skewed.

In both the group who responded to therapy and the group who didn't.

So, based on the given results that I gave you,

mean and sd, estimate a range of log viral load values for the middle 95% in

these respective populations those who responded and those that didn't.

And well, if you think this is a trick question, you're absolutely right.

If we don't assume normality, we cannot, as we demonstrated, we cannot necessarily

use the mean and standard deviation to get reasonable estimates of these ranges.

So we'd sort of be out of luck.

We can't use these summary statistics alone.

And if you were working with somebody who had the data,

you could ask them to actually use the computer to give you the actual 2.5th

and 97.5th percentiles in that set of data, but

you can't properly estimate these using only the mean and standard deviation

of these data if they're not approximately described by a normal curve.
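To see why the shortcut fails for skewed data, the sketch below compares mean plus or minus 2 standard deviations with the empirical 2.5th and 97.5th percentiles. The data are simulated, hypothetical right-skewed values (log-normal draws with a fixed seed), not the HIV sample from the lecture, and the variable names are assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed values; NOT the lecture's data.
data = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Normal-theory guess at the middle 95%: mean +/- 2 standard deviations.
normal_low, normal_high = mean - 2 * sd, mean + 2 * sd

# Empirical percentiles: n=40 returns cut points every 2.5%,
# so the first is the 2.5th percentile and the last is the 97.5th.
cuts = statistics.quantiles(data, n=40)
emp_low, emp_high = cuts[0], cuts[-1]

print("mean +/- 2 SD:", round(normal_low, 2), round(normal_high, 2))
print("empirical:    ", round(emp_low, 2), round(emp_high, 2))
```

In this simulation the normal-theory lower bound falls below every observed value, while the empirical percentiles stay inside the range of the data, which is the sense in which the summary statistics alone are not enough.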

Okay, so one last question I'm going to put forward is, you know,

I just want you to start thinking and we've talked about this previously,

we're working our way toward statistical inference and I want you to start thinking

about you know, how would you compare groups based on summary statistics and

we've talked about this before but I just want to put this in your head again.

So I was going to say based only on the sample statistics I gave you for

the responders and non-responders in terms of the log viral load,

the mean and standard deviations, maybe suggest one way or

multiple ways to quantitatively compare log viral distributions for

the responders compared to the non-responders.

And I'll just give you a minute to think about that if you want to pause the video.

See, if there's some one number summary that could get at and

describe something about the difference in these underlying distributions.

Okay, so let's just take a look at this here.

What I'm showing you here are histograms. I didn't show you these before,

but now I'm showing you: these are histograms of the log viral loads.

There's total across these two histograms of about a thousand observations and

they're split into those who responded and

those who didn't and I've put this smooth curve on top.

It sort of takes a weighted average of the heights of the values in

the histograms just to give some sense of what the underlying shape of

the population is that these data come from.

You can see they're,

you know, approximately normal, maybe it's a bit of a stretch.

For those who did not respond there's a slight right tail here.

For those who did respond there's a very small left tail here.

But on the whole the majority of the data is symmetric and

bell shaped around the center.

So if we go ahead and, you know, compare these,

looking at these distributions is very useful because you can sort of see

that the distribution of the log base 10 values,

even though there's a lot of crossover in the individual values,

it sort of shifts up for the responders relative to those who didn't respond.

So is there any way we can capture that positive shift for the responders

compared to the non-responders using single number summaries?

And we could, we could do it in a bevy of ways, but

something we're going to be doing frequently is looking at

the difference in means between the two distributions.

So we looked at the mean, remember the mean for

those who did not respond, which is this top curve, was on the order of 4.2.

Somewhere around there.

And the mean for those who did respond was about 4.8 on this log scale.

So this difference here sort of captures it in a single number. If we

took the mean difference, the responders

minus the non-responders, it's 0.6 on the log scale.

That 0.6 sort of captures that positive shift in the entire distribution.

Sort of we're measuring the centers or some measure of center for

the distributions, how they compare to each other and

you can see that sort of encapsulates, in

a single summary measure, where those distributions lie relative to each other.

We could also do it by comparing the 90th percentile of these distributions or

the 40th percentile, but in many situations,

what you will see done in the literature and

what we're going to use is the difference in means.

And the reason we'll get to in the next set of lectures is that we,

we have a very good understanding of the uncertainty in

our sample mean estimates and can bring that into the story easily.

We don't have as good of an understanding of the uncertainty for

other estimates sometimes, like the percentiles.

So think about this.

If the mean shifts on the order of 0.6, this is on the log scale.

So in some sense, a relative comparison of the viral load values for those

who responded to those who didn't, you know, a relative comparison on the actual

scale, is on the order of 10 to the 0.6th power, which is about equal to 4.

So the typical, in some sense, the typical patient who responded to therapy had

starting viral load values on the order of four times those of the typical patient who didn't.

So.

Just interesting to note that those who tended to be sicker and had higher viral

load values responded better to therapy, at least by these data.
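The step from a difference in mean logs to a ratio on the original scale is a one-line computation, sketched here with illustrative variable names.

```python
mean_log_responders = 4.8
mean_log_nonresponders = 4.2

# Difference in means on the log base 10 scale.
diff_log = mean_log_responders - mean_log_nonresponders  # about 0.6

# A difference of d on the log10 scale is a ratio of 10**d on the original scale.
ratio = 10 ** diff_log
print(round(diff_log, 1), round(ratio, 2))  # 0.6 and about 3.98, roughly 4
```

Strictly, 10 raised to a mean of logs is a geometric mean, so this is a ratio of "typical" values in that sense rather than a ratio of arithmetic means on the original scale.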

So in summary,

what are the big picture items I want you to get out of this set of lectures and

these examples we've done here?

Well.

Given only summary statistics, the mean and standard deviation,

we can estimate percentiles of an approximately normal distribution, or

distributions, using only the estimated mean and standard deviation.

This works well if we're willing to assume, or we have some visual evidence, that the data

distributions we're dealing with are approximately normally distributed.

This does not work so well as we've seen for non normal data.

Okay, so you might say well John you know, if this only works well for

normal data and not, not all data is approximately normally distributed well

why don't you just use the observed percentiles in the data.

And that's fine and

easy to do if you have all the data in a statistical package at your disposal,

you can say well I'm not going to assume anything about the distribution.

Or, if you want to at least check whether the assumption is reasonable,

you can look at the histogram, et cetera.

And, if you don't meet that assumption, you could easily,

in a statistical package, pull up the percentiles of interest, the 2.5th and

97.5th, to get that middle 95% range or any other percentile you want.

But in many situations you're not going to be privy to the entire data set.

So with published data sometimes all we're privy to is an estimated mean and

standard deviation for the distributions of interest.

If we're lucky there'll be a visual display of the main variable of

interest as well, perhaps histograms for the groups we're looking at or

box plots, and if we see evidence in those pictures of rough normality, we can use

the only statistics they provide, the mean and SD, to make some estimates about

percentiles of the distributions, even if they're not shown in the article per se.

But where this is actually going to be very useful and

very interesting is where we get to statistical inference.

The idea of not only estimating things from samples but getting

some understanding of the uncertainty associated with our estimates and using

that to create ranges of possibilities for the truth we were trying to measure.

We're only going to have an estimate of the mean and standard deviation for

an approximately normal distribution that describes some behind the scenes behavior.

But if we have that estimate of the mean, the estimate of the variability.

And we know this distribution that we can't observe directly, is approximately

normal, we've got all the information we need to describe that distribution.

So, onward and upward, as we move through the course.