A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings, and welcome back!

This is going to be our third lecture

in the Statistical Reasoning One series, and today

we're going to talk about a famous, some

might say infamous, distribution called the normal distribution.

Many of you have heard of the normal distribution.

You may even be familiar with some of its key characteristics.

It's bell shaped, it's symmetric around its center.

And the tails die off quickly.

In other words, most of the observations that are described

by a normal distribution

fall close to the center of the distribution.

Now we're going to spend a little time trying to understand the properties of it.

You might say but why are we doing that?

Is it because most data that we'll see

in public health and medicine is normally distributed?

And the answer is no, not necessarily.

We'll see, for some types of data, types of

continuous data, the normal distribution is a reasonable working model.

And we can use its properties to better flesh

out the distribution of the data from the

population from which the sample we have is taken.

But in other situations, these properties that are specific

to the normal curve aren't going to gain us much ground.

However, when we focus on our next unit, statistical estimation of confidence

regions and inference, the normal distribution is going to prove invaluable.

So far, and up to this point, including in this

lecture set, we take our estimates from samples as is, that is, we look at

a sample mean and we say this is

our best estimate of some underlying population truth.

And we know it may not be exactly equal to that unknown underlying truth.

Well, in the next set of lectures after

this one, we're going to get into the idea of, can

we put uncertainty bounds on this estimate to

get a range of possibilities for this unknown truth?

And that's

where the normal distribution is going to be invaluable.

So, what we're going to do, there's three sections, didactic sections

to this lecture and then one set of practice problems.

And what we're going to do is first

define the properties of the normal distribution, and

show how we can define it perfectly just by knowing its center, or its mean,

and the spread of the values under the distribution, the standard deviation.

And there's some general rules about it.

In the second section,

section B, we're going to look at some

data examples where the normal distribution is a

reasonable model for the individual observations in

the population from which our samples are taken.

And show how we can exploit the

properties to better understand the underlying population distribution.

In Section C we're going to see that, well, you know, in many cases the

data we get in public health and medicine is not well described by this perfectly

symmetric, theoretical distribution.

And we'll see that if we actually apply the

properties of the normal distribution to these data,

we're going to end up with useless results.

And it's just to remind you that, you

know, some things are only applicable under certain conditions.

So in this section, this first section A,

we're going to actually define some characteristics of the

normal curve, and hopefully upon completion of the

lecture, you'll be able to actually describe the basic

properties of the normal curve.

Describe how any normal distribution is basically

completely defined by its mean and standard deviation.

Recite what I'll call the 68-95-99.7% rule for

the normal distribution, with regards to standard deviations.

And hopefully feel comfortable, or be on your way

to feeling comfortable, working with standard normal tables.

Now I'll be honest, I'm not going to require

a lot of that in this course, and it's

not a big focus, but it gives you some appreciation

for how quickly observations fall in their likelihood under a

normal curve the further you get away from the center.

Okay.

So let's get started.

The normal distribution is a theoretical probability distribution that is perfectly

symmetric about its mean, and because it's perfectly symmetric about its mean,

the mean, median, and mode are the same, and it has a bell-like shape.

So frequently you also hear it referred to as the bell curve.

Where does this distribution come from?

Who invented it?

Well, the normal distribution is also

called the Gaussian distribution in honor of

its inventor, Carl Friedrich Gauss, and here's a picture of the man himself.

For those of you who either lived in Germany,

or were in Germany pre-euro you may recognize this picture

because Carl Gauss's picture was on the Deutsche Mark.

And so in the US, we're hoping to, you know, make the presidents move over

and get some statisticians on dollar bills, and

maybe you'll see me on the ten someday.

Normal distributions are uniquely defined by two quantities.

If we know data comes from a normal distribution and we want to

completely characterize the distribution of that data, all we

need to know is its mean and standard deviation.

I'll generically represent these with the symbols mu, for the mean, and

sigma, for the standard deviation, to imply population-level values.

There are literally an infinite number of

possible normal curves, one for every possible combination

of the mean and standard deviation.

So here I'm showing some pictures of curves

that have different means and different standard deviations.

You could keep adding to this ad infinitum,

make them wider, skinnier, at different centers.

But you'll notice that these three different

examples I have here, and any other examples

of a normal curve, would all have

the same proportional structure, that is, they're

centered about their mean, and evenly distributed about it.

Okay, so for this next slide, I'm just showing

you this, not to scare you off of math.

Many of you like math and are comfortable with it.

But if you're not comfortable with it, don't worry about this.

I just want to show you sort of the beauty of mathematics, and I get

to have the opportunity to have it over my shoulder, which is always a nice perk.

But I want to show you, for any given value

under a normal curve, the relative frequency of values

near that number, that is, the height of the curve

at that value, is described by this function here.

And this function is sort of a math major's dream in

some sense, it's got all kinds of symbols and notation in it.

Two of the symbols I want to point out are the

pi symbol, which actually represents a constant, a number roughly 3.14.

And also the e, which also represents

a constant, or number, called the natural constant,

roughly 2.718 or so.

So, once we deal with those constants, the only other

two symbols in here are the mu and sigma, and

the only reason I'm showing this equation to you is

to make you appreciate that this curve is completely specified.

We can figure out where a particular value falls under the curve, only by

knowing that value and the mean and

standard deviation of the distribution it comes from.
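The formula on the slide is not reproduced in this transcript. As a minimal sketch, assuming it is the standard normal density, the function below shows that the height of the curve at any value depends only on that value, mu, and sigma. The name `normal_pdf` and the example values (mu = 100, sigma = 15) are illustrative assumptions, not from the lecture. Strictly speaking, for continuous data this function gives a density (the height of the curve) rather than the probability of one exact value.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, given mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative values only (not from the lecture).
mu, sigma = 100.0, 15.0
height_at_mean = normal_pdf(mu, mu, sigma)
height_below = normal_pdf(mu - sigma, mu, sigma)   # one SD below the mean
height_above = normal_pdf(mu + sigma, mu, sigma)   # one SD above the mean

print(height_at_mean, height_below, height_above)
```

The curve is highest at the mean, and the heights one standard deviation below and above the mean are equal, which is the symmetry described in the lecture.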

So again, all normal distributions, regardless of the mean and

standard deviation values, have the same structural properties, mean equals median

equals mode, the values are symmetrically distributed about the mean,

and values closer to the mean are more frequent, or likely,

than values further from the mean.

The entire distribution of values described by a normal distribution,

again, I've said this

before, can be completely specified

by knowing just the mean and standard deviation.

Since all normal distributions have the same structural properties,

we can use a reference distribution called the standard

normal distribution to elaborate on some of these properties.

And we'll define the standard normal distribution in

a minute, and in section B we'll show that

any normal distribution with any mean and standard

deviation can easily be scaled to this reference distribution.

So, here's the first one.

This is just something you are going to have to memorize.

The only characteristics of the normal curve that I

want you to take to heart, and you can

always look these up in the table, but hopefully

you will be able to internalize these pretty quickly.

So, I'm just telling you, and I'll show you where this comes from, but

if I'm dealing with a normal distribution,

regardless of the mean and the standard deviation,

if I'm standing at the mean, in the center, and I go one standard

deviation either direction of that center, I

encapsulate 68% of the observations under that curve.

So this shaded red area here is 68% of the entire curve.

There are several ways to actually state this.

We could say for data whose distribution is approximately normal, 68%

of the observations fall within one standard deviation of the mean.

We could also say the same thing, just

rephrasing it in terms of a probability: the

probability that any randomly selected value is within one

standard deviation of the mean is 0.68 or 68%.

Those are two ways of saying the same thing.

Let's get to the second part of this rule.

This is one you may be familiar with, but 95% of the

observations under a normal curve fall

within two standard deviations of the mean.

Truthfully, it's 1.96. Computers will use that number.

You can look it up in a table, but for quick-and-dirty,

back-of-the-envelope computations,

it is absolutely fine to use two.

So if we're starting at the mean of a normal curve

and we go

two standard deviations above and two standard

deviations below, we'll capture 95% of the curve.

And if we actually go three standard deviations from that

center, we'll capture 99.7% of the observations.

So almost all the values that follow a normal

distribution fall within three standard deviations

of the center of that distribution.
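The three percentages in the rule can be checked numerically. The sketch below assumes only the standard identity that the proportion of a normal distribution within k standard deviations of its mean equals erf(k / sqrt(2)); the function name `proportion_within` is a hypothetical helper. Because the answer is expressed in standard deviations, it does not depend on the particular mean or standard deviation.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 1.96, 2, 3):
    print(f"within {k} SD of the mean: {100 * proportion_within(k):.2f}%")
```

Running this shows why two is a convenient stand-in for 1.96: the two cutoffs give nearly the same middle percentage.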

Okay, so let's just consider this for a moment.

If we say that 95% of the observations fall within two standard deviations

of the mean (again, it's really 1.96, but we'll work with two),

what would that mean about the proportion of observations

that are more than two standard deviations above the mean?

Let's think about this. Can we use the logic

of the normal curve and its symmetry?

We're encapsulating 95% in the middle, this red area.

And the entire area under the curve is 100%.

So that means what we haven't covered in this middle

territory encapsulates a total of 5% of the observations.

Right?

And because the curve is symmetric, that 5%

that we haven't captured in that middle 95%

should distribute itself equally on both sides.

So the proportion of observations that are more than two

standard deviations above the mean is half of 5%, or 2.5%.

Similarly, the proportion of observations under a normal curve that fall more than

two standard deviations below the mean is also 2.5%.

So just to recap what we've done with the number

two standard deviations in reference to the normal curve, we've

said that the middle 95% of values that follow

a normal distribution fall within two standard deviations of the mean.

They fall within the interval from the mean minus two

standard deviations to the mean plus two standard deviations.

If we were randomly sampling

data points from data that followed a normal distribution, the probability

of getting a value in this interval would be 95%.

So what does this mean in terms of percentiles?

Well, let's look at this.

The lower end point here is the point that 2.5% of the values under

the distribution are smaller than, and hence 97.5% are greater than.

So, the 2.5th percentile of the normal curve

is equal to the mean minus two standard deviations.

Conversely, 97.5% of the values are smaller than, and

2.5% are greater than, this upper end point of mu plus two standard deviations.

So, this upper end point is the 97.5th percentile of the normal curve.

If we were to actually line up in order, from smallest to largest,

all values

that follow a normal distribution and actually pick

off, empirically, the 2.5th and 97.5th percentiles,

they would closely correspond to these estimates

based only on the mean and standard deviation.
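The claim about empirical percentiles can be tried out on simulated data. This is a sketch under stated assumptions: the mean of 120 and standard deviation of 10 are made-up values, and the data are simulated from a normal distribution rather than taken from any study. It compares the rule-of-thumb end points (mean plus or minus two standard deviations) with the exact theoretical percentiles and with percentiles picked off a sorted sample.

```python
from statistics import NormalDist

# Hypothetical population values, chosen only for illustration.
mu, sigma = 120.0, 10.0
dist = NormalDist(mu, sigma)

# Rule-of-thumb end points: mean plus or minus two standard deviations.
rule_low, rule_high = mu - 2 * sigma, mu + 2 * sigma

# Exact 2.5th and 97.5th percentiles of the theoretical curve.
exact_low, exact_high = dist.inv_cdf(0.025), dist.inv_cdf(0.975)

# Empirical percentiles: simulate values, sort them, and pick off the cut points.
values = sorted(dist.samples(100_000, seed=1))
emp_low = values[int(0.025 * len(values))]
emp_high = values[int(0.975 * len(values))]

print(f"rule of thumb: {rule_low:.1f} to {rule_high:.1f}")
print(f"exact:         {exact_low:.1f} to {exact_high:.1f}")
print(f"simulated:     {emp_low:.1f} to {emp_high:.1f}")
```

The small gap between the rule-of-thumb and exact end points comes from using two in place of 1.96.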

Let's again look at the 68% part for a minute.

We know that 68% of the observations in the normal

distribution are in the interval within one standard deviation, 68% here.

Okay.

So let's just think about this for a minute.

What percentage of the observations that follow a normal

distribution are more than one standard deviation above the mean?

This can also be phrased: what is the probability that an individual observation

is more

than one standard deviation above the mean of a normal distribution?

Well, what are we talking about here? We're talking about this area here.

Let's just see if we can figure that out using the logic of the normal curve.

Well, we know from the rule I've given you that

68% fall in that red area within one standard deviation.

So the total outside of that red area, on either side, is

100% minus 68%, which is 32%.

But of course, by symmetry of the normal curve, those two tails

here, which in total contain 32% of the distribution, will contain it equally.

So, in each of these tails, roughly 32% divided by 2, or 16%

of the area, or percentage of observations, falls. So 16%

of the observations that follow a normal

distribution are more than one standard deviation above the mean.

If we wanted to look at what percentage of

observations in a normal distribution are more than

one standard deviation away from the mean in either

direction, either above the mean or below the mean,

well, we've sort of already answered that: that would

be the percentage beyond one standard deviation in either direction,

which is that 100% minus 68%, or that 32%.
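The symmetry argument can be checked against a direct calculation. In this sketch the two function names are hypothetical; the first finds the upper tail the way the lecture does (half of what the middle leaves out), and the second reads it straight from the cumulative distribution of a standard normal curve.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def upper_tail_by_symmetry(k):
    """Tail above k SDs, found as half of what the middle leaves out."""
    middle = z.cdf(k) - z.cdf(-k)
    return (1.0 - middle) / 2.0

def upper_tail_direct(k):
    """Tail above k SDs, computed directly from the cumulative distribution."""
    return 1.0 - z.cdf(k)

for k in (1, 2):
    print(k, round(upper_tail_by_symmetry(k), 4), round(upper_tail_direct(k), 4))
```

The two approaches agree. With exactly two standard deviations the tail comes out a little under 2.5%, because the 2.5% figure corresponds to 1.96 rather than 2.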

So where did this rule come from? Did I just make this up?

No. In other words, how did I know these relationships?

Okay, well, it turns out that there are tables that exist for this.

Actually figuring this out would be difficult to do with that formula I

presented to you before, because it would require

integrating over ranges of values.

So it's nice that people have come up with tables

for us to look at, and you might say, well, that's great John, this

rule is useful, but what about other

percentages under the curve for other standard deviation

distances from the mean, you know,

not just one, two, or three?

Well, all the information I quoted and much more

can be found in what's called the standard normal table.

So here is an example of the standard normal table.

This is just maybe the greatest hits of a table just to get you thinking about it.

The tables present themselves in different

ways, depending on where you find one.

And we'll speak more to this in a minute.

But you'll notice I've got three columns here.

Most of them will only include one of these

descriptions, but we've shown already by the logic of the

normal curve, its symmetry, et cetera, that if we're given one piece

of information about a number of standard deviations and an area under the curve,

we can figure out the other bits by employing that logic.

So this table here actually has three columns; most tables

won't be so ornate. But in this first

column, it just shows you what percentage falls within

Z standard deviations of the mean. Z is this column here.

So, for example, if we're looking at one standard deviation, we can

see that 68%, I mean I rounded it in my lecture,

but it's really 68.3%, falls within one standard deviation of the mean.

Another way of saying the same thing, is if we were

to go to one standard deviation above the mean, and

look at the percentage of observations that are greater than that.

It would be, and we've already shown the logic of this, that 16% that we showed before.

And if we were to actually look at the percentage that are outside of the middle

one standard deviation range, it would be that 32% we showed before.

Similarly, you could check this for different numbers.

You could see that for two standard deviations,

well, truthfully it's 95.5% that fall within

two standard deviations, and again, 1.96 is the

cutoff for 95%, but when we work back-of-the-envelope calculations,

you could think of two as the number that cuts off 95%

in the middle and a total of 5% outside of that range.
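A table with the three descriptions discussed here (within, above, and outside a given number of standard deviations) can be generated rather than looked up. This is a sketch, not a reproduction of the slide: the choice of Z values and the helper name `table_row` are assumptions made for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def table_row(k):
    """Return (within k SD, above +k SD, outside +/- k SD) as percentages."""
    within = z.cdf(k) - z.cdf(-k)
    above = 1.0 - z.cdf(k)
    outside = 1.0 - within
    return tuple(round(100 * p, 1) for p in (within, above, outside))

print("   Z  within   above  outside")
for k in (0.5, 1.0, 1.5, 1.96, 2.0, 2.5, 3.0):
    within, above, outside = table_row(k)
    print(f"{k:4.2f}  {within:6.1f}  {above:6.1f}  {outside:7.1f}")
```

Any one of the three columns is enough to recover the other two, which is why most published tables carry only one of them.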

So, you know, you might say well where do I find one

of these standard normal tables in case I need to do this.

Well, reminder, we're in an online course, which means you have access to what?

[LAUGH]

The internet.

And if you type in standard normal table

on the internet, you can get multiple hits.

You can even find calculators where you can plug in a number of

standard deviations of interest and it will tell you something about the curve.

And so I'm just going to show you two examples

of tables just to work the logic of these.

You can also find these in the back of any statistics textbook.

But there are many ways to tell the story of this same curve, and

so you have to pay heed to what a particular table is telling you.

So I went and searched on standard normal tables.

This is one of the hits I got. Here is the URL.

Hopefully it's still working by the time you look

at this lecture, but if not, there are multiple other ones.

Clearly you can't see this on the slide, so I'm going to zoom in a little bit.

Move over to the side and just let me show you

what it's telling you with reference to the values in the table.

You have to pay attention to the fine print. For

any given number of standard deviations, what this table is going to tell us

is the percentage of the observations that fall

between the mean and that standard deviation value.

So it's not telling us about the full range

within that, only part of what we looked at before.

Well, we'll see if we can use that to map to numbers we're comfortable with.

So for example if I go to this table, let's just

see what it tells us about some of the numbers we know.

Let's go with 1.96, just to be exact when we're looking at this table.

So the way to follow this is, you see it's got this column here called Z,

and this goes in tenths, intervals of one tenth of a number.

And then this other column that goes in hundredths.

And the way to piece this together is that, for

the number we're looking for, 1.96, we're going to look

for the value 1.9 in this column here, and then where it

intersects the value of 0.06 in the column over here.

So if we look at this, if we go 1.96, and I'm just going to

circle this and highlight it, the value we're given here for 1.96 is 0.4750.

So let's see if that makes sense.

Remember, 1.96 is the number we say literally cuts off

95% in the middle.

Does this information jibe with what I've told you?

Well, let's see what we're looking at.

What are we looking at with this number?

We are looking at, here's a normal curve, sorry about the

slant there, and it's telling us that within a normal curve,

if we go 1.96, or you can think of it as two, standard deviations above the mean,

positive 1.96 standard deviations, that cuts off 0.475, or 47.5%, of that curve.

Does that jibe with what we've said? Well, let's think about this.

By the symmetry of the normal curve, if we actually go

1.96 standard deviations below the mean, what should that area be?

That should also be 0.475 or 47.5%

and the sum of these two is 0.95 or 95%.

Okay, so if we have this one piece of information

about the upper half, encapsulated by going that far above the

mean, we have the story for the rest of the curve.

And now we can also figure out that, you know, the 5%

remaining is equally distributed between the two tails, so this would be 2.5%.

And you can do this for any value.

And we'll look at some other values in our next lecture set.
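The lookup for 1.96 in a mean-to-Z style table can be reproduced in a few lines. This is a sketch assuming a table that lists the area between the mean and Z standard deviations above it, as described for this table; the helper name `mean_to_z` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def mean_to_z(k):
    """Area between the mean and k standard deviations above it."""
    return z.cdf(k) - 0.5

entry = mean_to_z(1.96)   # the value a mean-to-Z table lists for row 1.9, column 0.06
middle = 2 * entry        # double it, by symmetry, for the middle area
each_tail = 0.5 - entry   # what is left of the upper half of the curve

print(round(entry, 4), round(middle, 4), round(each_tail, 4))
```

The same three lines of arithmetic work for any other Z value read from a table of this kind.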

And then here's another exhibit,

here's another table we got, okay, just by searching the interwebs.

And you'll see what this tells you. You

have to pay attention to what the table's telling you:

for a given standard deviation value, it's

telling you something slightly different than the previous table.

Instead of telling you how much falls between that value

and the mean, it tells you what percentage of the

curve or values are below that number of standard deviations

away from the mean. Okay?

So let's see if we could use this.

So here's a first snippet, but it's still kind of hard to read.

So, let's cut to this, just to give you an example if we were looking at so, five.

From this table here, we have that Z column.

So, this is similar to the previous table, and then the hundredths unit here,

so we can get down to the second decimal place in terms of standard deviations.

So,

I'm just blowing this up here to look, if we

wanted to look at the story of one standard deviation.

I'm just showing you a piece of the table

where the Z column is at negative 1,

because this piece only has negative values in it,

and the hundredths column is 0. So, what is this telling us?

It tells us, if we are under a normal curve, and

here's the mean, and we are at 1 standard deviation below the mean,

then the percentage of observations that are further away

in the negative direction, that is,

more than one standard deviation below the mean,

is 15.87%, or what we'll round to 16%.

And so once we have that, we have

the entire story of one standard deviation, right?

We know by symmetry that if we went one standard deviation

above the mean, we'd also get 16%, so the total area in these two portions is 32%,

which must mean that what's in the middle is 100% minus 32%, or 68%.
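A table of this second kind lists the percentage below a given Z value. As a sketch, assuming that convention, the entry for Z = -1.00 and the step back to the middle 68% look like this (the variable names are illustrative):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

below = z.cdf(-1.00)           # cumulative-table entry for Z = -1.00
above = below                  # mirror-image tail above Z = +1.00, by symmetry
middle = 1.0 - below - above   # what remains within one standard deviation

print(f"{100 * below:.2f}% below, {100 * middle:.2f}% in the middle")
```

Whichever convention a table uses, the symmetry of the curve and the fact that the total area is 100% are enough to convert between them.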

Okay, so let's just think about this for a minute.

What have we covered here?

We've defined the normal curve, showing that it's symmetric and bell shaped.

We've shown that it can be completely defined by knowing its mean

and standard deviation, and that most of the observations

that follow the normal distribution fall

within two standard deviations of the center.

Although the tails go on infinitely, the majority

of the data, 95%, is encapsulated within that range.

We've also gone into looking at how to use a table to find these respective

ranges, and cutoffs, and we'll do some

more examples of that in the subsequent portions.