A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, so in this section we'll actually talk about applying

some of the principles in the normal distribution we've talked

about before, this theoretical distribution, to sample data to estimate

characteristics of the population from which the sample was taken.

And we'll see that if there's evidence in the sample data that

the underlying population is approximately normal

or reasonably described by a normal distribution.

We can get a lot of mileage out of these properties.

So some of the things we'll cover in this lecture: we'll reiterate

the idea of creating ranges containing a certain percentage

of observations in data that follow an approximately normal distribution,

using only the estimates of the mean and standard deviation.

And we're going to focus on the plus or minus two standard deviations to get

the middle 95% of a distribution, but we'll use real data examples to do that.

We're also going to figure out how far any individual data

point under a distribution is from the mean of its distribution in

what we call standardized units.

This isn't unique to a normal distribution but we'll see that it has

special properties, these units have special

properties when dealing with the normal distribution.

And this is sometimes called computing the z-score.

And then we're going to convert

the standardized distances or z-scores to statements

about the relative proportion or probability for

values that have an approximately normal distribution.

In other words, we can translate

these into estimated percentiles for the distribution.

Okay, so let's go back to the

normal distribution, again, it's bell-shaped and symmetric.

It is actually a theoretical probability distribution.

No real data is perfectly described by a normal distribution.

For example, in a true, perfect normal

distribution, even though the majority of the data

is contained within plus or minus two, or

three standard deviations of the center, the range

of values described by this distribution

goes to positive and negative infinity respectively.

Although, things farther away than three standard deviations from

the mean have very low representation on the distribution.

But it is a theoretical distribution, the tails go on forever.

No real data encompasses all possible values.

However, the distributions of some data will

be well approximated by a normal distribution.

You know, if we look at the sample of data and we see it's roughly symmetrical and

bell shaped, that's evidence that this data came

from a population that had a roughly normal distribution.

And in such situations we can use these

properties of the normal curve that we fleshed out.

Coupled with the mean and standard deviation estimates from our

sample to characterize aspects of the underlying population data distribution.

Let's start with an example. Here's a sample

of data and it's a clinical population of males and we've got blood pressure

measurements from a random sample of 113

adult men taken from this clinical population.

Okay, and so this is sample data; we only have 113 observations, and our best estimates

for the underlying characteristics of the population come from our sample estimates.

So, for example, we don't know and we won't know the true population mean

mu but we can estimate it based on the mean of the 113 men we see here.

And the mean of these 113 men is 123.6 millimeters of mercury.

Similarly, we don't know how variable the individual measurements are

from male to male in the entire population of interest.

But we can estimate it based on

how variable the observations are amongst these 113.

And that's our standard deviation estimate of 12.9.

Here's a picture of a histogram showing the distribution of these 113 values.

And you can see, if I actually superimpose a normal

curve on top of that with the same mean

and standard deviation of these data, it's not a

perfect fit, but it's not a particularly bad fit either.

So I'm going to go on the assumption here that these

data come from a population of values that are approximately normal.

Just one more representation

of data to just reinforce another visual that I like.

Here's the box plot showing these data and you can see, well here's the median of the

data which, because the distribution is symmetric,

is very close to the sample mean we have.

And you can see here's the 75th percentile and the 25th percentile and these are

roughly equal distances from that center, as are

the smallest and largest observations.

And that's just evidence

in the box plot of roughly symmetric data.

Okay, so I asked you, let's say using only

the sample mean and standard deviation of these data,

and assuming approximate normality, let's estimate the 2.5th and

97.5th percentiles of systolic blood pressure in this population.

So, again the 2.5th percentile as we defined before we

showed that that's the lower part of the interval going

within plus or minus two standard deviations from the center.

So, we can estimate that by taking the mean of 123.6 in our sample.

And subtracting two estimated standard deviations.

And we get a lower endpoint here of 97.8

millimeters of mercury and that's our estimated 2.5th percentile.

Only 2.5% of the men in the population have blood pressures less than 97.8.

That's what we estimate based

on this sample.

Similarly, we can get the 97.5th percentile by

taking the sample mean and adding two standard deviations.

So, going two standard deviations above that mean,

and we hit the 149.4 millimeters of mercury.

So roughly only 2.5% of males in this population have

blood pressures that exceed this value, that's what we'd estimate.

So we could say,

based on this sample data, we estimate that most, or 95% of the men

in this clinical population, have systolic blood

pressures between 97.8 and 149.4 millimeters of mercury.

I just want you to know, if we were to actually use these data, not apply this

rule from the normal distribution, but actually just line

up the values from smallest to largest

and take the observed

2.5th percentile and 97.5th percentile, the observed 2.5th percentile is 100.7.

It's slightly larger than we'd estimate using the

normal rule. And the observed 97.5th percentile is

151.2 which again is slightly larger than what

we estimate with the normal, but they're very comparable.

So these things align.
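
The arithmetic behind that range can be written out as a short Python sketch. The mean, standard deviation, and observed percentiles are the values quoted in the lecture; the variable names are just illustrative, and the calculation assumes the population is approximately normal.

```python
# Sample estimates quoted in the lecture (systolic blood pressure, mmHg).
sample_mean = 123.6
sample_sd = 12.9

# Under approximate normality, mean +/- 2 SD brackets roughly the middle 95%.
lower = sample_mean - 2 * sample_sd   # estimated 2.5th percentile
upper = sample_mean + 2 * sample_sd   # estimated 97.5th percentile

# Empirical percentiles reported in the lecture, for comparison.
observed_lower, observed_upper = 100.7, 151.2

print(f"normal-rule range: {lower:.1f} to {upper:.1f} mmHg")
print(f"observed range:    {observed_lower} to {observed_upper} mmHg")
```

The two ranges differ by a few millimeters of mercury at each end, which is the sense in which the normal approximation and the observed percentiles are comparable.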

Okay, so suppose, for example, we were working at

the clinic and we want to use the data from our sample

to characterize the population of all such men who come to clinic.

And moving forward, we wanted to use it to

evaluate individuals and how they fall relative to their peers

in this population, to get a sense of whether they

have higher or lower blood pressures relative to the pack.

So, for example, suppose a patient in our clinic comes in and has

a systolic blood pressure measurement of 130, a male in our clinical population.

You might want to estimate what proportion

of men at this clinic have systolic

blood pressure measurements greater than this patient.

Is this guy extreme, or is there a

sizable percentage that have higher blood pressures than he.

So let's think about doing this.

So, we can use the sample mean and standard deviation,

and the assumption of normality, to estimate this proportion.

And really, what we're doing in our curve here is

exploiting the assumed normality. If this is roughly 130 here,

we're trying to estimate this proportion here using the approximate

normality that we've assumed for these data.

So how can we do this?

Well, here's what we're going to do.

Remember, standard deviation defines everything

about relative positioning under a normal curve.

So if we can figure out how far this measurement is

relative to the mean in units of standard deviation, we can

find out how many standard deviations this person is above or

below the sample mean, so how are we going to do this?

Well, it's just a conversion of units really and

I'll speak to that in more detail in a minute.

What we're going to do here is we're going to

take this guy's individual value of 130, and

subtract the mean of everybody in our sample,

which we use to estimate the population mean.

This guy, in units of millimeters of mercury, is 6.4 millimeters

of mercury above the mean of the 113 men in our sample.

And now to understand how far that is,

relative to the spread in the distribution, because we

can't really say whether that's far above or

not until we consider how variable these values are.

We're going to convert it to standard deviations,

so we're going to take that 6.4 millimeters

of mercury and divide by the number of

millimeters of mercury in a single standard deviation.

And we see, if we do this we're going to

get, and I'm going to round here, we get an observation

that is roughly half a standard deviation above the mean in that sample.

Okay.
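
That two-step conversion, subtract the mean and then divide by the standard deviation, can be sketched in Python as follows. The function name `z_score` is only an illustrative choice; the numbers are the sample estimates from the lecture.

```python
def z_score(value, mean, sd):
    """Distance of `value` from `mean`, expressed in standard deviations."""
    return (value - mean) / sd

# Patient with a systolic blood pressure of 130 mmHg.
raw_distance = 130 - 123.6          # about 6.4 mmHg above the sample mean
z = z_score(130, 123.6, 12.9)       # the same distance in SD units

print(f"{raw_distance:.1f} mmHg above the mean, or {z:.2f} SDs")
```

The unrounded value is a little under 0.5, which is why the lecture rounds it to half a standard deviation.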

So now, when we ask the question, what percentage of men

in the population have blood pressures greater than 130 millimeters of

mercury, it's akin to asking what percentage of observations in a

normal curve are more than 0.5 standard deviations above its mean?

So let's see if we can figure

that out. So, if we were using a normal table.

We could use a normal table to do this.

You could use a normal calculator online, et cetera.

Again the thrust of this course is not about looking things up in

tables, but just to add depth to this we'll do it for this example.

So I'm going to go to the first normal table I pointed out in lecture

A, and I'll blow this up a little bit so we can look at it.

We want to look at the story of 0.5 standard deviations.

So if I go to the row where my,

tenths value is 0.5.

And the column where the hundredths value is zero,

that's going to correspond to a standard deviation of 0.5.

And the value I'm given there is 0.1915.

I'm just going to round that to 0.19, or 19%.

Well, what is that telling us though about 0.5 standard deviations in a normal curve.

Well, it's saying, remember we have to read

the fine print or look at the picture.

It's saying,

if we're looking at a normal distribution, and we're at the mean,

and we have an observation that's 0.5 standard deviations above the mean,

or a standardized value of 0.5, then the percentage of observations under that

curve that are between the mean and 0.5 standard deviations above it is 19%.

Okay? Well, that's great.

Right?

But we actually are concerned about figuring out the percentage of

observations that are more than 0.5 standard deviations above that mean.

What we have here is the percentage

within 0.5 standard deviations only above the mean.

So, how could we figure that out?

Well, we can work the symmetry element nicely, right, so the curve is

perfectly symmetric as we know, which means that 50% of the observations are

below the mean and a total of 50% are above the mean. So if this captures

19% of this upper half, that means that the remaining part in this upper half is 31%.

In other words, 31% of observations under a normal curve are

more than 0.50 standard deviations above the mean of that curve.

So, what we've in essence done is we first asked the question what proportion

of men have blood pressures greater than

130 millimeters of mercury in this population.

We've translated that into a statement about the

number of standard deviations this is above that mean,

0.5, and we figured out that the proportion of men who fall in this upper tail is 31%.

Another way to interpret this is an estimated probability.

We could say that the probability that any male

in the population, has a blood pressure measurement greater than

130, that is more than 0.5 standard deviations above the mean, is 31%.

So when you think about this, this

guy whose blood pressure measurement is 130 isn't

particularly unusual relative to the entire distribution:

31% of men have values higher than he.

So ultimately, with this estimate of 31% of the males in the

population having blood pressures greater than 130, what we've estimated

here is essentially the 69th percentile.

If 31% of the males have blood pressures greater than 130,

that means that the remaining 69% have values less than or equal to 130.

And so we have estimated the 69th percentile of this distribution to be 130

millimeters of mercury, using only the mean and standard deviation of our sample.

If we were to do it the more laborious way and line up these

113 values from smallest to largest,

or use the computer to pick it off (I'll

pick off the 70th percentile, to round), the 70th percentile of

the observed values is actually equal to 130.

So here we're seeing this

normal approximation line up nicely with the

empirical percentile and that's because the normality

is a reasonable assumption given these data.
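
The table lookup and the symmetry argument can be reproduced with Python's standard library as a rough check. This is a sketch under the same normality assumption; `statistics.NormalDist` stands in for the printed normal table, and the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

z = 0.5  # the patient's rounded z-score from the lecture

# The lecture's table reports the area between the mean and z.
between_mean_and_z = standard_normal.cdf(z) - 0.5

# Half the curve lies above the mean, so the rest of that half is the upper tail.
upper_tail = 0.5 - between_mean_and_z

# Everything not in the upper tail is at or below the patient's value.
estimated_percentile = 100 * (1 - upper_tail)

print(f"table value: {between_mean_and_z:.4f}")
print(f"upper tail:  {upper_tail:.2f}")
print(f"percentile:  {estimated_percentile:.0f}")
```

Working from the table value rather than calling `cdf` once mirrors the steps in the lecture; a single `1 - standard_normal.cdf(z)` would give the same upper tail.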

So this type of computation we did to convert the systolic blood pressure

of 130 to the number of SDs above the sample mean (or it could

have been below, but it was above in this case)

is sometimes called computing a z score.

And z is sometimes the letter used to refer to a normal curve.

But really there's nothing special about a z score.

Don't let this nomenclature fool you, it's really nothing special, it's

just a measure of the relative distance and direction of a single observation

under a data distribution, relative to the center of the distribution.

And this distance is converted to units of standard deviation.

And really, this is akin to

exchanging currency, or converting distances from kilometers to miles.

Taking something in one set of units,

comparing to a reference point and changing it

to units that you're comfortable with or

that are comparable, for what you're looking at.

So let me give you sort of a silly example, a parallel

example for computing a z score, just to

remind you that there's nothing special about this.

So, suppose you're an American, apartment hunting

in, we'll just say, an unnamed European city.

And so you wish to find you know, we're all at the

school of public health, so you want an apartment within walking distance.

And you consider walking distance, reasonable walking distance to be plus or

minus 1.5 miles of the large organic supermarket in the city, okay.

So, I'm dumbing

down the geography of the city

and available options just for illustrative purposes.

This supermarket is on the main boulevard which runs east

to west and you're only considering apartments on this main boulevard.

So a pretty restrictive example, just to set the stage, okay?

So you're an American, when you were

in school they stopped teaching the metric system

and so, you find out the supermarket is

two kilometers west of the main city square.

Okay, so that's a piece of information in a unit that you don't fully understand.

And then, you're interested in three apartments that you found listed online.

And apartment one, for example, is 6 kilometers west of the same city square.

You go, well, what does this mean for me in walking distance to the supermarket?

Well, what you deductively have to do is compute the

distance between this apartment and the supermarket,

filtered through the reference point of the city square, and then you have

to take that and convert it to a unit that you would understand.

So in the US, our

distance measure that we're comfortable with is in terms of miles.

So this is a picture of what we're looking at so far.

This is very basic city planning [LAUGH] on my part.

Most European cities are much more interestingly laid out than this.

But what we're looking at here is, here's our boulevard, it runs

from west to east, here's the supermarket, and it's two kilometers below.

It's two kilometers west of the city square.

Okay? So here's a picture

of what I was describing so far, and it's kind of unimaginative

city planning on my part but it's just to illustrate a point.

So here we've got, in this picture we've got the city square

here, and then two kilometers west of it we have the supermarket, right?

Okay?

So let's now bring in apartment one.

And we want to figure out how far apartment one is from

the supermarket, but the only information we have about the distance of apartment

one is relative to the same reference point of the city square.

So what are we going to do, how are we going to figure this out?

Well, maybe your first instinct would be to figure out how

far apartment one is from the supermarket in terms of kilometers.

And we can see pretty simply that the raw distance, if you will, in terms of

kilometers: we would take the six kilometers the

apartment is from the city square and subtract

the two kilometers

the supermarket is west of the city square, and

we get a total distance of four kilometers between our apartment and the supermarket.

And that's great.

We've actually subtracted off the same reference

point here, and this kind of subs in

for the mean of a distribution, the city

square's the reference point for comparing things to.

And the question of course, is, well, four kilometers to me,

as an American, I'm not sure what that

means in terms of actual distance so what I'm

going to have to do is convert it to

miles, so how would we convert kilometers to miles?

Well we have to change the units.

This is what we're doing when we convert

things to standard deviation under a normal curve.

One mile's equal to 1.6 kilometers. So four kilometers, the way we

would convert it is, we take the distance in kilometers between our apartment and

the supermarket of four kilometers, and divide it by,

we divide it by that 1.6 kilometers per mile.

So what we get is the distance in miles is 2.5 miles.

And now I can see that this apartment does not fit my

criteria because it's clearly, it's 2.5 miles below, or west, of the supermarket.

It's farther than my desired range of 1.5 miles, so I can rule out,

now that I've made this conversion, I can rule

out this apartment as something I want to look at.
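
To make the parallel with the z-score concrete, here is a small Python sketch of the apartment calculation: subtract a reference position, then divide by the size of the new unit. The function and constant names are hypothetical, the 1.6 kilometers per mile is the rounded factor used in the lecture, and only apartment one is worked through because the lecture gives kilometer positions for that apartment alone.

```python
KM_PER_MILE = 1.6  # rounded conversion factor used in the lecture

def miles_from_supermarket(apartment_km, supermarket_km=2.0):
    """Signed distance from the supermarket in miles.

    Both positions are measured in kilometers west of the city square,
    so a positive result means the apartment is west of the supermarket.
    """
    raw_km = apartment_km - supermarket_km   # like subtracting the mean
    return raw_km / KM_PER_MILE              # like dividing by the SD

apartment_one = miles_from_supermarket(6.0)
within_walking_distance = abs(apartment_one) <= 1.5

print(f"apartment one: {apartment_one:.1f} miles, walkable: {within_walking_distance}")
```

Structurally this is the same operation as the z-score: the supermarket plays the role of the mean, and the number of kilometers in a mile plays the role of the standard deviation.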

If you did the same thing for the other apartments.

You could see things like

where the other apartments fall, relative to the supermarket.

And you could see something like the apartment

number two falls 0.78 miles in the other

direction, and if you do the conversion in

apartment three it's 1.88 miles in the other direction.

So the only thing that would meet our criteria,

after we converted these to standard units, is apartment two.

But, looking at face

values I wouldn't have gotten that because

I don't understand kilometers as a distance.

So in some sense, the z score, this standardized measure of

distance, is like the statistical mile, or you could say, statistical kilometer.

Whatever distance metric you're most comfortable with.

Could say it's the statistical dollar, or the statistical

Euro, the statistical Rupee, whatever currency you're comfortable with.

It's something that we can convert things with different units, on different scales

to comparable units and compare their relative

distances from the center of their distributions.

So, when dealing with data that follow

an approximately normal distribution, these z-scores tell us

everything we need to know about the relative positioning of

individual observations in the distribution of all observations.

They define the percentiles of our distribution.

We can also compute z-scores for data arising from any type of distribution.

And it's not specific to the normal.

However, for data from non-normal distributions, it will still inform us about

the position relative to other data points under the distribution.

But we'll see in section c, with non-normal data, this may

not help us specify or quantify the percentiles of the distribution.

So this mapping of z-scores to

percentiles works well in a normal distribution.

It won't necessarily

hold for others.

Okay, so let's look at another data set, another data example.

We've got data, and we'll become quite familiar

with these data as we go along, too.

These are data on 236 Nepali children

who were one year old at the time of the NNIPS2 study, which is

a study done at Hopkins; it's the

Nepal Nutritional Intervention Project in Sarlahi, Nepal.

And so, this is in

the context of doing a Vitamin A study, but they

looked at children and they took measurements at baseline on the

children, and here we have the 236 from the original sample who were 12 months old.

What we have here is a distribution of the weights of the children in the sample.

And for these 236 children, the sample mean

is 7.1 kilograms, and that's our best estimate

of the true mean for all one year old Nepali children at the time of the study.

And the estimate of the variation

in weights among these one year old children, our estimate of sigma, the true

standard deviation, is obtained by looking at

the variation in these 236 observations we have.

We don't know the true variation of the population because

again we only have 236 observations from that larger population.

But we can estimate it with our sample standard deviation.

And let's just look, look at this distribution here,

and you know, I mean, again, there's some subjectivity here,

but I would say that this is pretty well described

by a bell-shaped symmetric curve whose tails drop off quickly.

And if we fit a normal distribution to these data, with the same mean

and standard deviation of the 236 observations,

you can see it's a pretty good fit.

Slightly higher proportion in the center than

we would expect with a true normal curve.

But again, you know, as approximations go, this is pretty good.

So, let's use only the sample mean and sample standard deviation,

assuming normality or approximate normality,

to estimate a range that contains

weights for the middle 95% of the Nepali children who are 12 months old.

So essentially, a range that goes from the 2.5th percentile to the 97.5th percentile.

And, under the assumption of normality,

we can use our old tricks about two standard deviations and the middle 95%.

So the lower bound for this interval would be

the 2.5th percentile, which we can estimate by hand

by taking the mean of 7.1 kilograms, and subtracting

2 times the standard deviation, which was 1.2 kilograms.

And if we do this, you can verify my math, we get 4.7 kilograms.

And if we do the upper, 97.5th percentile, we

do the same operation but instead of subtracting we add.

We take the 7.1, the mean of all 236

children and add 2 times that standard deviation of 1.2.

And if you do this, the upper bound is 9.5 kilograms

and you can see right here it is in easier-to-read text than my handwriting.

So what

do we say here?

Based on this, this is just an estimate based on

the sample data but we estimate that most of the

Nepali children who are 12 months old at the time

of the study, had weights between 4.7 kilograms and 9.5 kilograms.

Only roughly 5% of the children fell outside that range.

And I just want you to note, so these

are estimates of percentiles based on a normality assumption.

If we were to actually calculate empirically

the percentiles from these data.

Line up the 236 weight measurements from smallest to

largest, and then pick off the 2.5th and 97.5th percentile.

What we get for the 2.5th percentile is 4.4 kilograms, so it

matches pretty closely with what we got using the mean and standard deviation.

And the 97.5th percentile of these observed values is 9.7,

which is very close to what we got with that normality assumption.
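
As a rough check on the plus or minus two standard deviations rule, the sketch below compares it with percentiles read off a fitted normal curve using Python's `statistics.NormalDist`. The mean, standard deviation, and observed percentiles are the lecture's values; the fitted-curve comparison is an addition for illustration, and it relies on the standard fact that the exact multiplier for the middle 95% of a normal curve is about 1.96 rather than 2.

```python
from statistics import NormalDist

mean_kg, sd_kg = 7.1, 1.2  # sample estimates quoted in the lecture

# Rule of thumb from the lecture: mean +/- 2 SD.
rule_lower = mean_kg - 2 * sd_kg
rule_upper = mean_kg + 2 * sd_kg

# The same percentiles taken from a normal curve with this mean and SD.
fitted = NormalDist(mu=mean_kg, sigma=sd_kg)
fitted_lower = fitted.inv_cdf(0.025)
fitted_upper = fitted.inv_cdf(0.975)

# Empirical percentiles reported in the lecture.
observed_lower, observed_upper = 4.4, 9.7

print(f"2-SD rule:    {rule_lower:.1f} to {rule_upper:.1f} kg")
print(f"fitted curve: {fitted_lower:.2f} to {fitted_upper:.2f} kg")
print(f"observed:     {observed_lower} to {observed_upper} kg")
```

All three ranges agree to within a few tenths of a kilogram, which is consistent with treating these weights as approximately normal.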

So, suppose we've got this to sort of

characterize the population of these children at one year old.

We got some information based on these 236

and we've estimated characteristics of the population distribution.

The mean and standard deviation.

And we're assuming normality, or approximate normality of these data.

Suppose a mother comes into the clinic, or comes to a pediatrician, for

the 12 month check-up and wants to

evaluate where her child's weight is relative

to the population of 12 month olds in Nepal.

In other words, she wants the weight percentile for the given age.

And not only will this tell her where the child is relative to

other one year olds, or 12 month olds, in Nepal, but she can

also compare it to where this child fell in the

distributions at other ages, to see if they're growing at the same percentile,

or if they're falling off or increasing on their growth curves.

So her child

is five kilograms in weight.

So how does this child compare in weight to

the weight of all 12 month olds in Nepal?

Well, let's figure this out using the techniques we've talked about.

Okay.

So, again, here's this distribution of sample

data with the theoretical normal curve superimposed.

And what we're trying to figure out here, exploiting

this approximate normality, is this:

if here is roughly five kilograms,

we're trying to figure out what proportion of children in the population

from which these data come have weights lower than five kilograms.

Okay, so how are we going to do this?

Well, if we translate this measurement of

five kilograms to units of standard deviations, we

can find out how this child's weight compares to the mean of all such children.

So this child is five kilograms, let's just

do this computation, the z score if you will.

We're going to take the mean of

7.1 kilograms and subtract that from our observation and then divide

by the units, by the number of kilograms in a standard deviation.

And so what we get here is, when all the dust settles, you know,

this child weighs 2.1 kilograms less than the average weight of children in the sample.

So their distance in kilograms is negative 2.1.

And if we standardize that, we get

a result that is approximately equal to negative 1.75 SD.

That means that this child's weight, when compared to the other

children in the sample, is 1.75 standard deviations below the mean.

That's why it's a negative value.

Okay?

So this question, one way to actually ask, how does this child compare to all others?

One way to talk about this and explain it to

the mother is say, well we can tell you what percentage

of observations in a normal curve are

more than 1.75 standard deviations below its mean.

Which is the technical way of asking what percentage of children in

this population have weights less than five kilograms, okay, let's do this.

Well, again, this is just in

typed text what I did by hand on the previous slide.

Okay.

So now we have to answer the question, what percentage of observations

in a normal curve are more than 1.75 standard deviations below its mean?

Or otherwise, what's the probability of getting an observation

that's more than 1.75 standard deviations below the mean of the curve?

In other words, what percentage have, or what's the

probability of getting, an observation whose

z-score is smaller than negative 1.75?

So again

we're going to use this table, just because

I've adopted this as my favorite, but you could

use other tables which give you slightly different takes

on the same story to get the same answer.

So, now I'm just going to cut to a zoom of the row

where the starting root value for the z-score in this table is 1.7.

And then we'll [UNKNOWN] on the five

hundredths here, and this is the proportion we get here.

This proportion that's shown here is 0.4599, or I'll just round that to 0.46.

Okay, so what is that telling us about 1.75 and the normal curve?

Well, it's telling

us, it's not quite what we want to know.

It's telling us, for this particular table, that

if we're at the mean, the percentage

of values that are contained between the mean

and 1.75 standard deviations above the mean is 46%.

What we really want to know is

what proportion, we're going to look at the other side, we want to know

what proportion or percentage are below, lesser

in value, than negative 1.75 standard deviations.

In other words, more than 1.75 standard deviations below the mean.

Well, how can we figure this out?

Well, let's just work the symmetry of the curve.

We know that, again, the entire upper half

above the mean constitutes 50% of the observations.

46% of them fall between the mean and 1.75 standard devs above.

So this remaining proportion here is 4%. Okay.

And by the symmetry of the normal curve, that means if

4% of the observations are more than 1.75 standard deviations above the mean,

then by symmetry, 4% are also more than 1.75

standard deviations below the mean.

So we've got our value.

That okay?
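
Those steps, the z-score, the table value, and the symmetry argument, can be checked with a short Python sketch. It assumes approximate normality, uses the sample estimates from the lecture, and uses `statistics.NormalDist` in place of the printed table; the variable names are illustrative.

```python
from statistics import NormalDist

mean_kg, sd_kg = 7.1, 1.2   # sample estimates quoted in the lecture
child_kg = 5.0

z = (child_kg - mean_kg) / sd_kg   # negative: the child is below the mean

standard_normal = NormalDist()

# The lecture's table gives the area between the mean and |z| on the upper side.
table_area = standard_normal.cdf(abs(z)) - 0.5
upper_tail = 0.5 - table_area      # more than |z| SDs above the mean
lower_tail = upper_tail            # by symmetry, the same area lies below -|z|

# Direct calculation of the lower tail, as a cross-check on the symmetry step.
direct_lower_tail = standard_normal.cdf(z)

print(f"z = {z:.2f}, table area = {table_area:.4f}, lower tail = {lower_tail:.2f}")
```

The direct calculation and the table-plus-symmetry route give the same lower tail, so either can be used to estimate the child's weight percentile.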

So our answer here, if we were talking to

the mother, would be: well, you know, in this population

we estimate that only 4%

of children have weights lower than your son or daughter.

And that may be fine.

The child may be lower weight and then the mother would just want to compare this

percentile to where the child's weight fell

for other ages leading up to this visit.

Okay.

So in other words, with this estimate we got, that 4% fall

below five kilograms, we've estimated that 4% are below,

which means that the remaining 96% are above this weight.

So this five kilograms is an estimate,

using only the mean and standard deviation estimates,

of the 4th percentile of this weight distribution.

And just for comparison purposes, if I actually went to the entire data set of

236 weights and used the computer to line them up (you know, well the

computer doesn't do this, it just gives me the answer, but essentially it lines them

up from smallest to largest) and picked off the fifth percentile,

the fifth percentile in these data is five kilograms.

I didn't take the fourth percentile, I could

have, but fifth percentile is more commonly quoted.

You can see that aligns very nicely five

kilograms is the fifth percentile of the observed data.

And when we use this estimate based on the mean

and standard deviation, we estimate five kilograms to be the fourth percentile.

So those line up very nicely, which we would expect, given the approximate normal

nature of these data.

Okay, we could also answer a broader question about this child who weighed five

kilograms as well, instead of just focusing

on the proportion that are lower in weight

than this child, or the probability of getting a child of lower weight than this.

We could ask a broader question: what percentage of 12 month

old Nepali children have weights more extreme or unusual than this child?

And here we're going to consider children whose values are farther

than 1.75 standard deviations from the mean, either below, like we looked

at before, but we're also going to bring in the piece above.

So, another way of asking what percentage of

weights are farther from the

mean of all children than five kilograms, in either direction, is asking what percentage of weights are

farther than 1.75 standard deviations from the mean in either direction.

In other words, those whose z scores are less than negative

1.75 or those whose z scores are greater than 1.75.

Sometimes this could be notationally simply expressed by saying the

absolute value of the z score is greater than 1.75.

Again, we could ask this question in terms of probability as well.

What is the probability that a 12 month old Nepali child will have

a weight measurement more than 1.75 standard deviations from the mean of all

such children above or below.

And what we're doing here is we're just looking at this, this area

here like we did before and then also the comparable area up here.

So, we've already seen this 4%, so by symmetry this would be 4%.

And the total proportion of children who have weights that are more

extreme in either direction than the child who's five kilograms is 8%.

So we can now characterize this child as relatively unusual

as compared to the rest of these data.

Only 8% of the children are as far or

farther away than this child in terms of weight.

Okay.
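
The two-sided version of the question, the proportion with an absolute z-score greater than 1.75, can be sketched the same way. The helper name `two_sided_tail` is hypothetical, and the calculation again assumes the weights are approximately normal.

```python
from statistics import NormalDist

def two_sided_tail(z):
    """Proportion of a normal curve lying more than |z| SDs from the mean, on either side."""
    one_tail = NormalDist().cdf(-abs(z))   # area below -|z|
    return 2 * one_tail                    # the area above +|z| is equal by symmetry

child_z = (5.0 - 7.1) / 1.2                # about -1.75, from the lecture's estimates
more_extreme = two_sided_tail(child_z)

print(f"estimated proportion more extreme: {more_extreme:.2f}")
```

Taking the absolute value inside the function means the same call works whether the child is below or above the mean.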

So let's just summarize what we've talked about in this example.

The normal distribution is a completely theoretical probability distribution.

No real data is perfectly described by a perfect normal distribution.

A normal distribution can be completely defined by two characteristics.

Its mean and standard deviation.

Now again, no real world data has a perfect normal distribution.

However, some continuous measures are

reasonably approximated by a normal distribution.

We saw two examples in this lecture set.

And again, there's a subjective element here by looking at histograms,

et cetera and seeing, do we meet roughly the criteria for approximate normality?

But if we do, we can use these properties to estimate

percentiles for the population from which these data come.

When dealing with samples of approximately normally distributed data,

the distribution of sample values will also be approximately normal.

So if we see evidence in the sample of data that, that it's

approximately normal, it's indicative of coming from

a larger population distribution that's approximately normal.

Right?

And we can use the sample mean and standard deviation estimates

of these sample data to create

ranges containing a certain percentage of observations.

Or in other words, estimate the probability that an

observed data point falls within a certain range of values.

[UNKNOWN] all 95% for example.

And we can also figure out how far any individual data point is

from the mean of its distribution in standardized units to compute a z score.

Now, we can actually do that for any type of distribution.

But when dealing with normal distributions, we can convert

these z scores to statements about relative proportions and probabilities,

hence percentiles, for values that have an approximately normal distribution.