0:01

In this video we will introduce you to the normal distribution and

discuss some of its properties, such as the 68–95–99.7% rule.

This notion is going to make sense in

a little bit when you see what we're talking about.

And we're also going to introduce standardized scores, commonly known

as Z-scores, and we're going to give examples of working with the Z-scores to

find probabilities and percentiles under the normal distribution curve.

Many variables in nature are nearly normally distributed.

A commonly used example is heights.

We're going to take a look at a distribution of recorded heights

of members of an online dating website, OkCupid.

0:41

Since members of this website are US residents and

likely represent a random sample from the US population, we

would expect their heights to follow the same distribution as the heights of all Americans.

However, a closer look shows that that's not exactly the case.

In this plot, the light purple curve shows the distribution of heights of US males.

The dotted line represents the distribution of heights reported by males

on OkCupid.

And the dark purple solid line is the implied

distribution of heights of these men, so the men on OkCupid.

We can see that heights reported by men on OkCupid very

nearly follow the expected normal distribution,

except the whole thing is shifted to the right of where it should be.

It appears that males on OkCupid add on average a couple inches to their heights.

8:18

Standardized scores are also useful for identifying unusual observations.

Usually, observations with absolute Z-scores above 2,

that is, observations more than 2 standard deviations below or above the mean,

are considered to be unusual.
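
As a rough sketch of that rule of thumb in R, we could compute Z-scores for a handful of heights and flag the ones beyond 2 standard deviations. The heights here are made up purely for illustration, and the mean and standard deviation are estimated from that same small sample, which is an assumption of this sketch.

```r
# Hypothetical heights in inches; these values are made up for illustration
heights <- c(64, 66, 67, 68, 69, 70, 71, 72, 73, 84)

# Z-score: how many standard deviations each observation is from the mean
z <- (heights - mean(heights)) / sd(heights)

# Keep only the observations whose absolute Z-score is above 2
unusual <- heights[abs(z) > 2]
unusual  # only the 84-inch height is flagged
```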

While we introduce Z-scores within the context of a normal distribution,

note that they're actually defined for distributions of any type.

After all, every distribution will have a mean and a standard deviation,

therefore, for any observation, whatever distribution the random variable follows,

we could calculate a Z-score.

But we're going to talk about why we brought this up within the context

of normal distributions in a moment.

9:12

Percentile is the percentage of observations

that fall below a given data point.

Graphically it's the area below the probability distribution curve,

to the left of that observation.

So why is it that we can only use Z-scores to find percentiles under normal curves, but

not under a distribution of a different shape?

Well, we can always calculate percentiles for any sort of distribution,

but if the distribution does not follow this nice unimodal symmetric normal shape,

you'd need to use calculus for that.

And for the purposes of this course, we're not going to be using calculus, so

therefore we're going to be sticking to normal distributions for

calculating percentiles or areas under the curve.

In this day and age, percentiles are easily calculated using computation.

For example, in R, the function pnorm gives the percentile of an observation,

given the mean and the standard deviation of the distribution.

So pnorm of negative 1, for a distribution with mean 0 and

standard deviation of 1, is about 0.1587.
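
Typed into R, that call might look like the sketch below. The variable name is just one picked for illustration, and the arguments are named here only for readability.

```r
# Area under the standard normal curve to the left of -1
lower_area <- pnorm(-1, mean = 0, sd = 1)
lower_area  # about 0.1587
```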

We can also obtain the same probability using a web applet, so

no need for access to R to use this one.

So let's go to the URL that's on the slide to the web applet and

do a live demo of how we would use the applet to calculate this percentile.

So to use the applet the first thing we do is to select our distribution to be

normal.

We can change our mean as we desire,

but we're going to leave it at 0 since that's the distribution,

the standard normal distribution we're working with for now.

We could also slide our standard deviation around but let's leave that at 1 for

now as well.

And we were interested in the area under the curve below the cutoff

value of negative 1, and we want to pick the lower tail here, and

once again we get to the same answer, 15.9%.

Lastly, we can also avoid computation altogether and

use a normal probability table.

We locate the Z-score on the edges of the table and

grab the associated percentile value given in the center of the table.

So, for a Z-score of negative 1 we look in the negative 1.0 row and

0.00 column for the second decimal and

arrive at the same answer, 0.1587 or roughly 15.9%.
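
To get a feel for where the table entries come from, here is a small sketch in R that rebuilds the negative 1.0 row. The variable names are hypothetical, and rounding to four decimals is an assumption about how such tables are typically printed.

```r
# Second-decimal columns of the table: 0.00, 0.01, ..., 0.09
second_decimal <- seq(0, 0.09, by = 0.01)

# Z-scores in the -1.0 row: -1.00, -1.01, ..., -1.09
z_row <- -(1.0 + second_decimal)

# Each cell is the area under the standard normal curve to the left of that Z-score
table_row <- round(pnorm(z_row), 4)
table_row[1]  # the 0.00 column, about 0.1587
```

Each entry is just pnorm evaluated at the Z-score formed by the row and column labels.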

Obviously, we don't have to keep using all methods here.

We've talked about three different methods: using R, using our web applet, or

using the table.

You're welcome to use whichever you like in your calculations.

While the computation approach is a little less archaic,

the tables are actually very useful for

getting a conceptual understanding of what we mean by area under the curve.

So I encourage you to use the computation or R approaches.

But for the time being as you're learning this material,

also make sure that you get a chance to interact with the tables and

make sure that you sketch out your distributions.

And don't just rely on the numbers that the computer is spitting out at you but

make sure that you confirm them by hand as well.

Let's take a look at a quick example.

We know that SAT scores are distributed normally with mean 1,500 and

standard deviation 300.

We also know that Pam earned an 1,800 on her SAT and

we want to find out what her percentile score is.

As soon as we find out that the distribution is normal, the first thing to do is to

always draw the curve, mark the mean, and shade the area of interest.

Here we have a normal distribution with mean 1,500, and

to find the percentile score associated with an SAT score of 1,800,

we shade the area under the curve below 1,800.

We can do this using R and the pnorm function.

So here, the first argument is the observation of interest.

The second argument is the mean.

And the third argument is the standard deviation,

which spits out an associated percentile of 0.8413,

meaning that Pam scored better than 84.13% of the SAT takers.
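
As a sketch, the call described here would look something like this in R. The variable name is only for illustration.

```r
# Observation first, then the mean, then the standard deviation
pam_percentile <- pnorm(1800, mean = 1500, sd = 300)
pam_percentile  # about 0.8413
```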

13:25

First, we calculate the Z-score as the observation, 1,800, minus the mean,

1,500, divided by the standard deviation, 300. The Z-score is 1.

Remember, we actually saw this before.

Then in the table, we look for the Z-score of 1: the row is 1.0 and the column is 0.00.

And we get the same probability, 0.8413, as the probability of

obtaining a Z-score less than 1, which means the same thing:

the shaded area under the curve below 1,800 is 0.8413.
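
If we wanted to confirm in R that the Z-score route and the original-scale route agree, a minimal sketch could look like this. The variable names are hypothetical.

```r
# Standardize Pam's score: (observation - mean) / standard deviation
z <- (1800 - 1500) / 300

# Area below z on the standard normal curve
area_from_z <- pnorm(z)

# Area below 1,800 on the SAT curve itself
area_direct <- pnorm(1800, mean = 1500, sd = 300)

# The two areas match
all.equal(area_from_z, area_direct)
```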

14:09

Note that both the table and the pnorm function

always yield the area under the curve below the given observation.

If we actually wanted to find out the area above the observation,

we would simply need to take the complement of this value

since the total area under the curve is always 1.

So Pam scored worse than 1 - 0.8413 = 0.1587,

or 15.87%, of the test takers.
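
In R, the complement can be sketched in two equivalent ways: subtracting from 1, or asking pnorm for the upper tail with its lower.tail argument. The variable names are just for illustration.

```r
# Area above 1,800 as the complement of the area below it
upper_by_complement <- 1 - pnorm(1800, mean = 1500, sd = 300)

# The same area, requested directly as the upper tail
upper_by_tail <- pnorm(1800, mean = 1500, sd = 300, lower.tail = FALSE)

upper_by_complement  # about 0.1587
```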

We can also use the same properties of the standard normal distribution,

in other words the distribution of Z-scores to find cutoff values

corresponding to a desired percentile.

Here's an example illustrating this.

A friend of yours tells you that she scored in the top 10% on the SAT.

What is the lowest possible score she could have gotten?

Remember, SAT scores are normally distributed with mean 1,500 and

standard deviation 300.

We're looking for the cutoff value for the top 10% of the distribution.

This is a different problem than the one we worked on earlier,

as this time we don't know the value of the observation of interest.

But we do know, or at least we can get, its percentile score.

Since the total area under the curve is 1, the percentile score associated

with the cutoff value for the top 10% is 1 minus 0.10, or 0.90.

Remember that the formula for the Z-score is (observation - mean) / standard deviation.

And we know the mean, we know the standard deviation.

16:20

We know that this number, 1.28, the Z-score for a percentile of 0.90, is equal to the unknown observation,

which we're calling X here, minus the mean, all divided by the standard deviation.

A little bit of algebra, multiplying both sides by 300 and

adding 1,500, and we find that the cutoff value is 1,884.

So the cutoff value for the top 10%, or the bottom 90%,

of the distribution of SAT scores is 1,884.
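
The same algebra can be sketched in R. Here the Z-score is rounded to two decimals to mimic a table lookup, which is an assumption of this sketch, and the variable names are hypothetical.

```r
# Z-score with 90% of the standard normal curve below it, rounded as in a table
z_cutoff <- round(qnorm(0.90), 2)  # 1.28

# Undo the standardization: X = mean + Z * standard deviation
cutoff <- 1500 + z_cutoff * 300
cutoff  # 1884
```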

In other words, if you have scored above 1,884,

you know that you're in the top 10% of the distribution.

We could also do this using R, and we're going to use the qnorm function this time.

So pnorm for probabilities, qnorm for quantiles or cutoff values,

which takes the percentile as the first input, the mean and the standard

deviation as the second and the third, just like the function we saw earlier.

And the result is the same with either approach, 1,884.
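
As a sketch, the qnorm call described here might look like this in R. The variable name is only for illustration.

```r
# Percentile first, then the mean, then the standard deviation
sat_cutoff <- qnorm(0.90, mean = 1500, sd = 300)
round(sat_cutoff)  # 1884
```

Without rounding the Z-score to 1.28 first, qnorm returns a value slightly above 1,884, which still rounds to the same cutoff.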