0:05

So, how do we go about using these probability functions to characterize

the extent of uncertainty and the distributions

from which our random variables are drawn?

Well, first question we're going to ask ourselves is,

what kind of data are we dealing with?

Is it a context where there's a small number of specific values,

and only those particular values can occur?

Or are we dealing with a range of data, and any value in that range is possible?

That's the case of the discrete versus the continuous.

0:38

Once we've identified the kind of data that we're dealing with,

we're going to choose an appropriate distribution.

And, oftentimes, what we're trying to do is come up with an approximation.

We're not trying to identify exactly the right distribution,

we just want to be able to approximate the shape of the data that we're observing to

a reasonable degree of accuracy.

1:15

And once we've estimated those parameters, we want to go back and say,

here's what I'm predicting, here's what I'm approximating,

using this probability distribution, does it actually fit my data?

Is it a reasonable approximation for the data that I'm working with?

1:32

All right, so,

let's talk a little bit about the characteristics of a normal distribution.

This is the familiar bell curve, it's a symmetric distribution,

there's a mode in the center.

The two parameters that we need are the mean and the standard deviation.

And it's convenient to use this, because we can use Excel, or

other statistical tools, to calculate the average of our data.

And to calculate our standard deviation, and

those are the parameters that are used to characterize that normal distribution.
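
As a minimal sketch of that step outside of Excel, here is the same calculation using Python's standard library. The `demand` list is made-up illustrative data, not data from the lecture, and `stdev` is the sample standard deviation, which plays the role of Excel's STDEV.S.

```python
from statistics import mean, stdev

# Hypothetical observations, invented purely for illustration.
demand = [480, 520, 450, 610, 500, 390, 550]

mu = mean(demand)      # the average, like Excel's AVERAGE
sigma = stdev(demand)  # sample standard deviation, like Excel's STDEV.S

print(f"mean = {mu:.1f}, standard deviation = {sigma:.1f}")
```

Those two numbers are all that is needed to pin down a normal distribution.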

Chances are, you've seen the normal distribution applied previously,

if you've done any work with linear regression.

Whether it's financial analysis, forecasting work,

marketing mix modeling, you're often

making the assumption that the data does come from a normal distribution.

2:18

Linear regression is based on using a normal distribution.

If you've done work on statistics where you're doing sampling,

sampling is based on the normal distribution as well.

So for example, if we're calculating confidence intervals,

that's based on assumptions from the normal distribution.

So, again, chances are you've seen it,

even if you weren't aware of what the normal distribution was,

what the parameters were, what the exact equations are.

2:46

Just to give you a sense for what the normal distribution looks like,

the blue and red curves on this plot are two different normal distributions,

both have a mean of 10, but where they differ is in their standard deviation.

The blue curve has a much tighter distribution, so

a lower standard deviation compared to the red.

And so, you see that it's much more clustered around its average,

whereas the red has a lot more dispersion.

For anyone who's interested,

this is the mathematical equation to produce that normal distribution.

3:23

Ultimately, what the software is doing in the background,

we'll take a look at this when we actually try to fit normal distributions to data,

is, it's looking for the best values of mu and

sigma, looking for the best values to fit the data.

3:41

Right, for the time being we're going to take the values as given, but

that's the mathematical expression underlying this curve.
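
For reference, the standard form of the normal density, written with the mean mu and the standard deviation sigma, is sketched below. The symbol names `f` and `x` are notation assumed here; the slide itself is not reproduced in the transcript.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)
```

A larger sigma spreads the same total area over a wider range, which is why the curve with the higher standard deviation is lower and flatter.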

All right, there's a very nice property of the normal distribution

that relates to dispersion, we can think of it as an empirical

rule where it relates to 68%, 95%, 99.7%,

within 1, 2, and 3 standard deviations of the mean.

That's how much of the data, or that's how much of the distribution,

is contained in a particular range.

So if we're to look at the range, plus or minus 1 standard deviation,

68% of the distribution is contained in that range.

We go out to 2 standard deviations, 95% of the data is contained in that range.

Go out to 3 standard deviations,

99.7% of the data is contained within plus or minus 3 standard deviations.

Doesn't mean that you're never going to see an observation outside of that,

but you've only got a 0.3% chance of seeing that.
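
The rule can be checked numerically. This is a small sketch using `NormalDist` from Python's standard library rather than Excel; the variable names are just illustrative.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Probability of landing within k standard deviations of the mean.
    within = z.cdf(k) - z.cdf(-k)
    print(f"within {k} standard deviation(s): {within:.4f}")
```

The 68, 95, and 99.7 figures are rounded; the 2 standard deviation range, for example, is slightly above 95%.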

All right, this is a little bit of a cheat sheet for you,

when it comes to the normal distribution and Excel.

4:53

All right, so

let's walk through a couple of the commands that we have available to us.

If we know the mean and the standard deviation of the normal distribution,

if I want to know what's the probability of observing an outcome

less than a particular value, so, this shaded region here.

How likely am I to observe a value coming from the normal distribution less than k?

5:19

Well, that's where the command =NORM.DIST is going to be used.

In terms of what you have to input into Excel for =NORM.DIST,

you're going to specify what the value of k is,

you're going to specify the mean and the standard deviation.

And, if I'm interested in the mass that is less than or equal to the value of k,

I'm going to include the statement TRUE.

If I don't include that statement, or I specify FALSE,

what that's giving me is just the height of this function,

and that's not going to be of too much interest to us.
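
For anyone working outside of Excel, a rough Python counterpart of the two modes is sketched here. The mean of 10, standard deviation of 2, and k of 12 are assumed values chosen for illustration; the lecture does not fix them at this point.

```python
from statistics import NormalDist

# Assumed illustrative parameters, not values given in the lecture.
dist = NormalDist(mu=10, sigma=2)
k = 12

# Area to the left of k, like =NORM.DIST(k, mean, sd, TRUE).
prob_below_k = dist.cdf(k)

# Height of the curve at k, like =NORM.DIST(k, mean, sd, FALSE).
height_at_k = dist.pdf(k)

print(f"P(X <= {k}) = {prob_below_k:.4f}")
print(f"height at {k} = {height_at_k:.4f}")
```

The first number is a probability; the second is only the height of the curve and should not be read as one.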

All right, so that's if I'm looking for less than a particular value.

Another way that we might look at it, though, is I want to find that value.

For example, I want to find the value of k, such that,

let's say, this region corresponds to 5%.

Well, if that's the case, I'm going to use the norm inverse command, or

NORM.INV, where I'm going to specify that percentile, the 5% level,

the mean, mu, and the standard deviation, sigma.

And that's going to return the value of k.

So, these two functions are essentially going to be the opposite of each other.

One of them says, you give me the value of k, I'll return for you that probability.

The other says, you give me the probability, I'll return for

you the value of k.
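
The round trip between the two can be sketched in Python, where `inv_cdf` plays the role of NORM.INV. The mean of 10 and standard deviation of 2 are again assumed values used only for illustration.

```python
from statistics import NormalDist

# Assumed illustrative parameters, not values given in the lecture.
dist = NormalDist(mu=10, sigma=2)

# Give the probability, get back k: like =NORM.INV(0.05, mean, sd).
k = dist.inv_cdf(0.05)

# Give k, get back the probability: like =NORM.DIST(k, mean, sd, TRUE).
prob = dist.cdf(k)

print(f"5th percentile k = {k:.2f}")
print(f"probability below k = {prob:.2f}")
```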

6:46

If you're dealing with, what's referred to as, the standard normal distribution,

mean of zero, standard deviation of 1, it's a little bit easier.

We use the standard normal, NORM.S,

for both the distribution command, and the inverse command.

7:04

Another convenient feature,

just to make you aware of it, is the STANDARDIZE command.

If we have data that we want to standardize, to turn it

into a standard normal distribution, what this calculation is going to do for

us is subtract the mean, divide by the standard deviation.

So it's going to standardize this data. By standardizing this data,

if you recall back to any statistics classes you might have taken,

it's going to correspond to those z tables that we

used to have to look up in the back of the statistics textbooks.
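
Here is a minimal sketch of what standardizing does, written in Python. The `standardize` helper is a hypothetical function written for this example to mirror the Excel command, and the numbers 12, 10, and 2 are assumed for illustration.

```python
from statistics import NormalDist

def standardize(x, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd

# Assumed illustrative values: an observation of 12 from a normal
# distribution with mean 10 and standard deviation 2.
z = standardize(12, 10, 2)

# Looking z up on the standard normal (mean 0, standard deviation 1)
# gives the same answer as using the original distribution directly.
prob_standard = NormalDist().cdf(z)
prob_original = NormalDist(mu=10, sigma=2).cdf(12)

print(f"z = {z}")
print(f"{prob_standard:.4f} vs {prob_original:.4f}")
```

The standard normal lookup is what the z table does on paper.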

7:42

All right, so if we're trying to predict demand for a particular flight route,

and it turns out that historically it's followed a normal distribution,

mean of 500, standard deviation of 100.

All right, well, an airline might want to know how likely is it,

that if I allocate 600 seats, that's going to be enough to meet demand?

All right, so,

how likely is it that we've offered enough seats to meet people's demand?

In this case, we're going to use the 68-95-99.7 Rule to come up with this.

All right, so let's think about what this involves doing.

Demand of 600 is 100 units, or

1 standard deviation more than the average.

All right, well the chances of that average being enough,

well that's 50%, because the normal distribution is symmetric.

All right, so there's a 50% chance that demand is less than the mean of 500, so

now we've gotta figure out, how likely is it that demand is between 500 and 600?

Well, that's where the 68-95-99.7 Rule comes into play.

68% of the distribution is contained within 1 standard deviation

of the mean, that's a characteristic of the normal distribution.

So minus 1 standard deviation to plus 1 standard deviation is 68%.

What we want is from 0, the mean, up to 1 standard deviation,

essentially half of that width.

Well, half of the 68% is going to be 34%, so

I've got 50% chance that demand is less than 500, and

I've got a 34% chance that demand is between 500 and 600.

Combined, that's going to tell me 84% chance that demand is

going to be met by offering 600 seats.
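
The same answer can be cross-checked in a few lines of Python. This is a sketch using the mean of 500 and standard deviation of 100 from the example; the variable names are just illustrative.

```python
from statistics import NormalDist

# Demand for the flight route, as described in the example.
demand = NormalDist(mu=500, sigma=100)
seats = 600

# Rule-of-thumb answer: 50% below the mean plus half of the 68%.
rule_of_thumb = 0.50 + 0.68 / 2

# Exact answer from the distribution, like =NORM.DIST(600, 500, 100, TRUE).
exact = demand.cdf(seats)

print(f"rule of thumb: {rule_of_thumb:.2f}")
print(f"exact:         {exact:.4f}")
```

The two agree to about two decimal places, which is as much as the rounded 68% figure can give.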

All right, so again, with the visual to work with,

here's our center point, this is 50% of the distribution.

Between 400 and 600, that's plus or minus 1 standard deviation,

we've got a total of 68%.

So between 500 and 600, this is going to be 34% of the distribution.

10:12

Okay, another way that we can look at this, and

hopefully this is something that we never have to do going forward,

given that we've got Excel available to us, or other statistics programs.

But we could try to look up the z values that correspond to particular things.

The z value of 1 means, how much of the distribution

is contained below 1 standard deviation above

the mean, and it corresponds to that 84%.

All right, so if I were to look, again, the first decimal of

the z score over here is 1.0, second decimal over here, and

that's where I'm just looking up that 84% value from.