0:46

It's very easy to understand the difference between these,

especially if you've played darts before.

After two lots of three darts, that is, my six darts,

I scored a two, three threes, a 12 and a 13.

Now let's see if we can work out the mean, the median and the mode.

1:54

So for my not very good dart playing, scores were two,

three, three, three, 12, 13.

The mean is 2+3+3+3+12+13

divided by 6, 36/6 = 6.

The mode is 3.

The median since we have an even number of observations,

is 3 + 3, the middle two observations,

divided by 2, which equals 3.

Notice that if the dart player had scored, say, 19 instead of 13,

the mean would increase to 7, but the mode and the median would be unchanged.
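
The darts arithmetic above can be checked with a short sketch. It uses Python's standard `statistics` module, which is an assumption of this example rather than something the lecture uses at this point; the variable names are illustrative.

```python
import statistics

# The six dart scores from the example.
scores = [2, 3, 3, 3, 12, 13]

mean_score = statistics.mean(scores)      # (2+3+3+3+12+13) / 6
median_score = statistics.median(scores)  # average of the two middle values, 3 and 3
mode_score = statistics.mode(scores)      # the most frequent value

print(mean_score, median_score, mode_score)

# Replace the 13 with a 19: the mean moves, the median and mode do not.
scores_with_19 = [2, 3, 3, 3, 12, 19]
mean_with_19 = statistics.mean(scores_with_19)
median_with_19 = statistics.median(scores_with_19)
mode_with_19 = statistics.mode(scores_with_19)

print(mean_with_19, median_with_19, mode_with_19)
```

The mean reacts to the single high score because every value enters the sum, while the median and mode depend only on the middle and most common values.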

3:12

So far, when we looked at the shape of the distribution

we identified the mode as the value where the distribution has a peak.

And we saw examples when distributions have one mode,

that is a unimodal distribution, or two modes, a bimodal distribution.

In other words, so far we identified the mode visually from the histogram.

4:41

How would you describe these two distributions of exam scores?

Both distributions are centered at 70.

The mean of both distributions is approximately 70.

But the distributions are really quite different.

The first distribution has much larger variability in

scores compared to the second one.

5:02

In order to describe a distribution, we need to supplement the graphical display,

not only with the measure of centre, but

also with the measure of the variability or spread of the distribution.

5:16

>> There are several ways to describe spread.

A commonly used measure is standard deviation.

The idea behind the standard deviation is to quantify the spread of

the distribution by measuring how far the observations are from their mean.
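
As a rough illustration of that idea, the sketch below compares two sets of exam scores that share a mean of 70 but differ in spread. The scores are hypothetical, invented for this example, and the standard `statistics` module is assumed.

```python
import statistics

# Hypothetical exam scores, invented for illustration; both sets average 70.
wide_scores = [40, 55, 70, 85, 100]
narrow_scores = [66, 68, 70, 72, 74]

wide_mean = statistics.mean(wide_scores)
narrow_mean = statistics.mean(narrow_scores)

# Sample standard deviation: larger when observations sit farther from the mean.
wide_sd = statistics.stdev(wide_scores)
narrow_sd = statistics.stdev(narrow_scores)

print(wide_mean, narrow_mean)
print(wide_sd, narrow_sd)
```

The two means are identical, so only the standard deviation distinguishes the widely scattered scores from the tightly clustered ones.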

5:38

In order to better understand standard deviation,

it would be useful to see an example of how it's calculated.

In practice of course, the software will be doing these calculations for us.

>> Emergency medical services companies

would like to estimate how many ambulance crews to keep on standby.

5:56

Here are the numbers of ambulance calls per hour over an eight-hour period.

To find the standard deviation of the number of hourly calls,

first we would find the mean of our data.

Next we would need to find the deviations from the mean.

That is, the difference between each observation and the mean.

Since our mean is 9 we would subtract 9 from each of our observations.

6:54

>> So why do we take square root?

Note that 16 is the average of the squared deviations and

therefore has different units of measurement.

In this case, 16 is measured in squared ambulance calls,

which cannot be interpreted directly.

Taking the square root, 4, returns us to the original units, ambulance calls.
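
The steps described above can be sketched in a few lines of Python. The hourly call counts here are hypothetical: the transcript does not list the data, so these eight values were invented to reproduce the quoted mean of 9 and average squared deviation of 16. The sketch also assumes the squared deviations are averaged over all n observations; many courses and software packages divide by n - 1 instead.

```python
import math

# Hypothetical hourly call counts (NOT the lecture's data), chosen so that
# the mean is 9 and the average squared deviation is 16.
calls = [5, 5, 5, 5, 13, 13, 13, 13]

mean_calls = sum(calls) / len(calls)           # step 1: the mean
deviations = [x - mean_calls for x in calls]   # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]         # step 3: square each deviation
variance = sum(squared) / len(squared)         # step 4: average the squares (divide by n)
std_dev = math.sqrt(variance)                  # step 5: square root restores the original units

print(mean_calls, variance, std_dev)
```

Squaring stops positive and negative deviations from cancelling, and the final square root converts squared calls back into calls.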

7:23

Recall that the average number of emergency calls in an hour is nine.

The interpretation of standard deviation equal to 4 is that, on average,

the actual number of emergency calls each hour is 4 away from nine.

7:47

>> Since we're working with very large numbers of observations,

hand calculations of standard deviation really aren't feasible.

Python will do all of these calculations for you, but it's important to know how to

calculate standard deviations so you can make sense of your variability.

For example, looking at a variable's distribution in two different samples,

you should be able to tell which has greater variability, that is,

a larger standard deviation.

9:05

This provides a count, mean, standard deviation, minimum and

maximum values and the 25th, 50th and 75th percentile values.

So you can see that describe is extremely useful in better understanding important

characteristics of this cigarettes smoked per month variable.
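
A minimal sketch of what such a call might look like, assuming pandas is installed. The Series name `cigs_per_month` and its values are hypothetical stand-ins, not the course data set.

```python
import pandas as pd

# Hypothetical values standing in for the cigarettes-smoked-per-month variable.
cigs_per_month = pd.Series([30, 150, 300, 600, 900], name="cigs_per_month")

# describe() on a numeric Series returns count, mean, std, min,
# the 25th, 50th and 75th percentiles, and max.
summary = cigs_per_month.describe()
print(summary)
```

The 50th percentile in this output is the median, so `describe` gives a measure of center and a measure of spread in one call.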

9:23

We now know that young adult smokers in our sample

smoke on average 320 cigarettes a month.

Given that the standard deviation is about 274, we can say that, on average,

young adult smokers smoked 320 cigarettes per month,

plus or minus 274 cigarettes.

So as you can see, there's an extremely large range in terms of cigarettes smoked,

and a lot of variability on this variable.

Very similar code can be used to calculate many of these statistics individually or

to generate additional descriptive statistics.

Here's additional code for generating the mean, standard deviation,

minimum and maximum, median, and mode of a quantitative variable.
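
One way the individual statistics could be requested is sketched below with pandas Series methods. The data are hypothetical, and this is not necessarily the code shown in the lecture.

```python
import pandas as pd

# Hypothetical quantitative variable, invented for illustration.
values = pd.Series([30, 150, 300, 300, 600, 900])

mean_value = values.mean()
std_value = values.std()        # sample standard deviation (divides by n - 1)
min_value = values.min()
max_value = values.max()
median_value = values.median()
mode_values = values.mode()     # returns a Series, because ties are possible

print(mean_value, std_value, min_value, max_value, median_value)
print(mode_values.tolist())
```

Note that `mode()` returns every value tied for most frequent, so a bimodal variable would yield two entries.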

10:11

Note that the count for this variable is 1,697 rather than

the size of our sample of young adult smokers which was 1,706.

This is because Python does not include those cases with missing or

NaN data in these calculations.
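
The gap between the count and the sample size can be reproduced on a small hypothetical Series. The sketch assumes pandas and NumPy, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing observations.
values = pd.Series([30, np.nan, 300, 600, np.nan])

n_rows = len(values)          # every row, missing or not
n_valid = values.count()      # non-missing rows only
mean_valid = values.mean()    # NaN values are skipped by default

print(n_rows, n_valid, mean_valid)
```

Comparing `len` with `count` is a quick way to see how many cases were dropped from the calculations.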

10:28

But what if we include a categorical variable when employing the describe

function?

Because we have previously defined TAB12MDX,

our nicotine dependence variable, as categorical,

adding describe syntax provides us with descriptive statistics appropriate for

categorical data.

That is, the count, the number of unique values, the top, or

most frequently occurring, value, and the frequency of that top value.
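
A small sketch of the categorical case, assuming pandas. The 0/1 values are hypothetical stand-ins for TAB12MDX, and converting with `astype("category")` is one possible way to mark the variable as categorical, not necessarily the one used in the course.

```python
import pandas as pd

# Hypothetical 0/1 codes standing in for the nicotine dependence variable.
dependence = pd.Series([1, 0, 1, 1, 0, 1])

# Mark the variable as categorical before describing it.
as_category = dependence.astype("category")
cat_summary = as_category.describe()

# The summary holds count, unique, top (most frequent value) and freq.
print(cat_summary)
```

Here `top` is the most common category and `freq` is how many times it occurs.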

If you had failed to define this variable as categorical,

Python would still generate descriptive statistics.

However, many would not make any sense.

If you'll recall, the nicotine dependence variable is represented with dummy codes.

That is, yes is indicated with a 1 and no is indicated with a 0.

As you can see here we've got a standard deviation based on dummy codes of 1 and 0.

Further, percentiles are listed representing yeses and

nos rather than actual quantities.
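
For contrast, this sketch describes hypothetical 0/1 codes while they are still stored as numbers. It assumes pandas, and the values are invented.

```python
import pandas as pd

# Hypothetical dummy codes left as plain integers (1 = yes, 0 = no).
dependence_codes = pd.Series([1, 0, 1, 1, 0, 1])

# pandas treats the codes as quantities and reports numeric statistics.
numeric_summary = dependence_codes.describe()
print(numeric_summary)
```

The output includes a standard deviation and percentiles that are simply 0s and 1s, which is why the categorical summary is the more sensible choice for this variable.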

So again, it's very important to remember to use the appropriate

descriptive statistics for both quantitative and categorical variables.

For quantitative variables it's best to examine histograms, and

then to supplement these with exact measures of shape, center, and spread.

Categorical variables can often be described well with frequency

distributions or with a bar chart.