Hello. This lesson will introduce descriptive, or summary statistics.

This is an important concept because when you're working with the data,

particularly large data sets,

it's often useful to get a quick feel for how your data is distributed.

The best way to do this is to use descriptive statistics to get

a few measurements that provide you with that information.

Now many of these statistics are likely familiar,

such as the mean, the median, and the variance.

This lesson will focus on calculating and using

those and other related statistics within a Python script.

We'll be using the Introduction to Descriptive Statistics notebook to demonstrate this.

Now we are focusing solely on one dimensional data sets at this time.

In particular you can think of this as

either a numpy one dimensional array or a single column from a pandas data frame.

In either case we want to understand what's the typical value in this data set.

What's the spread around that typical value?

How is the distribution shaped overall?

Is it peaked to one side,

is it angled, slanted,

does it have multiple peaks?

And so we want to find statistics that can give us

measurements of these and other related quantities.

To do this we're going to use a data set that is included as

part of the seaborn Python package.

Seaborn is a visualization library and we'll be learning more about it in future lessons,

but to make the visualizations it includes sample data sets.

This tips data set represents data from people that

visit a restaurant and they may go at dinner time or at lunch.

They go on different days of the week,

the person who pays has different genders,

they have different bills and they give different tips,

and the parties have different numbers of people in them.

So this is a nice simple data set that you can make statistical analyses on.

One thing that we're going to want to do is extract columns from our data frame because

we're only focusing on a single dimension or a single column at the moment.

To do this easily we can simply say,

take the total bill column from

the tips data frame and extract it as a matrix which means it's going to be

a one dimensional numpy array which we demonstrate

by slicing out 10 elements, as shown here.
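
As a rough sketch of that extraction step: the small data frame below is a hypothetical stand-in with made-up values, so the example runs without downloading anything. With the real data you would start from `seaborn.load_dataset("tips")`. Note the swap: the `as_matrix` call has been removed from recent pandas versions, so this sketch uses `to_numpy()` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tips data frame; the values are made up.
tips = pd.DataFrame({
    "total_bill": [12.5, 20.0, 18.25, 30.0, 9.75, 22.0,
                   15.5, 41.0, 27.3, 11.2, 19.9, 24.6],
    "size": [2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4],
})

# Pull a single column out as a one dimensional numpy array.
total_bill = tips["total_bill"].to_numpy()

# Slice out the first 10 elements to check what we extracted.
first_ten = total_bill[:10]
print(total_bill.ndim, first_ten)
```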

The first thing we look at are measures of centrality or location.

These include things like the mean,

the median, and the mode.

We demonstrate how to do this either with pandas,

where we can select a column and compute the mean.

This is actually the arithmetic mean.

There are other means as well including the geometric mean and the harmonic mean.

To use those you actually have to use a different Python library,

the scipy.stats library which includes the geometric and harmonic means.
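
To make the three means concrete, here is a minimal sketch on a tiny made-up array (the values are an assumption chosen so the answers are easy to check by hand). It uses `numpy.mean` for the arithmetic mean and `scipy.stats.gmean` and `scipy.stats.hmean` for the other two.

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 4.0])  # made-up positive values

arithmetic = np.mean(data)     # (1 + 2 + 4) / 3
geometric = stats.gmean(data)  # cube root of 1 * 2 * 4
harmonic = stats.hmean(data)   # 3 / (1/1 + 1/2 + 1/4)

print(arithmetic, geometric, harmonic)
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.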

We also can calculate the median.

The main idea here is that if you've sorted

your data set the very middle element is the median.

If you have an odd number,

it's easy, you actually have a middle value.

But if you have an even number of values you have to make a choice about how to compute it,

because there is no single middle value in your data set.

Typically what's done is,

you take an average of the two values on either side.

So in this case it would be the average of two and three or two point five.

In some cases, however,

you want to restrict the median to data that lie within your data set.

So you would either have to choose to take the low value two,

or take the high value, three and there's different ways to do this in Python.
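
The choice for an even number of values can be sketched by hand. The `median` function and its `kind` argument below are hypothetical names written for illustration, not part of any library.

```python
def median(values, kind="average"):
    """Median of a non-empty sequence; kind is 'average', 'low', or 'high'."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a true middle element.
        return ordered[mid]
    # Even count: two candidates sit on either side of the middle.
    low, high = ordered[mid - 1], ordered[mid]
    if kind == "low":
        return low
    if kind == "high":
        return high
    return (low + high) / 2

odd = [3, 1, 2]
even = [4, 1, 3, 2]
print(median(odd), median(even), median(even, "low"), median(even, "high"))
```

The low and high variants always return a value that actually occurs in the data, while the average variant may not.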

The mode is the most common value in a data set.

We typically see that as a plot where you have

the highest peaked value in your distribution.

We also can compute more robust statistics,

such as the trimmed mean where if our distribution

has values that are far on either side,

or what we call outliers,

we may want to remove those so they don't bias

our measurement, and we can do that with trimmed statistics.

And so we demonstrate this with the scipy.stats module.

We use it to compute the mode and we compute the mean via a trimmed version.

And you can see that the trimmed mean is actually very different than your typical mean,

with the bounds here.
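
Here is a hedged sketch of the same idea on made-up data with one obvious outlier. It assumes `scipy.stats.tmean` with explicit bounds is the kind of trimming meant; the bounds `(0, 50)` are an arbitrary choice for this example. The `np.atleast_1d` wrapper is only there because the shape of the value returned by `scipy.stats.mode` differs between SciPy versions.

```python
import numpy as np
from scipy import stats

# Made-up values; 100.0 plays the role of an outlier.
data = np.array([10.0, 12.0, 12.0, 13.0, 15.0, 100.0])

plain_mean = np.mean(data)

# Keep only values inside the (lower, upper) bounds before averaging.
trimmed_mean = stats.tmean(data, limits=(0, 50))

# Most common value; handled so it works whether .mode is a scalar or an array.
mode_result = stats.mode(data)
most_common = float(np.atleast_1d(mode_result.mode)[0])

print(plain_mean, trimmed_mean, most_common)
```

The single outlier pulls the plain mean well above every other value, while the trimmed mean stays close to the bulk of the data.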

We could also demonstrate this with a Python list where we

can use the built-in statistics module to compute things such as the mean,

the median, and the mode.

Here we are using that built-in library to do that.

But the reason I really like this is because you could apply

the mode from this built-in library to categorical data.

So here we have a list of colors,

red, blue, blue, brown, brown,

brown, and we ask what's the modal color,

or the most common color in the list?

Well it's brown as you can see.

We can also compute the low, high,

and average medians as shown here.
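
A short sketch with the built-in `statistics` module, using a made-up numeric list alongside the color list described above:

```python
import statistics

values = [1, 2, 3, 4]  # made-up numbers with an even count
colors = ["red", "blue", "blue", "brown", "brown", "brown"]

mean_value = statistics.mean(values)
median_avg = statistics.median(values)        # average of the two middle values
median_low = statistics.median_low(values)    # lower of the two middle values
median_high = statistics.median_high(values)  # higher of the two middle values

# mode works on categorical data as well as numbers.
modal_color = statistics.mode(colors)

print(mean_value, median_avg, median_low, median_high, modal_color)
```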

The second thing that we look at is measures of variability.

This includes different techniques to figure out how spread out are the data,

in particular, how spread out are the data around the mean?

Different things that we can calculate are mean deviations.

This has the challenge that sometimes it's positive and sometimes it's negative.

So what we typically do is take an absolute value or we square that difference.

Either way this involves summing up intrinsically positive quantities.

Now, these are the two most important ones because we'll use these all the time.

And in fact the absolute deviation has a special name,

it's related to what's known as the L1-norm,

and the variance has a special name as well,

it's related to the L2-norm.

You will be seeing these when we go into using machine learning later on in the course.

Now one challenge with the variance is that because we've taken a difference and squared it,

any units on our variable are squared.

So for instance if X is measured in length,

the variance will have units length squared which

complicates a comparison of a variance to the actual value itself.

To simplify this, we generally just take

the square root and end up with the standard deviation,

or the root mean square error.
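
To see why the raw deviations are not useful on their own, and how the absolute and squared versions behave, here is a minimal numpy sketch on made-up values. All of these divide by n, the number of data points.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

deviations = data - data.mean()

# Positive and negative deviations cancel, so their sum is zero.
sum_of_deviations = deviations.sum()

# Absolute value: mean absolute deviation, in the same units as the data.
mean_abs_dev = np.mean(np.abs(deviations))

# Squared: the variance, in squared units.
variance = np.mean(deviations ** 2)

# Square root of the variance: back in the original units.
std_dev = np.sqrt(variance)

print(sum_of_deviations, mean_abs_dev, variance, std_dev)
```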

There are some other things that this notebook looks at

as well and you should go through them,

but what we're going to talk about next are measures of a distribution.

If we get a location and a dispersion,

or variation measurement, those are only two numbers.

We might want to have a better idea of the full distribution of the data in our data set.

We can do this by dividing the data into

chunks and seeing how these chunks are distributed.

We can divide the data into four,

that gives us quartiles, into five, that gives us quintiles,

into 10 that gives us deciles or into one percent chunks which are percentiles.

Numpy has a function that does this,

it's called percentile, and we simply pass in which percentiles we want.

So for instance the median is the 50th percentile, the middle of the way through.

We can also compute quartiles,

we can compute quintiles,

and percentiles in this manner,

and that's what's demonstrated.
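
A minimal sketch of `numpy.percentile`, using the integers 0 through 100 as made-up data so that each percentile is easy to read off:

```python
import numpy as np

data = np.arange(101)  # made-up data: 0, 1, ..., 100

# The median is the 50th percentile.
median = np.percentile(data, 50)

# Passing a list returns several percentiles at once.
quartiles = np.percentile(data, [25, 50, 75])
quintiles = np.percentile(data, [20, 40, 60, 80])

print(median, quartiles, quintiles)
```

When a requested percentile falls between two data points, numpy interpolates between them by default.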

Next we can look at weighted statistics.

If we have errors on our attributes or our features

we can actually weight the measurements of

our mean and our standard deviation to account for the fact

that sometimes we have data with higher accuracy or precision,

and we want to weight those such that they dominate the calculation.

And this is what the rest of this notebook here demonstrates.
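
One common way to do this weighting, sketched below, is to weight each value by one over its squared error. That inverse-variance choice is an assumption for illustration, and the values and errors are made up; the notebook may use a different weighting scheme.

```python
import numpy as np

values = np.array([10.0, 20.0])  # made-up measurements
errors = np.array([1.0, 2.0])    # made-up uncertainties on each measurement

# Assumption: inverse-variance weights, so precise points count for more.
weights = 1.0 / errors ** 2

weighted_mean = np.average(values, weights=weights)

# Weighted spread around the weighted mean.
weighted_var = np.average((values - weighted_mean) ** 2, weights=weights)
weighted_std = np.sqrt(weighted_var)

print(weighted_mean, weighted_std)
```

The weighted mean lands much closer to the more precise measurement than the unweighted mean of 15 does.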

Next we move on to specific shape parameters that go beyond the percentile or quantile.

These are things that are known as moments of the distribution.

The first two moments would be the mean and the standard deviation, or variance,

but we can actually calculate others that are

the third and fourth order moments called the skewness and the kurtosis.

And these measure the symmetry with respect to the mean value or

the spread or how peaked the distribution is relative to a normal distribution.
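
As a small sketch of these two shape parameters, `scipy.stats.skew` and `scipy.stats.kurtosis` can be applied to two made-up arrays, one symmetric and one with a long right tail. By default `kurtosis` reports the excess over a normal distribution, so a normal distribution would score zero.

```python
import numpy as np
from scipy import stats

symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up, symmetric
right_tailed = np.array([1.0, 1.0, 1.0, 2.0, 10.0])  # made-up, long right tail

skew_symmetric = stats.skew(symmetric)        # zero for symmetric data
skew_right_tailed = stats.skew(right_tailed)  # positive for a right tail

# Excess kurtosis: negative means flatter than a normal distribution.
kurtosis_symmetric = stats.kurtosis(symmetric)

print(skew_symmetric, skew_right_tailed, kurtosis_symmetric)
```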

Now the last thing I want to talk about is something that can sometimes cause confusion.

Typically you're given a data set and we think about it as a data set on its own.

But in reality what we typically have is a large population and

we've taken some of that data out and we analyze it and we call it a sample.

So for instance, the tips data set is not all of the restaurant's data,

it's a subset of all of the restaurant's data and we're analyzing that.

So we have to be cognizant of the fact that if we're

trying to use measurements of the sample

to infer properties of

the general parent population,

things are a little bit different.

So for instance we calculate the average or mean value

of the parent from the sample in the standard way,

but when we calculate the variance we have to use

the fact that we've effectively reduced by

one the number of data points to actually infer something about the parent population.

So typically with the variance we would divide by n-1 rather than n,

and the same with the standard deviation.

Now we could do this in Python by simply using the ddof parameter and passing it one.

This tells the standard deviation function we want to calculate

this sample variance with the n-1 in the denominator.

Now often, we are using

very large data sets and the difference between dividing by 1,000,000 or

999,999 is minimal and we often can ignore the difference in practice.

But it is important to be aware of if you're dealing with smaller data sets,

say something on the order of 50 or 100.
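
A minimal sketch of the `ddof` parameter on made-up data. Note that numpy defaults to `ddof=0` (divide by n), while the pandas `var` and `std` methods default to `ddof=1`, so the two libraries can give different answers on the same column.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

# Default ddof=0: divide by n.
population_var = np.var(data)

# ddof=1: divide by n-1 to estimate the parent population's variance.
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(population_var, sample_var, sample_std)
```

With only eight points the two variances differ noticeably; with a million points the difference would be negligible.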

So that should give you a pretty good introduction to

the idea of descriptive statistics and how we use them.

If you have any questions let us know in the course forum. Good luck.