0:29

There are two activities in this particular lesson. The first is to

read about techniques to improve a simple visualization by removing chart junk, and

the second is to go through the introduction to data visualization notebook.

So first, let's take a look at this article.

This is a really interesting article that talks about using Edward Tufte's

concept of data-ink, and how we want to reduce the amount of non-data ink

in a visualization to better convey information.

The article has a really neat animated GIF that starts with

a dataset that's presented in one way and goes through removing

non-data ink to more clearly highlight the important concepts,

and so you can see that it goes through, here's the original graphic, and

eventually ends up with something like that.
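
To make the idea concrete, here is a minimal sketch of stripping non-data ink from a Matplotlib bar chart. It is not the article's code; the category labels and values are hypothetical and exist only so there is something to draw.

```python
import matplotlib.pyplot as plt

# Hypothetical category totals, used only to have something to draw.
labels = ["A", "B", "C", "D"]
values = [23, 17, 35, 29]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(labels, values, color="steelblue")

# Remove non-data ink: most of the box around the plot, the tick marks,
# and any grid lines. The bottom spine stays as a baseline for the bars.
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.tick_params(left=False, bottom=False)
ax.grid(False)

ax.set_title("Totals by category")
```

In a notebook the figure renders inline; in a script you would add plt.show() at the end.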

1:19

Now what about making these visualizations in Python?

We're going to look at three techniques, and

several variations on them, for visualizing data.

The first is a rugplot and this is a simple way to look at the distribution of

a one-dimensional data set.

We're going to use the tips dataset that we've been using before, and

we can simply call Seaborn rugplot and we get this plot here.

Now we could do a better job.

We could make this look better

by making a few additions using Matplotlib techniques. For

instance, we can make it a true one-dimensional plot.

We can label the x-axis, we can add a title,

and we can change the color and thickness of lines as necessary.

So, this shows you the distribution of data.

While the average might be out here at say, 25, you can see there's a lot

more of the data down here, so it's a somewhat skewed distribution.

2:11

We can also compare two datasets directly by placing their rugplots next to

each other, and

we're going to do this by making two axes on the same Matplotlib figure.

To do this, we simply say plt.subplots and we say in this case, we want two rows and

one column, and moreover, we're going to have the two rows share the same x-axis.

The reason we do this is that in order to compare two data sets over

the same range, we want that x-axis to be shared in common.

Having done that, here's our two rows and our one column.

Here's our figure size.

We can compare these.

Now, notice what I've done is made lists with two colors and

two titles, so that we can iterate through our two datasets.

I've pulled out the total bill column for lunch and for dinner,

each as an array, and I've appended them to this list.

So our list now has two NumPy arrays: the first element is the total bill for

lunch, and the second element is the total bill for dinner.

We're going to iterate through this list and make a rugplot for each.

We're going to clean the rugplot up as we did before.

These will now be displayed together, and you can see that lunch is more clearly skewed

to lower total bills, while dinner has a much wider spread, and

intuitively, that makes sense, but this visualization really shows that.

If we wanted to, we could add vertical lines showing where the average dinnertime

total bill and the average lunchtime total bill are, and clearly see those differences.

That shows the power of a visualization to convey information.

3:52

The second plot we want to do is called a boxplot.

A boxplot takes the idea of a rugplot, showing a one-dimensional dataset,

but it also provides the quantile information,

as well as outliers, directly in the plot.

So this is probably easier shown than discussed.

So here we go.

Here's our boxplot for the total bill, not separated out at all.

The notch shows the median, the 50th percentile; the box spans the middle 50% of the data; and

these whiskers show effectively the min and max range of the data.

Now the algorithm has a way of identifying outliers, and

it shows those outliers as dots.

So here we can see that there are some really high total bills

that fall beyond the whiskers and are drawn as outliers.
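
To see what the boxplot is summarizing, here is a small sketch that computes the quartiles by hand and applies the 1.5 × IQR rule, which is the default whisker setting (whis=1.5) in Matplotlib and seaborn. Whether the notebook keeps that default is an assumption, and the data is a synthetic stand-in for the total bill column.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic, right-skewed stand-in for the total bill column (an assumption).
rng = np.random.default_rng(0)
total_bill = rng.gamma(shape=4.0, scale=5.0, size=200)

# The box spans Q1 to Q3; the line inside the box is the median.
q1, median, q3 = np.percentile(total_bill, [25, 50, 75])
iqr = q3 - q1

# Default rule: points more than 1.5 * IQR beyond the box are drawn as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = total_bill[(total_bill < low_fence) | (total_bill > high_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, outliers={outliers.size}")

# The corresponding boxplot of the whole column, not split by any category.
fig, ax = plt.subplots(figsize=(10, 2))
sns.boxplot(x=total_bill, ax=ax)
ax.set_xlabel("Total bill ($)")
```

The whiskers stop at the furthest data points that are still inside the two fences, which is why they only effectively show the min and max.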

4:42

So you can see the boxplot conveys the same information as the rugplot,

but very simply shows where the data is concentrated: between roughly

$12.50 and about $24 is where most of the total bills fall.

The distribution is skewed to the right, if you will.

The data extends further to the right of the median than to the left.

That's pretty impressive for a simple visualization.

Now we can do the same thing, but split the total bill by the time, and

this code here shows how to do that, and here we go.

Now we have the lunch and we have the dinner,

and you can see that the median dinner bill is higher than the lunch bill.

The dinner also has a much wider spread than the lunch bill.

This very quickly allows us to compare these two datasets.
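
As a rough sketch of splitting a boxplot by a category, the following passes column names plus the DataFrame to sns.boxplot. The DataFrame is a synthetic stand-in that borrows the tips column names total_bill and time (an assumption so the sketch runs offline), and the horizontal orientation is a guess at how the notebook draws it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in using the tips column names (an assumption);
# in the notebook: tips = sns.load_dataset("tips").
rng = np.random.default_rng(1)
tips = pd.DataFrame({
    "total_bill": np.concatenate([
        rng.gamma(6.0, 2.5, size=70),    # lunch-like bills
        rng.gamma(4.0, 5.5, size=170),   # dinner-like bills
    ]),
    "time": ["Lunch"] * 70 + ["Dinner"] * 170,
})

fig, ax = plt.subplots(figsize=(10, 3))

# One horizontal box per value of the "time" column.
sns.boxplot(x="total_bill", y="time", data=tips, ax=ax)
ax.set_xlabel("Total bill ($)")
ax.set_ylabel("")

# The same comparison in numbers: the median bill for each time of day.
medians = tips.groupby("time")["total_bill"].median()
print(medians)
```

Swapping the x and y arguments turns the boxes on their side.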

5:25

Now sometimes you want to compare not two datasets

but more, and we can do that by turning the boxplot on its side.

Here, we are now breaking out the total bill by day of the week.

So we have Thursday, Friday, Saturday, and Sunday, and you can see that the weekend

has a wider range than the weekdays, and there's a higher total amount as well.

The medians are higher and the maximum is higher as well.

So again, this conveys information, and it was fairly straightforward to make.

We simply had to call boxplot with our data set to the tips

dataset, and say the x-axis is going to be the day column and the y-axis the total bill.

6:08

Now, sometimes you want to see the actual data points, and

the way to do this is with a Seaborn swarmplot.

Let me just show you. This is exactly the same as the previous plot, but

rather than the boxplot, we see the actual data.

What Seaborn does is add jitter to the points: basically, it moves

them a little bit in the x direction so that each one can be seen.

If we didn't add the jitter we would just have a vertical line at each of these

columns and that might be confusing.

We wouldn't see the full range of the data.

So this is a powerful technique to see the actual data.

You can see and compare: there are clumps down here,

this is pretty clumpy right here, etc.

So sometimes swarmplots are a good way to see that intrinsic structure in a dataset.
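
A minimal sketch of the swarmplot by day could look like this. The DataFrame is again a synthetic stand-in with the tips column names day and total_bill (an assumption), and the point size is an illustrative choice. Note that seaborn's swarmplot places points so they do not overlap, rather than shifting them randomly; the related stripplot function is the one that applies random jitter.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in using the tips column names (an assumption);
# in the notebook: tips = sns.load_dataset("tips").
rng = np.random.default_rng(2)
days = ["Thur", "Fri", "Sat", "Sun"]
tips = pd.DataFrame({
    "day": rng.choice(days, size=200),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=200),
})

fig, ax = plt.subplots(figsize=(8, 5))

# One column of points per day; points are spread sideways so each is visible.
sns.swarmplot(x="day", y="total_bill", data=tips, order=days, ax=ax, size=3)
ax.set_xlabel("Day")
ax.set_ylabel("Total bill ($)")
```

Replacing sns.swarmplot with sns.boxplot (and dropping the size argument) gives the boxplot by day shown earlier, since both functions take the same x, y, and data arguments.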

Now the last thing that we want to look at is the histogram.

Histograms are things you've probably seen.

They're somewhat like a bar chart.

To make a histogram, we simply call the histogram or hist method.

Here we're going to be passing in the Total Bill column and

we actually get a histogram out.

Now one thing that you should notice here is that I've passed in this alpha parameter.

You may have seen that before; what it does is affect the transparency of the color.

You should try running these notebooks and changing this to see how that appears.

If alpha is 1, it's fully opaque: darker, more bold.

With a lower alpha it's a little more transparent, a little softer.

We also specify the font size,

which changes the size of the text labels that we have on our plot.
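
Here is a minimal sketch of that histogram call using Matplotlib's hist method with an alpha value and explicit font sizes. The data is a synthetic stand-in for the total bill column, and the specific alpha, color, and font sizes are illustrative assumptions rather than the notebook's values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for the total bill column (an assumption).
rng = np.random.default_rng(0)
total_bill = rng.gamma(shape=4.0, scale=5.0, size=200)

fig, ax = plt.subplots(figsize=(8, 5))

# alpha sets the transparency of the bars: 1 is fully opaque, lower is softer.
counts, bin_edges, patches = ax.hist(total_bill, alpha=0.5, color="steelblue")

# fontsize controls the size of the text labels on the plot.
ax.set_xlabel("Total bill ($)", fontsize=14)
ax.set_ylabel("Count", fontsize=14)
ax.set_title("Histogram of total bill", fontsize=16)
```

Try rerunning it with alpha=1.0 and alpha=0.2 to see the difference.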

The rest of this notebook talks about changing things with histograms,

like the binning and

the range the histogram goes over, as well as

various other techniques for interpreting histograms,

like comparing multiple histograms.
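
As a sketch of those options, the following sets the number of bins and the range explicitly and overlays two histograms on the same axes. The lunch and dinner arrays are synthetic stand-ins (an assumption), and the choice of 20 bins over 0 to 60 is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the lunch and dinner total bills (an assumption).
rng = np.random.default_rng(1)
lunch = rng.gamma(shape=6.0, scale=2.5, size=70)
dinner = rng.gamma(shape=4.0, scale=5.5, size=170)

# Use the same binning and range for both so the bars line up.
n_bins = 20
value_range = (0, 60)

fig, ax = plt.subplots(figsize=(8, 5))
lunch_counts, edges, _ = ax.hist(lunch, bins=n_bins, range=value_range,
                                 alpha=0.5, label="Lunch")
dinner_counts, _, _ = ax.hist(dinner, bins=n_bins, range=value_range,
                              alpha=0.5, label="Dinner")

ax.set_xlabel("Total bill ($)")
ax.set_ylabel("Count")
ax.legend()
```

Values that fall outside the given range are left out of the counts, and the partial transparency from alpha is what lets both histograms stay visible where they overlap.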

I encourage you to go through, and look at these, and try to understand

how to make your own histograms, as well as boxplots, swarmplots, and rugplots.

If you have any questions about histograms or

the other techniques you've seen for visualizing a one-dimensional dataset,

please let us know in the course forum, and good luck.

[SOUND]