0:10

So in this lecture we're gonna have

kind of a very different sort of lecture than what we've been doing for

most of the class and we're just gonna talk about plotting your data.

And the reason I held off on talking about plotting your data until now,

even though it may well be the most important aspect of data analysis,

is because I wanted to demonstrate that some of these plots are estimators;

they're multivariate estimators, but they estimate things like densities.

And we couldn't really introduce plotting to estimate densities unless everyone knew

what a density was.

So, let's get right to it.

0:47

One of the most well-known forms of plot is the simple histogram.

Histograms just display a sample estimate of the density or

mass function, and they're basically just bar graphs of the frequency or

proportion of times that a variable takes specific values,

or bins of values for continuous data.

So it's probably easier to explain this with examples than with words.

So the data set islands in R contains

the areas of all land masses in thousands of square miles, and

you can load it into R by just typing data(islands).

1:26

And you can view the data by typing islands and

it'll just show you the list of numbers.

And we can create a histogram with the command hist(islands), and

if you just do ?hist, it'll give you the options for the hist command,

which include things like bin breaks and how you break up the histogram.
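As a quick sketch, the commands just described are:

```r
# Load the built-in islands data set: areas of the world's major
# land masses, in thousands of square miles.
data(islands)

# A basic histogram; see ?hist for options such as `breaks`,
# which controls how the bins are constructed.
hist(islands)
```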

In the picture we see that we had 41 islands in this range of area.

You can see this is a very crummy histogram.

It doesn't tell you much information.

And the reason is because most of the islands are really small.

2:23

And there's only a handful of big ones.

And so maybe there's something more informative that we can do.

But let's talk about the pros and cons of histograms first.

So histograms are useful.

They're easy.

They make sense.

They work on discrete and even unordered data.

You can make a histogram of M&M colors or hair colors, or whatever.

They're just bar plots of frequency.

There are some problems with them: they use a lot of ink and space to display

not that much information;

you could replace one with a table of the frequencies pretty easily.

And it's a little bit difficult to compare several at a time.

And then, as I pointed out before, this specific use of the histogram for

this data set isn't very good.

You should maybe log the data to spread things out a little bit.

In this case, I did log base ten.

So it gives you orders of magnitude.

And then if you look, the histogram's much better.

And so the numbers now on the horizontal axis are log tens.

So you're looking at orders of magnitude, which is probably a lot better to look at.
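A sketch of that logged version (using, as stated, a base-10 log):

```r
data(islands)
# Taking log base 10 spreads out the many small land masses;
# the horizontal axis is now in orders of magnitude.
hist(log10(islands), xlab = "log10(area)")
```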

3:25

Stem and leaf plots are another way of making a histogram really quickly, on the fly:

if you ever need to create a histogram and all you have is a pen and

a piece of paper, then a stem and leaf plot is definitely what you want to do.

And it was created by John Tukey,

who was a famous statistician that created lots of inventions.

He was one of the co-inventors of the fast Fourier transform and

many, many other statistical and signal processing techniques.

With a stem and leaf plot, basically what you do is

you pick a digit that you're going to kind of break the data on.

4:03

And then you put the digits to the right of that, you stack them up.

So here it's probably easier to just show this than to describe it, so

I type in stem(log10(islands)), and the decimal point is at the vertical line

here, which in this case is a column of pipe characters.

So when we look at the 1 on the left, a 1 on the right means that's 1.1:

one island had land area ten to the 1.1,

because remember we took log base 10.

And then we can count; let's see how many there are.

There's one, two, three, four, five, six ones.

So there were six islands that had log area 1.1, and then because there were so

many in those bins, they broke the 1 bin up into those below five

and those above five.

So at any rate, you can see how you can do this really quickly.

You just pick the decimal place, and then the rounded number immediately

after the decimal place, you just start stacking them up.

And then they wouldn't be in order so then maybe you'd do it again where you

reshuffle the numbers on the right so that they're in order.

It makes for a very convenient plot.

It gives you a quick histogram.

It's a quick density estimate, and you can do it on the fly really easily.
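The stem and leaf plot described here is one line of R:

```r
data(islands)
# Each stem (left of the |) is the leading digit of log10(area);
# each leaf (right of the |) is the next rounded digit, stacked up.
stem(log10(islands))
```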

5:14

Another useful plot is a dotchart.

Dotcharts just display the entire data set one dot per point.

And you might say, well, that seems like a pretty uninformative plot, but

it's usually quite informative, especially if you can order the dots well.

So the ordering of the dots and

the labeling of the axes can display a lot of information.

And dotcharts show the entire data set, so you could in

principle reconstruct the data set from a dotchart.

So it has really high data density, but there are problems with them.

They may be difficult to construct and difficult to interpret for

data sets with lots of points;

you may get overplotting and things like that.

So if you look at this dot chart where I did the dot chart for the log10 area,

that's log10 square miles, you see it just plots all of the data.

It's the same data as before, but it gives you a good idea of the density of the points.

And you could really improve on this chart, for

example, by maybe grouping land masses together.

Pacific Islands, maybe put all of those together, for example.

And so on, so you could get a lot more information out of it.

But at any rate, that's a dot chart there.
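A minimal sketch of that dotchart (labels come from the names of the vector; the lecture's exact styling may differ):

```r
data(islands)
# One dot per land mass, labeled by name; the default ordering
# follows the vector (alphabetical here).
dotchart(log10(islands), xlab = "log10(area)")
```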

Good to make.

So yeah, just on the next slide I mentioned that I ordered everything

alphabetically, which is the default for the dotchart,

but you can play around with it.

Playing around with plots is the key thing you want to do;

you want to just keep doing them until you get something informative.

7:18

And you can obtain this data set with the command data(InsectSprays) and

then I give you the code for actually creating this plot there.

Maybe some of the fancier parts of the plots I omitted here, but

this is the basic gist of how I created the plot.

If you can generate better code to do this, I'd love to see it.
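The lecture's exact code isn't reproduced here, but a simple per-group dot display in the same spirit might look like this (stripchart with jitter is one assumed way to draw it, and the mean markers are an illustrative addition):

```r
data(InsectSprays)
# One point per observation, grouped by spray; jitter reduces
# overplotting when counts repeat within a group.
stripchart(count ~ spray, data = InsectSprays,
           vertical = TRUE, method = "jitter", pch = 16,
           xlab = "Spray", ylab = "Insect count")

# Mark each group's mean with a horizontal dash.
means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
points(seq_along(means), means, pch = "-", cex = 3)
```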

But anyway, then you look at the plot on the next page and

the nice thing about these plots is that it displays every single data point.

You get a very good sense of what's going on.

You can see that sprays C, D, and E appear to be different from sprays A,

B, and F.

For example, you have the confidence intervals for

each of the groups and you have the mean for each of the groups, and

they display a lot of information.

For example, for group F, you can see this kind of bimodality in the distribution.

You can see this, sort of, outlier in group D, or maybe it's not an outlier but

nonetheless, you can visualize these things quite easily with a dot chart.

And because every group only has, you know, a handful of points

it would be a shame to aggregate them into something else that obscured the data.

8:23

But you might ask yourself: what if, instead of having ten or 15 or

20 points per group, I had 10,000 points per group?

What should I do?

Well, then maybe don't do a dot chart anymore, because you're gonna get so

much over-plotting that you're not gonna be able to see anything meaningful.

Maybe do something like a box plot.

So box plots were also invented by Tukey and

they basically just show the distribution in terms of quantiles.

So the center line of the boxes in a box plot represents the median,

while the box edges correspond to the quartiles.

So the boxes give you some information about the density,

represented there by only three numbers.

And then the whiskers extend out to a constant times the interquartile range,

which is the difference between the 75th and 25th percentiles of the data,

or they're capped off at the most extreme data point.

And that constant, Tukey chose by kind of relating it to the standard normal.

And then sometimes outliers are denoted by points beyond the so-called whiskers.

And you can see skewness in the data by the centerline being near one

edge of the box or one of the other edges.

So suppose we just take the same insect spray data and do a box plot; for this

data set you probably don't wanna do that, but you do get the same relative picture.
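That box plot is a one-liner:

```r
data(InsectSprays)
# Same data, now summarized per group by median, quartiles,
# and whiskers.
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray", ylab = "Insect count")
```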

You've lost some detail.

You've lost information about that bimodality in the density for

insect spray F, but you do notice maybe something's going on there:

that distribution's very skewed towards higher values, so

you might investigate it more and discover that kind of little two-group clustering.

10:03

It does kinda catch these outliers for C and D but sometimes you have so

many data points that the outliers are just a constant mash of outliers and

then there's no point in plotting them.

But at any rate, this is a box plot.

I think for this particular data set you're better off doing the dot chart, but

if you had lots and

lots of observations a box plot seems like a pretty reasonable thing to do.

There have been improvements on box plots;

people say, why do it as a box when I could do it as something that uses a lot less ink?

And what's this constant times the IQR business?

That's maybe a little bit difficult to interpret.

And so there have been lots of refinements, but

ultimately these are your plots when you create them, so

you can use them to investigate the things you really wanna look at.

But it's a very reasonable idea to create in this case vertical

summaries of group data that are based on distributional properties, like means,

medians, quartiles and so on.

If your boxes get too squished try logging your data if it's positive.

Or maybe you could do a cube root if it's positive and negative.

11:07

For data with lots and lots of observations,

you often want to omit the outliers, because you get this big mash of

black from overplotting in which you can't see individual points.

And there's no point in calling them outliers anymore if there are

hundreds of them.

Here's an example of a bad box plot.

I just give you some R code to generate a bad box plot, right.
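The slide's exact code isn't reproduced here, but a hypothetical example that produces the same pathology is:

```r
# Hypothetical example (not the slide's exact code): heavy-tailed
# data with many points gives a squished box and a mash of "outliers".
set.seed(1)
x <- exp(rnorm(10000))
boxplot(x)

# Two possible fixes: log the (positive) data, or suppress
# the outlier points entirely.
boxplot(log(x), outline = FALSE)
```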

It's all squished together by the outliers.

There are too many outliers being displayed, and the fact that there are so

many outliers means that all the interesting aspects of the

meat of the data are obscured by a handful of outliers.

So that's box plots.

So box plots kind of give you a density estimate by grabbing a bunch of

quantiles from the data.

Kernel density estimates, on the other hand, are direct density estimates in

the same way histograms are direct density estimates.

But kernel density estimates maybe are a little bit better.

And the idea is that you're weighting observations according to a kernel,

in most cases a Gaussian density, and

then you have to pick a parameter, called the bandwidth,

that determines how smooth or jiggly your density estimate is going to be.

And your density estimate is itself a statistical estimate.

It has variability that you should probably investigate as well.

And you should investigate how the bandwidth impacts that variability and

the estimate itself.

But it's not like this is something that's just unique to kernel density estimates.

For example, if you take a histogram, the width and

construction of the bins in a histogram play the same role as the bandwidth, so

you still have that tuning parameter you have to work with.

And again,

the width and the number of bins in a histogram can impact what it looks like.

But in addition, a histogram's also an estimate with noise, so

both kernel density estimates and histograms, and so on,

they all are statistical estimates that have variation and

it's maybe unfortunate that I'm gonna do this as well that, when you plot these

things, you don't explicitly acknowledge the uncertainty in the density estimation.

So that is kind of a problem.

But maybe the solutions to it are a little bit above the discussion for this class.

So anyway, the R function density can be used to create a density estimate.

So here are the waiting times and eruption durations, in minutes,

for the Old Faithful geyser in Yellowstone National Park.

You can grab this data by just doing data(faithful), and

d is the density estimate and

this bandwidth parameter here gives a specific rule for selecting the bandwidth

of the density estimate, and then the plot creates the plot.
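A sketch of those commands; the bandwidth rule here is an assumption (the lecture's exact choice may differ, and ?density lists the available selectors):

```r
data(faithful)
# Kernel density estimate of eruption durations; bw = "SJ" picks
# the Sheather-Jones bandwidth, one common selection rule.
d <- density(faithful$eruptions, bw = "SJ")
plot(d, main = "Old Faithful eruption durations")
```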

So there's our density estimate, and

it actually gives you the specific bandwidth it used.

And you can see there's an incredibly obvious feature in this data set at

around 4.5 minutes, let's say: the eruptions seem to occur in two clusters of durations.

But you also get a sense of the variation around those eruption times as well.

So anyways, kernel density estimates are a nice way to estimate a density and

maybe are an improvement over a histogram, I think by smoothing out the data.

14:13

Here's another example.

I took an MRI, took a single axial slice, and then disregarded the spatial locations.

So I just have a bunch of intensities that are on a gray scale.

So here's the image.

You can see this is an axial slice.

Here's the ventricles.

Here is gray matter.

Here is white matter.

Here's the skull.

And if you were to take this collection of numbers, the intensities,

and disregard where in the image they are,

just treat them as a list of numbers and put that into a kernel density estimate,

you might get something like this, where you can see

lots of background voxels, a kind of hump for the

gray matter voxels, a hump for the white matter voxels, and so on.

14:58

And this is a pretty common technique.

In fact, it's so common that it's built into your camera, right?

Your digital camera,

if you have one, often will have a histogram estimate built into it.

You can actually look at the image histogram.

And if you don't do it on a digital camera,

certainly whatever image processing software you have

will actually give you a histogram of the intensity values from the image.

And this is exactly what they're doing.

And if they do a kernel density estimate, they're smoothing out that histogram;

if they do boxes, then they're just discretizing the histogram.

Quantile plots are extremely useful for

comparing a distribution to a theoretical distribution.

So a great example of this is if you want to suggest that your data is normally

distributed, you might want to compare your empirical quantiles from your

data to the theoretical quantiles of a normal distribution.

So if there is a significant departure from a line,

then that's going to tell you that the quantiles of your empirical data

don't look like the quantiles from a theoretical normal distribution.

Then it's a useful diagnostic tool.

And the reason it's useful is that, while

you could do a histogram plot of your data and compare it by overlaying it, say,

on a standard normal density,

QQ-plots kind of focus

exactly in on the comparison between the two distributions,

quantile by quantile.

And they really tend to highlight the differences much more effectively than,

say, overlaying two histograms.

With overlaid histograms it's kind of hard to tell.

But here's why you want to check for whether or not it's a line.

So let's let x_p be the pth quantile from a normal mu, sigma squared

distribution, a nonstandard normal.

Then, by definition, the probability that X is less than or equal to x sub p is p.

17:05

So we've just basically converted the X random variable to a Z random variable.

So then you can go back and forth between the quantiles:

the x quantile, x sub p, is mu plus the z quantile, z sub p, times sigma.

So, again, this should not be news to you: you can convert between

nonstandard quantiles and standard quantiles by either standardizing

the nonstandard quantiles or by doing mu plus sigma times the standard quantiles.
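In symbols, the standardization argument just described is:

$$
p = P(X \le x_p) = P\!\left(\frac{X-\mu}{\sigma} \le \frac{x_p-\mu}{\sigma}\right)
\;\Longrightarrow\; z_p = \frac{x_p-\mu}{\sigma}
\;\Longrightarrow\; x_p = \mu + \sigma z_p .
$$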

17:36

So at any rate, the result is that quantiles from any nonstandard normal distribution

should be linearly related to standard normal quantiles.

So what a normal QQ-plot does, for example, is it plots the empirical

quantiles of your data versus theoretical standard normal quantiles.

And in R, qqnorm does a normal QQ-plot, and then qqplot basically plots your

empirical quantiles versus the quantiles of another data set.

And here's an example of a normal Q-Q Plot.

And in this plot it's basically saying that at the high end your sample quantiles

are too large, and at the low end your sample quantiles are too small, too negative.

So in this case it means your data is heavier-tailed than a standard normal:

it has excessively large upper quantiles and

excessively negative lower quantiles.

In the next example, your upper quantiles are too large,

right, and your lower quantiles are all smooshed up near zero, and that

would be indicative of an instance where your data follows some

right-skewed distribution rather than a normal distribution.

I think in this case to generate this plot,

I used a gamma and compared it to a standard normal.

And then here's an example where I generated data from a normal distribution

and plotted the quantile-quantile plot versus the actual normal distribution,

and of course, it looks pretty good.
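A sketch of both situations; the gamma parameters here are an assumption, not the lecture's exact simulation:

```r
set.seed(1)

# Right-skewed data vs. normal quantiles: departs from the line,
# with lower quantiles bunched up and upper quantiles too large.
y <- rgamma(200, shape = 2)  # hypothetical skewed sample
qqnorm(y); qqline(y)

# Normal data vs. normal quantiles: roughly a straight line.
x <- rnorm(200)
qqnorm(x); qqline(x)
```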

Again, with the QQ-plot, the theoretical quantiles are,

of course, exactly right, but the sample quantiles are measured with noise, so

the normal QQ-plot again doesn't account for the uncertainty

in estimating those quantiles; really, these QQ-plots should

maybe have some grey lines to indicate the uncertainty around the plot itself.

20:01

Now suppose you wanted some plot of the bivariate distribution of

two discrete random variables.

Well, here's Fisher's data on hair and eye color, right?

So we want to talk about the distribution of hair and

eye color when Fisher was looking at people from a particular area.

And here's the contingency table down here, and

you see the different hair and eye colors.

And one plot that you could do to look at this is the so-called mosaic plot.

20:31

A mosaic plot just sort of breaks everything up into squares and rectangles

where the size of each rectangle represents the size of the count, and

so it gives you a pretty immediate way to look at

the bivariate distribution of the two variables.

It's sort of like getting at a two-dimensional bar chart.

It's a quick display.
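R's built-in HairEyeColor table (Snee's data, similar in structure to the hair and eye color table discussed here, though not Fisher's exact data) gives a quick way to try this:

```r
# Collapse over sex to get a hair-by-eye contingency table, then
# draw the mosaic plot: rectangle areas are proportional to counts.
tab <- margin.table(HairEyeColor, c(1, 2))
mosaicplot(tab, main = "Hair and eye color", color = TRUE)
```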

You know, another thing you could do perhaps is some sort of 3D bar chart,

but what's nice about this is that it doesn't require extending out to a third

dimension, and it gives you a quick way to look: you can see the low counts for

red hair, for example, across all eye colors.

And you can see really quickly and obviously that,

by and large, some hair colors are very consistent across eye colors,

while fair hair seems to change with eye color quite a bit.

So you can pick out these patterns really quickly.

And so I'm going to say mosaic plots are nice,

if maybe a little underused, techniques in plotting.

21:35

So that was a whirlwind tour of some basic plotting techniques that you can use and

hopefully will get you up to speed and running really quickly with plotting.

I wanted to mention at the end though that you should really

not constrain yourself when you're plotting your data.

You know, think of these techniques as a starting point; anything's fair game.

Plotting and exploratory data analysis are an essential component of applied statistics,

and you can't really do any of the probability modeling that I'm suggesting

unless you dive into the data a little bit first.

It will give you a sense of whether your

probability modeling is ridiculous before you even start.

So it's an essential part.

And today we just gave you a handful of techniques, but when confronted with

a problem, you should attack the data with as many plots as you can think of and

they tend to be very informative.

I think Tukey called it interocular content; in other words,

the conclusion sort of hits you right between the eyes.

And that's what plots can do for

you that probability models can't do anywhere near as well.