This course answers the questions, What is data visualization and What is the power of visualization? It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

From this lesson

STATISTICAL GRAPHICS: DESIGN PRINCIPLES FOR Box Charts and QQ Plots

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Previously, we talked about how to calculate quantiles to help us

understand different distributions in a data set.

But we didn't really talk about ways to visualize those quantiles.

We've talked about line charts, bar charts, histograms.

But another key tool for

a data detective is this idea of a box-and-whisker plot.

And a box-and-whisker plot is typically used to represent what we call quartiles.

And quartiles are simply the quantiles we get when we ask for four of them.

So remember, we talked about how to calculate quantiles, so

the user defined the number of quantiles they wanted,

and then we had the number of samples, divided by the number of quantiles,

times the quantile number that we were at, and we calculated all those values.

Well, if the number of quantiles is equal to 4, we call these quartiles instead.

And from the quartiles, we can create what we call a box-and-whisker plot.

So this first line in the box plot is our Q1 value,

the second line is our Q2, and this is our Q3.

And then the whiskers are often used to represent a variety of different things,

most commonly probably the min and max values in the data range,

which allows you to sort of see if there are outliers or skewness.

And this Q2 for quartiles is what we call the median.

So, notice this is not the mean, but this is the median.

This is the point at which we can sort of estimate that 50% of the samples of

our data are going to be less than that value.
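
To make the median-versus-mean point concrete, here is a small Python sketch. The numbers are made up for illustration and are not from the lecture's data; the only point is that one very large sample drags the mean upward while the median stays at the middle of the sorted values.

```python
import statistics

# Made-up values with one very large sample (not the lecture's data).
values = [1, 2, 3, 4, 100]

mean_value = statistics.mean(values)      # pulled upward by the 100
median_value = statistics.median(values)  # middle value of the sorted data

print("mean:", mean_value)
print("median:", median_value)
```

Half of the samples sit at or below the median, which is why the box plot draws Q2 there rather than at the mean.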

And

there are a whole lot of different things we can encode with box-and-whisker plots.

But like I said, the main ones are the lower quartile, the upper quartile,

and the median.

We also have the upper extreme and the lower extreme.

But we get to decide how we want to draw these whiskers.

And we also get choices on things like the width of the box as well.

And given these different properties of our box-and-whisker plot,

we can actually create alternate forms of the box-and-whisker plot too.

So, for example, we can modify the width of the box.

So we could map this to the size of a group, for example.

So, if we're comparing, perhaps, in the previous example,

we showed baseball players, so we could look at at-bats from the American League

versus at-bats from the National League.

And the width of the box could really be the number of players in

the American League versus the National League.

We could think about this for pages in a book in the library.

So we could have non-fiction books versus fiction books, and

we could group this out.

And we can make the width proportional to things like the square root of the group size or

the group size itself.
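
As a rough sketch of the square-root option in Python: the function name `box_widths`, the `max_width` setting, and the group sizes below are all assumptions made for illustration, not values from the lecture.

```python
import math

def box_widths(group_sizes, max_width=0.8):
    """Scale each box width by the square root of its group size.

    The largest group gets max_width; the others shrink in proportion.
    max_width is an arbitrary plotting choice, not a lecture value.
    """
    largest = max(group_sizes.values())
    return {
        name: max_width * math.sqrt(size) / math.sqrt(largest)
        for name, size in group_sizes.items()
    }

# Hypothetical group sizes, for illustration only.
widths = box_widths({"fiction": 400, "non-fiction": 100})
print(widths)
```

With the square root, a group four times as large gets a box only twice as wide, which keeps very large groups from swamping the small ones on the page.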

We could use notches, where the size of the notches at these different lines,

like here, can be proportional to the interquartile range.

And this idea of interquartile range has come up before in data detective.

And what we mean by interquartile range, so

this Q is not inter-quantile range, but again interquartile range,

which means we've got our four values, so we've got an R instead of an N.

This is going to be Q3 - Q1, which is our IQR.

So it's just the height, the length of that box.
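
In code, that subtraction is a one-liner. This is only a minimal sketch: the function name is an illustration, and the Q1 and Q3 values are the ones from the batting-average example worked through in this lecture.

```python
def interquartile_range(q1, q3):
    """IQR is the length of the box: Q3 minus Q1."""
    return q3 - q1

# Q1 and Q3 from the batting-average example in this lecture.
iqr = interquartile_range(0.307, 0.321)
print(round(iqr, 3))
```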

And so we can quickly calculate the IQR by solving for our quantiles.

Now, remember, the first step is to sort the data.

And then the second step is to take our quantile number,

times the number of samples, divided by the number of quantiles.

And we go through and get all of our values.
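
The second step can be sketched in Python as follows. The function name `quantile_positions` is an assumption for illustration; it only computes the raw positions in the sorted data, and a fractional position still has to be turned into a whole sample index.

```python
def quantile_positions(n_samples, n_quantiles):
    """Return k * n_samples / n_quantiles for k = 1 .. n_quantiles."""
    return [k * n_samples / n_quantiles for k in range(1, n_quantiles + 1)]

# 10 samples split into quartiles (4 quantiles).
positions = quantile_positions(10, 4)
print(positions)
```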

And so, a box plot typically just focuses on four different quantiles.

So we can go through an example here on how to create a box-and-whisker plot

again, now taking a batting average.

So I've got a group of players, I have 10 players.

And I want to make a box plot based on,

let's do the RBIs here.

I'm sorry, not RBIs, let's do batting average,

because I've already sorted the column for batting average.

So remember, the first step is to sort the data.

The second step is we have to take the number of quantiles, so

we're going to calculate Q1.

So this is going to be 1 times number of samples, 1 times 10,

divided by 4, because we want quartiles.

And so, 10 divided by 4, so the position of Q1 is going to be 2.5.

So we have to guarantee that two and a half samples fall below Q1.

So we take the ceiling, so this gives us 3.

So one, two, here's the third sample.

So we can say Q1 is going to equal 0.307.

I can guarantee that two and a half samples are less than 0.307.

I can find Q2.

Q2 is going to be 2 times 10 over 4,

so that's going to be 5, so 4, 5.

Now, notice that for my fifth sample, I've got 0.312.

I can't take the average between the fifth and sixth sample,

because they're both 0.312.

And so instead, I have to go all the way up to this sample of 0.315.

And so, I may want to take the average between the 0.312 and

the 0.315 to guarantee that 50% of my data is less than this value.

So I may wind up with something like 0.313 here for Q2.

Q3 is going to be 3 times 10 over 4, so we're going to get 7.5.

We can round up to the ceiling here, so we've got 5, 6, 7, 8, so we get 0.321.

So I guarantee that seven and a half of my samples are less than 0.321.

And then Q4 is just going to be my max, so 0.336.
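
The whole walk-through can be sketched in Python under some stated assumptions. Only the 3rd, 5th, 6th, 7th, 8th and 10th batting averages are read out in the lecture, so the other four values in the list are placeholders chosen just to keep the list sorted. The function name `quantile_value` is also an illustration. The Q2 step mirrors the tie adjustment described above, which is one possible way to handle ties rather than a standard rule.

```python
import math

def quantile_value(sorted_values, k, n_quantiles=4):
    """Pick the k-th quantile using the round-up (ceiling) rule."""
    position = math.ceil(k * len(sorted_values) / n_quantiles)  # 1-based
    return sorted_values[position - 1]

# Sorted batting averages. The 3rd, 5th, 6th, 7th, 8th and 10th values are
# the ones read out in the lecture; the rest are placeholders (assumed).
averages = [0.301, 0.304, 0.307, 0.310, 0.312,
            0.312, 0.315, 0.321, 0.329, 0.336]

q1 = quantile_value(averages, 1)
q2_raw = quantile_value(averages, 2)
q3 = quantile_value(averages, 3)
q4 = quantile_value(averages, 4)

# Tie adjustment for Q2 as described above: average the tied value with
# the next larger distinct value in the data.
next_larger = min(v for v in averages if v > q2_raw)
q2 = (q2_raw + next_larger) / 2

print("Q1:", q1)
print("Q2 before and after the tie adjustment:", q2_raw, q2)
print("Q3:", q3)
print("Q4 (max):", q4)
```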

And now I can draw my box plot.

And of course I want to scale an axis, but this line would represent my 0.307.

My median line would represent my 0.313.

My Q3 line would represent my 0.321.

And then I have my max and my min here.

And that's how we create our box plot.

Now, imagine if I had another group, if this was American League,

I can draw another group next to it.

And this allows me to directly compare data distributions between

different groups.

So, for example, if I have group A and group B,

I can draw their box plots next to each other, I can see that group B has a much,

much lower, let's say, batting average than group A.

But we can see that group A has a much,

much longer skew in the distribution here,

where the distance from Q3 to Q2 is smaller than Q2 to Q1, but

much, much closer in proportion than in group A.

The other addition we can make is,

instead of having the whiskers be the max and min,

sometimes the whiskers are equal to 2 times the IQR.

And so the whisker may wind up being here and here.

And if there are samples outside of those,

we'd then draw those samples with little stars, and those stars may be outliers.

So the whiskers can be defined based on some method of

automated outlier detection.
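
One way to sketch this IQR-based whisker rule in Python is shown below. It assumes the whiskers are measured outward from the edges of the box (Q1 and Q3), which the lecture does not spell out. The default multiplier of 2 follows the lecture; 1.5 is also a widely used convention. The function names and the sample list are made up for illustration.

```python
def whisker_fences(q1, q3, multiplier):
    """Fences placed multiplier * IQR beyond the box edges (an assumption)."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def find_outliers(values, q1, q3, multiplier=2.0):
    """Return the samples that fall outside the whisker fences."""
    low, high = whisker_fences(q1, q3, multiplier)
    return [v for v in values if v < low or v > high]

# Made-up samples; Q1 and Q3 are the lecture's example values.
samples = [0.250, 0.307, 0.312, 0.321, 0.336, 0.400]
outliers = find_outliers(samples, q1=0.307, q3=0.321)
print(outliers)
```

The samples returned here are the ones that would be drawn as individual stars beyond the whiskers.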

And so, again, we can take our knowledge of quantiles and transfer this to a visual

representation, the most common of which is the box-and-whisker plot.

It allows us to quickly see things like the median, the interquartile range,

potential outliers, and

allows us to even compare statistical distributions between different data sets.