This course answers the questions, What is data visualization and What is the power of visualization? It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

From this lesson

STATISTICAL GRAPHICS: DESIGN PRINCIPLES FOR Box Charts and QQ Plots

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous lectures, we discussed bar charts,

line charts, histograms, and really,

we talked about how exploratory data analysis wants us

to help identify underlying distributions,

probabilities, and really understand what's going on

with the data through visual representations.

Another trick to add to our bag of tools is Quantiles.

In this lecture, we want to help you understand what exactly a quantile is.

And our goal is to think about how we can create and use box plots and Q-Q

plots with this idea of quantiles to help us

explore and understand the probability distribution space.

So, quantiles are simply points taken at regular intervals

from a cumulative distribution function of a random variable.

What we really mean by this is that we think about a data set where we have, again,

titles of books and the number of pages.

Our histogram tried to bin up the number of pages to give us an idea of what

the distribution of pages might look like over the entire space of books.

What quantiles are trying to do is they're trying to split the data

into even ranges of probability,

so we would know exactly sort of the number of pages at which

25 percent of the books have less than this number of pages.

So, quantiles are just simply a way to define

regular intervals where we guarantee that we have 25 percent,

or X percent, not just 25 percent,

but any X percent of our data less than this value.
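Stated a bit more formally (this notation is an addition, not something shown in the lecture): if F is the cumulative distribution function, so that F(x) is the probability of a value at or below x, and p is the chosen fraction such as 0.25, then the quantile at p can be written as the smallest value whose cumulative probability reaches p.

```latex
Q(p) = \inf\{\, x : F(x) \ge p \,\}, \qquad 0 < p < 1
```

With p = 0.25, Q(p) is the number of pages that roughly 25 percent of the books fall below.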

So, how do we calculate a quantile?

How do we understand this?

Well, let's take this example of Distribution of at-bats.

So, this is about 20 different samples of people from, oh,

I think this was 2007 baseball statistics of the top 20 players,

and this was the number of at-bats they've taken.

And what I want you to try to do is think about, well,

what's the probability that a person might have

taken this number of at-bats in the data set?

And now with quantiles,

the easiest way to calculate a quantile is to open up R or Python,

and just run your quantile command to calculate this.

But we want to try to also think about how to do this by hand to get

a better understanding of what exactly a quantile means.
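As a rough sketch of what "just run your quantile command" might look like in Python, the standard library's `statistics.quantiles` can be used. The `at_bats` values below are made up for illustration; they are not the numbers from the slide. Library routines interpolate between samples using their own conventions, so the cut points may differ slightly from the hand calculation.

```python
import statistics

# Hypothetical at-bat counts for 20 players (illustrative values only,
# not the actual data shown on the lecture slide).
at_bats = [490, 502, 509, 514, 516, 525, 533, 548, 552, 560,
           571, 579, 580, 588, 595, 603, 611, 624, 640, 672]

# n=5 asks for the cut points that split the data into five
# equal-probability groups, so four cut points come back.
cut_points = statistics.quantiles(at_bats, n=5)
print(cut_points)
```

NumPy's `numpy.quantile` and R's `quantile()` play the same role; each lets you choose among several interpolation rules.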

So, the first step in figuring out quantiles is to sort the data.

So, step number one is sorting the data.

And so, I've already done this for us in the next slides,

so you'll notice that we have the same number of at-bats,

it's just that I've taken them and ordered them from lowest to highest,

and what we're trying to do with quantiles is figure out,

again, where the cutoff is that X percent of our data lies in a given range.

And so for example, I could say, "Okay,

I want to create five different quantiles."

And so, the user decides the number of quantiles.

So let's say, we want to have five quantiles.

What this means is I'm going to calculate the cutoff at 20 percent,

40 percent, 60 percent,

80 percent, and 100 percent.

And so, the first quantile really means

the cutoff point at which 20 percent of the samples are less than this value.

So, how many is 20 percent of the samples?

Well, I first need to figure out how many samples I have in my data set.

So all that really is, is a counting problem: I just need to count

how many baseball players I put in my data set here.

So, I have one, two, three, four, five, six, seven, eight, nine, 10,

11, 12, 13, 14,

15, 16, 17, 18, 19,

we have 20 baseball players in our data set.

All right. So, if I have 20 baseball players and I want five quantiles,

then 20 divided by five is four.

So since I've sorted the data,

and this is position number four,

then this is where my cutoff value,

my quantile is, so 514.

So I want to guarantee that four of my samples,

so four samples from my data set,

are going to have a number of at-bats less than this cutoff.

So the value at position four is 514, and since the position came out as an integer,

I really want to take the average between sample four and sample five,

because at 514 itself I have an "equal to," not a "less than."

So really, my 20th percentile here would be something like 515 or 516.

With 515, I can guarantee that four samples are less than 515.

Now for 40 percent:

that first cutoff was my quantile one,

so for my quantile two,

I have to take two times my number of samples,

so two times 20, over the number of quantiles I want.

So now, this is going to be eight.

So, I go to my eighth position, five, six,

seven, eight, this is my cutoff there.

So again, I could take the average between these two numbers,

and I could get something like 550.

So, I can guarantee that eight samples in my data set are less than 550.

And so to continue, for 60 percent,

it's the quantile number times the number of

samples divided by the number of quantiles you have.

And so, for the 60th percentile,

this is our third quantile, so three times 20 samples divided by

the five quantiles we're going to have,

and so we wind up with 12.

So again, we go to our 12th number here,

and we could say something like

579 and a half because we take the average between those two.

And we can continue filling out the rest.
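The by-hand steps above can be sketched in a few lines of Python. The function name `hand_quantiles` and the `at_bats` values are hypothetical; the values are made up, chosen only so that the first three cutoffs match the ones mentioned in the lecture (515, 550, and 579.5). The lecture only covers the case where the position comes out as an integer; rounding the position up otherwise is an assumption added here.

```python
import math

def hand_quantiles(values, num_quantiles):
    """Cutoffs found the by-hand way: sort, locate position k*n/q, average neighbours."""
    data = sorted(values)                  # step one: sort the data
    n = len(data)                          # count the samples
    cutoffs = []
    for k in range(1, num_quantiles):      # the 100 percent cutoff is just the maximum
        position = k * n / num_quantiles   # quantile number * samples / quantiles
        if position == int(position):
            i = int(position)
            # Integer position: average sample i and sample i+1 (1-based),
            # so that exactly i samples are strictly below the cutoff.
            cutoffs.append((data[i - 1] + data[i]) / 2)
        else:
            # Assumption (not from the lecture): round the position up.
            cutoffs.append(data[math.ceil(position) - 1])
    return cutoffs

# Hypothetical at-bat counts for 20 players (illustrative values only).
at_bats = [490, 502, 509, 514, 516, 525, 533, 548, 552, 560,
           571, 579, 580, 588, 595, 603, 611, 624, 640, 672]

print(hand_quantiles(at_bats, 5))  # [515.0, 550.0, 579.5, 607.0]
```

Asking for 10 or 100 quantiles only changes `num_quantiles`; the same rule then produces a finer split.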

And again, I didn't have to take five quantiles,

I could have taken 10,

I could have taken 15, I could have taken 100.

All of those are just going to try to split this up into a more refined space.

The nice thing about this is that I'm sort of

guaranteeing an equal probability distribution.

The problem with this

is that my data samples could actually all be identical.

So all these people could have taken the exact same number of at-bats, and then,

I would have, let's say 500, 500, 500,

500, 500, and so forth.

So, there's no "less than" if I split at my fourth sample:

if I had five quantiles, and 20 samples again,

I split here, I wouldn't be able to

calculate the quantiles for the data set that has all identical numbers.
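A tiny sketch makes the identical-values problem concrete. It assumes 20 samples that are all 500 and five quantiles, as in the example just described, and applies the same average-of-neighbours rule for the first cutoff.

```python
# Twenty players who all took exactly 500 at-bats (hypothetical data).
samples = [500] * 20
num_quantiles = 5

ordered = sorted(samples)
position = len(ordered) // num_quantiles                   # 20 / 5 = 4
cutoff = (ordered[position - 1] + ordered[position]) / 2   # average of samples 4 and 5

# Count how many samples are strictly below the cutoff.
below = sum(1 for value in ordered if value < cutoff)
print(cutoff, below)  # 500.0 0
```

The cutoff comes out as 500, but no sample is strictly below it, so the "20 percent of the samples are less than this value" reading no longer holds.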

So, just keep in mind that there are

some issues here. Still, this is exactly what

a quantile represents: the percentage chance

that a sample is going to be below this number.

So, we have a 20 percent chance that one of our samples is going to be below 515 at-bats.

We have a 40 percent chance that one of

our samples is going to be below 550, and so forth.

And quantiles are really useful because

they are less susceptible to long-tailed distributions and outliers.

Outliers don't really influence the quantiles.

If we think about our histogram,

if we have one book with a really huge number of pages,

I can wind up with one bin over here,

but then the rest of the data winds up smashed over into the lower end of the range.

Quantiles don't have this because

this one value point would just wind up in the upper quantile.
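To see this robustness in numbers, here is a small sketch comparing the mean with the median (the 50 percent quantile) before and after adding one very long book. The page counts are made up for illustration.

```python
import statistics

# Hypothetical page counts for ten books, then the same list plus one huge book.
pages = [180, 210, 240, 260, 280, 300, 320, 350, 400, 450]
with_outlier = pages + [5000]

mean_before = statistics.mean(pages)            # 299
mean_after = statistics.mean(with_outlier)      # about 726
median_before = statistics.median(pages)        # 290.0
median_after = statistics.median(with_outlier)  # 300

print(mean_before, mean_after)
print(median_before, median_after)
```

The single outlier more than doubles the mean, while the median moves by only ten pages; the huge book simply lands in the upper quantile.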

Often, quantiles are more descriptive statistics than means,

standard deviations, and so forth. So again, they help us

do more than just take moments of the data. Quantiles