This course answers the questions "What is data visualization?" and "What is the power of visualization?" It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

From this lesson

Statistical Graphics: Design Principles for the Most Widely Used Data Visualization Charts

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous modules, we discussed introductory statistical graphics including pie charts,

line charts and bar charts.

We're given a table of data with nominal, interval, ordinal,

or ratio data, different categories of data, time series data.

We talked about how we could create a graph to begin looking at those.

A histogram uses methods of aggregation to try to look at

the distribution of data within a certain data set.

And typically, they're represented by

very specialized bar charts showing this sort of binning of different data.

So in this lecture,

we're going to define the design principles for histograms and

the impact of parameter choices on the resultant visualization.

And so, here's an example of a histogram for book length,

so pages in a library.

And so we could imagine our data set looking something like this.

We've got the title of the book,

and we have the number of pages.

So we have, for example, Huck Finn.

And we have, I don't know, 1,112 pages.

We have War and Peace,

and we've got 3,267 pages.

We have Harry Potter.

And we have 972 pages and so forth.

And so with a data set like this,

we previously talked about

just making a bar chart where we had each book

and the number of pages.

So we had each bar separately; each bar was a category.

And so we had Huck Finn,

we had War and Peace, we had Harry Potter.

But this doesn't tell us about

the average number of pages a book has in the library, or the standard deviation.

What we may want to know is really the distribution

of the number of pages within a data set.

And what a histogram is doing,

is it's going to try to think about how we can aggregate the data

together and look at the shape of how this variable,

how the number of pages,

is distributed within the data set.

This goes back to the idea of data distribution.

I'm wanting to take a moment with the data and really

understand the different underlying properties there.

So in terms of the title,

we may want to know how many different titles of books we have,

but we may also want to know things like the average number of

pages, or whether the number of pages is distributed in a normal fashion.

Is there some sort of bi-modal distribution,

where there's two peaks in the data set?

We can even think about splitting this up into

different categories such as fiction, nonfiction, sci-fi,

so we can slice and dice our data in a whole lot of different directions to

get different details out of the information.

And to create a histogram,

what we're trying to do is really think about how to create a bin.

So for example, this bin goes from 22-36,

and the next bin goes from 36-50.

So 22-36 is a width of 14.

So every 14 pages,

we create a bin.

So we count the number of books that have between 22 and 36 pages and that's one.

Here, we look at the number of books between 232 and 246 pages, and we have 84.

So we're not looking at anything to do with the title column.

What we're worried about really is how do we count the number of

books in the data set with a page number between that range?
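
As a rough illustration of the counting step just described, here is a minimal Python sketch. The function name `count_by_bin` and the sample page counts are made up for illustration; only the starting edge of 22 and the bin width of 14 come from the example. It also assumes each bin includes its lower edge and excludes its upper edge, which the lecture does not specify.

```python
from collections import Counter

def count_by_bin(page_counts, start, width):
    """Count how many values fall into each half-open bin [lo, lo + width)."""
    counts = Counter()
    for pages in page_counts:
        index = (pages - start) // width   # which bin this book lands in
        lo = start + index * width         # lower edge of that bin
        counts[(lo, lo + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical page counts, not the lecture's data set.
pages = [25, 40, 44, 49, 233, 240, 245]
bin_counts = count_by_bin(pages, start=22, width=14)
print(bin_counts)
```

Note that the title column never enters the computation; only the page counts are used.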

And when we create a histogram,

we wind up with sort of two choices.

We wind up with the number of bins.

So what I mean is, how many bars are we going to show on the graph?

And we wind up with the bin width.

So in this example, the bin width was 14.

And the choice of those two parameters is greatly going to

influence what the histogram looks like.

So for example, the user can specify the number of bins.

So if we have the number of bins be a variable K in this case,

the user can specify K, or we can choose it from a suggested bin width h. And so again,

the user is going to either suggest h or suggest K. And once we have one of these,

then we can get the other one.

If I give you K, you can also find h. And the max and

the min is going to be that data range we were looking for.

So in our data example where I had the title and the number of pages,

this is going to be the max value in this column,

so which book has the most pages.

And this is going to be the minimum value,

which book has the smallest number of pages.

So for example, if we had War and Peace,

and we said this was 2,171 pages,

and we have a whole list.

And maybe somewhere, we have

one of our little golden books for children that has about 10 pages.

If we do the max and min on this data set,

we'd wind up with 10 and 2,171.

And I realize, though, that in our data set,

we could have had a whole bunch of books above there, below there.

And we just need to call the max and min function to find those.

Then the user can say, "Well, okay.

I only want to have seven bins in my histogram,

so then I need to solve for h." And I can solve for h by simply swapping,

and I can have max of my data minus min of

my data over K. And that'll let me solve for h,

where h is the bin width.
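
The relationship between K and h can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rounding up when converting a width into a bin count is an assumption made here so the bins cover the whole range. The 10 and 2,171 page values are the minimum and maximum from the example above.

```python
import math

def bin_width_from_count(min_value, max_value, k):
    """Bin width h when the number of bins k is given: h = (max - min) / k."""
    return (max_value - min_value) / k

def bin_count_from_width(min_value, max_value, h):
    """Number of bins k for a suggested width h, rounded up to cover the range."""
    return math.ceil((max_value - min_value) / h)

# Seven bins over the range 10..2,171 pages.
h = bin_width_from_count(10, 2171, k=7)
# Going the other way: a suggested width of 400 pages (an arbitrary value).
k = bin_count_from_width(10, 2171, h=400)
print(h, k)
```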

But it turns out,

there are a lot of common choices for choosing K. Normally,

we don't let the user just tell us how many bins they want;

instead, we try to let the data lead us to the correct choice.

And one of the most common choices for the number of bins, which is K,

remember K again is the number of bins,

is to take the square root of N, the number of data items we have.

And so, what do I mean by N?

Well, again, if I have a data set where it's title and number of pages,

N is going to be the number of rows in my data set.

So, if I have 100 books in the library,

then K is going to be the square root of 100.

And so, in this case, K would equal 10.

Now, of course, you want to think about,

I can't have a fractional number of bins.

So, typically, you'll take either the ceiling or the floor for this bin choice.
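
A minimal sketch of the square-root choice in Python, assuming a hypothetical helper named `sqrt_choice` with a flag to pick between the ceiling and the floor:

```python
import math

def sqrt_choice(n, round_up=True):
    """Number of bins as the square root of the row count n, rounded to a whole number."""
    root = math.sqrt(n)
    return math.ceil(root) if round_up else math.floor(root)

k_100 = sqrt_choice(100)                          # 100 books, as in the example
k_1000_up = sqrt_choice(1000)                     # ceiling
k_1000_down = sqrt_choice(1000, round_up=False)   # floor
print(k_100, k_1000_up, k_1000_down)
```

For 100 rows the square root is already a whole number, so the rounding direction only matters for counts such as 1,000.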

Now, this particular choice works really well if the data,

so this column of data, is already normally distributed.

What we're trying to always do is think about how we can

make the data fit some sort of normal distribution pattern.

And this is where things like Sturges' formula come in.

So, this is trying to think about the number of data items on a log scale.

So again, remember N is the number of rows in our data set.

Here, Scott's choice tries to learn more about

the data by taking the standard deviation of the data,

and dividing by the cube root of the number of samples in the data.

And the Freedman-Diaconis rule uses

the interquartile range along with the cube root of the number of data samples.

And we're going to talk more about what IQR

is in a different module, a different lecture.
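
The three rules named above can be sketched with Python's standard library. The constants below are the commonly published textbook forms (Sturges: ceil(log2 N) + 1 bins; Scott: width 3.49·s / N^(1/3); Freedman-Diaconis: width 2·IQR / N^(1/3)); whether the lecture slides use exactly these constants is an assumption, and the function names and the sample data are made up.

```python
import math
import statistics

def sturges_bins(n):
    """Sturges' formula: number of bins from the log (base 2) of the row count."""
    return math.ceil(math.log2(n)) + 1

def scott_width(values):
    """Scott's choice: bin width from the sample standard deviation."""
    return 3.49 * statistics.stdev(values) / len(values) ** (1 / 3)

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: bin width from the interquartile range (IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

# Hypothetical page counts 1..100, only to exercise the functions.
pages = list(range(1, 101))
print(sturges_bins(1000), scott_width(pages), freedman_diaconis_width(pages))
```

Sturges' formula returns a bin count directly, while the other two return a bin width, which can then be converted into a bin count by dividing the data range by the width.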

And so for this histogram,

imagine again, we have 1,000 books.

So, we've got our title,

we've got our number of pages,

and we've got 1,000 rows here.

And so we can use all of those common choices that I just showed, Scott's choice,

square root choice, Freedman-Diaconis,

to create a histogram of the data.

And to create a histogram,

let's say that my page range is such that the biggest book has 10,000 pages,

and the smallest book has one page.

Okay. So, if I do this,

and I've got 1,000 samples,

my square-root choice is going to be the square root of 1,000.

All right? So, how many bins is the square root of 1,000?

So, 10 squared is 100,

20 squared is 400,

30 squared is 900.

So, we're somewhere above 30 bins, right?

And we can put this in our calculator, figure this out.

But, once we know the number of bins,

we can then find the bin width.

So, let's just say in this case,

we want to just have a user-defined K equals four.

So, we're going to have four bins,

and my book pages now range from one to 10,000.

I have to figure out my bin width.

So, I've got 10,000 minus one.

This was my max x

minus my min x, divided by h,

is going to equal K, and if the user-defined K is four, now

I can solve for h. And maybe to make this nicer, I don't want to use my min x as one;

I can say, okay, well, a nicer number would have been zero.

We talked about nice numbers before,

so I can take 10,000 divided by four and that would give me 2,500.

So, what this means is now

my bins are going to go from zero to 2,500,

2,500 to 5,000, 5,000 to 7,500, and 7,500 to 10,000.

And what I do is for each book, each row here,

if this particular book has 10,000 pages,

that means I add one to this bin.

If this book has 6,789 pages,

that's between 5,000-7,500, so I add one to this bin.

Let's say again, this has 6,215 pages,

I add another one to this bin because it's between 5,000-7,500.

So, a histogram is just counting up how many things fall in each bin,

and we figure out this K and this H based on our different rules.
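
The worked example above (four bins over zero to 10,000 pages) can be sketched as follows. The function name `histogram_counts` is hypothetical, and placing a value equal to the maximum into the last bin is an assumption made here so that the 10,000-page book is counted; values below the lower edge are not handled.

```python
def histogram_counts(values, lo, hi, k):
    """Return the bin width and a list of k counts for equal-width bins over [lo, hi]."""
    h = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # A value equal to hi would index one past the end, so clamp it to the last bin.
        index = min(int((v - lo) // h), k - 1)
        counts[index] += 1
    return h, counts

# The three page counts mentioned in the example.
h, counts = histogram_counts([10_000, 6_789, 6_215], lo=0, hi=10_000, k=4)
print(h, counts)
```

The two books with 6,789 and 6,215 pages both land in the 5,000 to 7,500 bin, and the 10,000-page book lands in the last bin.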

And so for our imaginary data set here,

the square root choice looks something like this,

where we take the square root of 1,000 books and we wind up with over 30 bins,

and we get a distribution like this,

and we can sort of see, well,

maybe there's two peaks if I connect these with lines,

I see this little dip here at 38.

And so, I may have the impression that perhaps there is

some multi-modality going on in the data set.

However, if I use Sturges' formula,

I wind up with a

much smaller number of bins,

and I wind up with what looks like a nice sort of

almost normal distribution, even skewed a little bit towards the lower edges here.

And now, normally in a histogram, too,

we wouldn't allow any space in between the bins because

the spacing suggests that the data may be categorical.

So this is part of the problem with trying to draw things like this in Excel,

and why we're going to learn to do things in Python, where

we have more control over the design space.

Scott's choice, again, looks very similar.

We see this sort of nice normal distribution pattern for the most part.

Again, different number of bins.

And the Freedman-Diaconis rule.

And so, we can see, depending on the underlying data distribution,

the histogram can give us

slightly different views of the data distributions and it's really

impacted heavily by the number of bins and the bin width.

So those are the choices that we get to make for the design aspects of this.

And again, depending on what we choose,

we can have drastically different looking histograms that can even cover things up if

we do wind up with something like

multi-modality, and depending on how I've binned the data,

I could transform data that looks like this into this.
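
To make the effect of the bin choice concrete, here is a small sketch with made-up data: 100 hypothetical books whose page counts form two clusters. The helper `bin_counts`, the data, and the bin counts of eight and four are all illustrative assumptions, not the lecture's data set.

```python
def bin_counts(values, lo, hi, k):
    """Counts for k equal-width bins over [lo, hi]; a value equal to hi goes in the last bin."""
    h = (hi - lo) / k
    counts = [0] * k
    for v in values:
        counts[min(int((v - lo) // h), k - 1)] += 1
    return counts

# Made-up data: books placed at the midpoints of eight 100-page ranges,
# with peaks in the third and sixth ranges and a dip between them.
per_range = [2, 10, 30, 8, 8, 30, 10, 2]
pages = [100 * i + 50 for i, c in enumerate(per_range) for _ in range(c)]

fine = bin_counts(pages, 0, 800, k=8)     # two peaks with a dip in between
coarse = bin_counts(pages, 0, 800, k=4)   # the dip disappears into one flat-topped hump
print(fine)
print(coarse)
```

With eight bins the two peaks are visible; with four bins the same data looks like a single hump.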

Now what's interesting is,

histograms are primarily our first-look data analytics tool.

We're trying to learn about what the underlying probability distribution of that data is,

how likely is it that we might find a book with only 10 pages

versus a book with 10,000 pages and things like this.

So this is often sort of your go-to tool for data detectives trying to look at,

explore, and understand data.

So oftentimes, you may have a book title,

you may have things like the number of pages,

you may even have a category and how much the book sold for or the cost,

you may know even the quantity of books sold.

And for each column,

you may want to make a histogram so you can understand what's going on with the data.

And the main thing there is,

you can also quickly tell if there are errors in the data.

Imagine if my histogram looked like this for the number of pages.

Why would there be so many books with zero to whatever number of

pages? This likely means that something in the page column has an error.
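
One way to turn this kind of visual check into a quick numeric one is to measure how much of a column falls into the first bin. The sketch below is only an illustration: the function name, the sample values, and the 50% threshold are all arbitrary assumptions, and a large share does not prove an error, it only flags the column for a closer look.

```python
def share_in_first_bin(values, lo, h):
    """Fraction of values that fall into the first bin [lo, lo + h)."""
    in_first = sum(1 for v in values if lo <= v < lo + h)
    return in_first / len(values)

# Hypothetical page column where missing values were recorded as 0.
pages = [0, 0, 0, 0, 0, 0, 180, 240, 310, 420]
share = share_in_first_bin(pages, lo=0, h=100)
suspicious = share > 0.5   # arbitrary threshold, chosen only for this example
print(share, suspicious)
```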

So again, histograms are often our first look tool for thinking

about what's going on in each of our data columns.

And these are typically not nominal or ordinal data but these

are ratio or interval data.

And so again, we get the choice of bin width,

the choice of number of bins,

we pack the bins together like we show in this picture here,

and we have to think again about designing labels and designing

aspect ratio; all of these elements

we've been learning about go into the creation of this.