This course answers the questions "What is data visualization?" and "What is the power of visualization?" It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, which are essential tools for exploratory data analysis.

From this lesson

Statistical Graphics: Design Principles for the Most Widely Used Data Visualization Charts

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous modules, we discussed

exploratory data analysis tools and introductory statistical graphics such as pie charts,

line charts, and bar charts.

But along with those types of visualizations,

we also really want to consider the non-data components of graphs.

And so, in this module,

we're going to learn about design principles for bar charts,

line charts, and methods on labeling axes and data elements within the plots.

So, key features for line charts and bar charts are really the axes, legends, and scales.

Previously, we discussed how aspect ratio can play

a huge role in the perception of different line charts and bar charts.

But the axis labels can also have a huge effect.

We want to be able to look at

a graph and also be able to discern data elements from there.

So, for example here,

we're showing three different lines.

We've got three different time series: series one, two and three.

We've got different categories of data,

so this could have been year one,

year two, year three, year four.

And we want to be able to have a user tell us,

"What value is this point that I've circled here?"

It's hard to do. We know it's close to four.

We know it's less than six.

Is this 4.2?

Is this 4.1?

How do we help people try to make good estimates of that?

Do these major grid lines cause clutter in the graph?

How do people understand those?

And what are our best practices for labeling, legend placement,

adding scales, and so forth for these sorts of images?

So, let's just first think about how we might consider

a problem where we have a bunch of data ranging from 105 to 543.

What I mean by this is you're given some sort of table of data.

And maybe this is price per furniture,

so maybe you have a type of furniture,

and then you have a price.

So maybe you have a couch and this was $543.

And maybe you have a mirror,

and maybe this was $105.

Maybe we have a chair,

and this was $211, and so forth.

And it turns out that we can find the maximum and minimum value

in this table and that's how we get our data range, 105-543.
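As a minimal sketch of that step, assuming the table holds only the three prices mentioned so far (the dictionary and the function name `data_range` are illustrative, not from the lecture):

```python
# Hypothetical furniture price table built from the three prices mentioned above.
prices = {"couch": 543, "mirror": 105, "chair": 211}


def data_range(values):
    """Return (minimum, maximum) of the values we intend to plot."""
    values = list(values)
    return min(values), max(values)


low, high = data_range(prices.values())
print(f"data range: {low}-{high}")
```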

And we want to think about, well,

how do I split up this data?

If I want to plot the number of couches sold,

the number of mirrors sold,

and so forth, how do I decide how to do this?

How do I decide how to break up this data range into meaningful numbers?

And so, if I'm going to plot,

for example, the price per piece of furniture, so,

this is my furniture on this axis,

and this is my cost.

So, for example, I can make a bar graph where this is my couch,

and I'm going to make a bar here,

and it's going to be 543.

How do I want to label my axes?

Should I have labeled a tick mark here with 543?

Is that the best number choice?

Well, it turns out that people often think about things that we describe as nice numbers.

People like round numbers,

powers of five, powers of 10, powers of two.

So we want to think about, how can we split up

this data range to cover those nice numbers?

Do I need my values to go from 0-543?

If I'm talking about price,

I may actually still want to have zero on here,

even though my min is 105,

because people want to know where sort of the floor is,

what's the cheapest thing I can get?

It could be that I just want to show more variance in my data range,

so I may want to go from 105-543.

But again, I may want to think about,

what's a nice number?

Should I instead take this maybe from 100-550,

because then I could label my axes here 100,

and I could put a tick mark up here for 550,

and then we would not necessarily have to have a tick mark here,

but we could have 550, we could have 525 or even 500.

We could go every 50.
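One way to picture this widening of the range, as a rough sketch only: round the minimum down and the maximum up to a multiple of the chosen step. The function name `expand_to_step` and the choice of step are assumptions made for illustration.

```python
import math


def expand_to_step(data_min, data_max, step):
    """Widen [data_min, data_max] to multiples of step and list the tick values."""
    low = math.floor(data_min / step) * step    # round the minimum down
    high = math.ceil(data_max / step) * step    # round the maximum up
    count = round((high - low) / step)
    return [low + i * step for i in range(count + 1)]


# The 105-543 price range with a tick every 50.
ticks = expand_to_step(105, 543, 50)
print(ticks[0], ticks[-1], len(ticks))
```

With a step of 50 the axis runs from 100 to 550, which matches the range suggested above. A step of 100 would push the top of the axis to 600.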

And so, there were people that came up with different ideas for

labeling algorithms on how to decide how to

best organize our data ranges and data labels.

And notice we can have similar problems when our data range goes from 2.03 to 2.17.

Again, how do I split this up?

Should this instead go from 2-2.2?

Should it go from 2-2.25?

How are people going to best understand this data range?

How are they going to interpret the axes numbers?

And Heckbert created a labeling algorithm where he's

trying to optimize on things like nice numbers,

where you're going to have to think about your design space.

So, if this is the area of the screen where I can create my graph,

how much room do I have for my tick marks?

How many tick marks do I want on my graph?

How close, or how packed do I want my labels?

If I have to write a really,

really large number like 2.05 and I only have so much space,

can I fit another label right here?

Now they're really cluttered together there.

We can't really read them very well.

So I need to have spacing between my labels, too.
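A very rough way to check this spacing, sketched under strong assumptions: labels are evenly spaced along a horizontal axis, every character has the same width, and each label is centered on its tick. The function `labels_fit` and its pixel parameters are hypothetical and are not part of any of the algorithms named in this lecture.

```python
def labels_fit(labels, axis_length_px, char_width_px=7.0, min_gap_px=4.0):
    """Return True if evenly spaced, centered labels leave at least min_gap_px between neighbors."""
    if len(labels) < 2:
        return True
    spacing = axis_length_px / (len(labels) - 1)       # distance between tick marks
    widest = max(len(text) for text in labels) * char_width_px
    # Two centered labels of equal width need one full label width plus the gap.
    return widest + min_gap_px <= spacing


labels = ["2.00", "2.05", "2.10", "2.15", "2.20"]
print(labels_fit(labels, axis_length_px=200))   # roomy axis
print(labels_fit(labels, axis_length_px=100))   # cramped axis
```

A real layout engine would measure the rendered text rather than count characters, so treat the numbers here as placeholders.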

So this code I'm showing here was Heckbert's "Nice Numbers for Graph Labels,"

where he's trying to think about things like the desired number of tick marks,

the tick mark spacing,

the graph range minimum and maximum,

how many different digits to show.

And so, he's created an optimization problem to try to organize

the exact number of tick marks for a particular graph.
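The code from the slide is not reproduced in this transcript. The following is a Python sketch of the loose-labeling routine as it is commonly presented from Heckbert's "Nice Numbers for Graph Labels"; the function and variable names are my own, and it assumes the maximum is strictly greater than the minimum.

```python
import math


def nice_num(x, round_result):
    """Return a 'nice' number (1, 2, 5, or 10 times a power of ten) close to x > 0."""
    exp = math.floor(math.log10(x))
    frac = x / 10.0 ** exp                      # leading part, between 1 and 10
    if round_result:                            # round to the nearest nice value
        nice = 1 if frac < 1.5 else 2 if frac < 3 else 5 if frac < 7 else 10
    else:                                       # take the smallest nice value >= frac
        nice = 1 if frac <= 1 else 2 if frac <= 2 else 5 if frac <= 5 else 10
    return nice * 10.0 ** exp


def loose_labels(data_min, data_max, n_ticks=5):
    """Return tick labels that extend to nice numbers around the data range."""
    span = nice_num(data_max - data_min, False)
    step = nice_num(span / (n_ticks - 1), True)            # tick mark spacing
    graph_min = math.floor(data_min / step) * step         # graph range minimum
    graph_max = math.ceil(data_max / step) * step          # graph range maximum
    digits = max(-math.floor(math.log10(step)), 0)         # fractional digits to show
    count = round((graph_max - graph_min) / step)
    return [round(graph_min + i * step, digits) for i in range(count + 1)]


print(loose_labels(105, 543))     # the furniture price range
print(loose_labels(2.03, 2.17))   # the small-number range
```

Here `n_ticks` is only the desired number of tick marks. The routine treats it as a hint, so the number of labels returned can differ from it.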

And so the problem is, for small numbers,

the range of labels can be a lot larger than the data range.

So, for example, if we had that 2.03-2.17,

notice now I've got to the hundredths decimal place here.

So how am I going to create all these labels?

Do I really need 100 labels to go from 2.03 to 2.17? 14 labels?

How many labels make sense there?

So, one solution is to drop labels which overlap or fall outside the data range.

But this can lead to unevenly spaced labels,

or axes with only one label.
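A small sketch of why dropping labels can go wrong, covering only labels that fall outside the data range and not labels that overlap. The candidate labels and the second, narrower data range (2.03 to 2.07) are made up for illustration, and `labels_inside` is a hypothetical helper.

```python
def labels_inside(labels, data_min, data_max):
    """Keep only the labels that fall within the data range."""
    return [value for value in labels if data_min <= value <= data_max]


candidates = [2.0, 2.05, 2.1, 2.15, 2.2]

# Data from 2.03 to 2.17: the end labels are dropped, three remain.
print(labels_inside(candidates, 2.03, 2.17))

# Hypothetical narrower data from 2.03 to 2.07: a single label survives.
print(labels_inside(candidates, 2.03, 2.07))
```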

Or, like I said, if we want a zero point somewhere,

to help ground the user in the data when we're talking about price,

or number of votes, or things like that,

we may actually want to even make our range larger.

And so, Wilkinson came up with ideas on how

to find an optimal number of tick marks, and labels, and things.

And so, there's been a lot of work in creating different algorithms,

similar to that one I showed you from Heckbert,

on how to automatically create different labels for graphs.

And so, here's the same data set again with four different labeling schemes.

So here's Heckbert's labeling scheme.

You notice he wound up with three labels on the x axis,

and four on the y axis.

Wilkinson differed in that he had,

again, four labels

on the y axis, but he had seven labels on the x axis here.

So a lot more. But notice his numbers maybe aren't as nice,

where he goes 36, and here is of course 36 here, the minimum.

But maybe 20 is nicer for the user to understand.

Here we can see R's pretty,

where it winds up putting the 36 at the very bottom so we don't get a lot of white space.

But we don't have nearly as many labels as

Wilkinson's. And Wilkinson's extended tick mark algorithm

looks like a combination between those two,

where we get a little bit nicer numbers labelled.

Now the nice thing about these is a lot of

these algorithms have already been implemented and developed.

So R's pretty, for example,

if you're an expert with R code,

and you're using the R statistical package to create these different graphs,

you can just call the pretty command to create nice labels for your graphs.

And the different things people are trying to optimize on range

from coverage of the tick marks, so,

how much of the data range do they cover, the legibility,

notice this has font size as one of the parameters,

so, how big is my font?

What orientation, am I going to make this horizontal, vertical?

How much overlap do I want to try to prevent,

so that we don't have labels running into each other?
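To make the idea of coverage concrete, here is a deliberately simple score: the overlap between the data range and the label range, divided by their combined extent. This is an illustrative measure of my own, not the formula used by any of the algorithms named here, and `coverage` is a hypothetical name.

```python
def coverage(data_min, data_max, label_min, label_max):
    """Score in [0, 1]: 1.0 means the labels span exactly the data range."""
    overlap = max(0.0, min(data_max, label_max) - max(data_min, label_min))
    union = max(data_max, label_max) - min(data_min, label_min)
    return overlap / union if union > 0 else 0.0


# Price data from 105 to 543 against three candidate label ranges.
print(coverage(105, 543, 100, 550))   # tight fit around the data
print(coverage(105, 543, 100, 600))   # extra white space at the top
print(coverage(105, 543, 0, 600))     # zero included to ground the reader
```

A score like this would have to be traded off against the other criteria. Including zero lowers the coverage but may still be the right call for prices or vote counts.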

And this labeling problem,

where to place labels,

has been a challenging problem in computer graphics for a long time.

As you get more and more data,

especially in this era of big data,

and we're trying to add more and more elements to explain things to the user,

we want to consider how legible the resulting graph needs to be.

And so, we have all of these different non-data components of graphs,

that we can think about how to optimize

label placement for being the most legible, the most readable.

So we need to not only think about what's the right type of graph

to show to the user given our type of data,

what type of graph lets us explore the data and analyze it.

We also need to think about the width and

height of the area that we're drawing the graph in,

and how many tick marks,

how to place them,

what font size, and then where to fit in the legends for these, as well.