This course answers the questions, What is data visualization and What is the power of visualization? It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, tools that are essential to exploratory data analysis.

From this lesson

Statistical Graphics: Design Principles for the Most Widely Used Data Visualization Charts

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

As we talked about, part of exploratory data analysis is visualizing the data.

And we've discussed how to design pie charts,

but the pie chart is just one element of the data detective's toolkit.

In this lecture, we'd also like to discuss elements of designing bar and line charts.

So, the first question that a designer needs to ask

himself is really which type of graph should I use?

And we have several different types that are

relatively easy to design using tools like Excel,

Tableau, and other software.

So, the first one we already talked about was a pie chart.

So if we're comparing parts of the whole,

a pie chart might be a good type of graph to use.

So, for example, we talked about comparing the number of students with each grade in a class.

So all the students make up the class,

each student has a grade and so we could

design a pie chart to show us the entirety of the class.

A bar chart may be good for comparing between groups or even comparing data over time,

and a line chart is primarily used for exploring changes over time.

In this lecture, we're going to show some examples of different bar charts,

line charts, and their use.

And so this is an example of pie charts,

bar charts, and line charts for Non-Time Series Data.

In the upper left, I'm showing you some European Parliament Party Distributions.

So, for example, the leftmost bar here is EPP.

And EPP has about 225 members, compared to GUE/NGL, which has approximately 50, and so forth.

And so for each of these bars,

the height of the bar represents the number of people in a party.

So again, this is a categorical distribution.

And we can see the pie chart is showing the exact same thing as our bar chart here.

Each slice of the pie is representing one particular party.

So for example again, the EPP shown in this bar is the same

as the dark blue color shown here in this pie slice.

Now, we can see some of the problems with showing this as a line chart.

The line chart has just changed the bar representation directly to a line representation.

So each point where EPP meets this 225,

that point on the line represents the number of parliament members from the EPP party.

But notice these connections along

the line suggest some connectivity between the data points.

These are all independent categories.

So this gives the viewer a perception

of connections that don't actually exist within the data.

On top of that, we can even take the data and reorder it.

And if I reordered the data,

I would wind up creating a line chart that looks

even more perceptually odd and may imply different connections there.

There's no reason why I can't switch

any bar with any other position on the graph just like

I can move any pie piece to any other position on the circle just by rotating.
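
The reordering problem can be made concrete by computing the point-to-point changes that a line chart would draw. This is a minimal sketch, assuming made-up counts for four hypothetical parties A through D (not the real parliament figures); `segment_changes` is a helper name invented for this illustration.

```python
def segment_changes(values):
    """Change between neighbouring points, i.e. what each line segment shows."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical member counts for four independent categories.
counts = {"A": 225, "B": 50, "C": 190, "D": 70}

original_order = ["A", "B", "C", "D"]
sorted_order = sorted(counts, key=counts.get, reverse=True)  # A, C, D, B

changes_original = segment_changes([counts[k] for k in original_order])
changes_sorted = segment_changes([counts[k] for k in sorted_order])

print(changes_original)  # [-175, 140, -120]: reads as a zigzag
print(changes_sorted)    # [-35, -120, -20]: reads as a steady decline
```

The bar heights are identical in both orderings. Only the line segments change, which suggests they carry no information about the data itself.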

And so with categorical data,

we have to consider which type of graph is best

to use, and often we stick with pie charts and bar charts.

Categorical data typically falls into nominal and ordinal data.

Now for time series data,

we can look at the European Parliament election turnout.

So we can plot how many people turned out for the election in 1979,

1984, 1999 et cetera.

And so we can see the parliament turnout has had this downward trend,

and connecting these points with a line

lets us see this trend and pattern, whereas in the previous slide,

having this sort of trend by connecting the dots didn't

make sense because there was no connection between those categorical variables.

Here with the trend over time,

the line chart gives us the perception of change that we want to see.

We can do a similar thing with

the bar graph and notice that when we have separation between the bars,

we may not get such a strong idea of trends,

and with the pie chart, if we try to plot each year as a slice of the pie,

we get no idea of any trends.

It's very difficult to tell.

So typically with this sort of data,

we would not at all want to use a pie chart,

and perhaps a line chart is the best choice, with a bar chart coming in second.
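
The rules of thumb so far can be written down as a small lookup. This is only a sketch of the guidance in this lecture, not a general rule; the function name `suggest_chart` and the task labels are hypothetical names chosen for this illustration.

```python
def suggest_chart(task):
    """Return chart types in rough order of preference for a given task.

    The task labels are hypothetical; the preferences follow the lecture's
    rules of thumb and are not a complete decision procedure.
    """
    suggestions = {
        "parts_of_whole": ["pie", "bar"],     # e.g. grades in a class
        "compare_groups": ["bar"],            # e.g. members per party
        "change_over_time": ["line", "bar"],  # e.g. turnout per election
    }
    if task not in suggestions:
        raise ValueError(f"unknown task: {task!r}")
    return suggestions[task]

print(suggest_chart("change_over_time"))  # ['line', 'bar']
```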

But there are also plenty of examples of when not to use bar charts.

So in this example,

we're showing car nationality for 1979,

and what this graph is doing is showing where each car type was made.

So this is nominal data;

we also call this categorical.

And the reason I say this is that each element, such as the Datsun 210, is a category of data.

It's the name of a car, and that's why it's nominal.

And the bar goes to where the car was manufactured.

So USA, Germany, France, Sweden, Japan.

So we follow the bar here,

and we see the Datsun is manufactured in Japan.

But notice what we've done with this: because one bar is longer than another,

we might think, well,

the Saab has the longest bar.

So the Saab must be the most important for some reason.

But all we've really done is said that the Saab is manufactured in Sweden.

The order of Sweden and France could be swapped in this case, and if I swapped those,

then a car manufactured in France would wind up having the longest bar.

So we're implying things about the data that don't actually exist in it, meaning that

a bar chart doesn't make sense when we're plotting categorical data by categorical data.

We have a nominal data value of USA,

a nominal data value of the AMC Pacer.

And if we plot these as a bar,

I could rearrange either axis and it would make no difference.

So instead we use a bar chart primarily when we have

some sort of value for a data category.

So, if we knew the number of cars manufactured in the USA,

that could be a reasonable quantity to plot,

but we don't want to just put categorical variables on both axes for a bar chart.

Otherwise, we wind up with this weird perceptual effect in the design.
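
One way to get a value per category is to count rows. This is a minimal sketch, assuming a small list of (car, country) pairs: the Datsun 210 and Saab pairs come from the lecture, while "Car W" through "Car Z" are hypothetical rows added only so the counts differ.

```python
from collections import Counter

# (car, country) pairs: both columns are nominal.
# "Car W" .. "Car Z" are hypothetical entries for illustration only.
cars = [
    ("Datsun 210", "Japan"),
    ("Saab", "Sweden"),
    ("Car W", "Japan"),
    ("Car X", "USA"),
    ("Car Y", "USA"),
    ("Car Z", "USA"),
]

# Counting cars per country turns one nominal axis into a quantity,
# so bar length now means something.
cars_per_country = Counter(country for _, country in cars)

for country, n in cars_per_country.most_common():
    print(f"{country:<8}{'#' * n} {n}")
```

Here the bar length encodes a count, so a longer bar really does mean more cars, and reordering the countries does not change what any bar says.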

The other key perceptual effect to understand for

bar and line charts is the graph aspect ratio.

So what happens is, depending on how much we

stretch a graph in either the x or y direction,

the perception of trends and patterns is heavily influenced.

Take for example a graph like this,

and I've got a nice highly sloped line.

If I take the same data and I stretch it out really far here,

so let's say this is five,

and this value is five and now my five is way over here and my five is way up here,

I'm going to have a much less steep line,

and people are going to perceive different trends in the data.

So depending on how I have adjusted my width to height aspect ratio of my design,

I can have different densities,

different relative distances and different orientations in the data.
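
The stretching effect can be checked with a little trigonometry. This is a minimal sketch, assuming a single segment from (0, 0) to (5, 5) on axes that both run from 0 to 5; the function name `on_screen_angle` and the plot sizes are hypothetical choices for this illustration.

```python
import math

def on_screen_angle(dx, dy, x_range, y_range, width, height):
    """Angle in degrees of a data segment as drawn in a width-by-height plot."""
    screen_dx = dx / x_range * width    # horizontal extent on screen
    screen_dy = dy / y_range * height   # vertical extent on screen
    return math.degrees(math.atan2(screen_dy, screen_dx))

# Same data segment, two different plot shapes (units are arbitrary).
square = on_screen_angle(5, 5, 5, 5, width=4, height=4)
stretched = on_screen_angle(5, 5, 5, 5, width=8, height=2)

print(round(square, 1))     # 45 degrees in the square plot
print(round(stretched, 1))  # about 14 degrees in the wide, short plot
```

The data have not changed at all, yet the drawn line drops from a steep 45 degrees to a shallow slope once the plot is stretched horizontally.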

And people have worked hard on trying to

define different metrics for designing graph layouts.

Cleveland proposed the idea of banking data to 45 degrees for the best perception.

What I mean by this is imagine again we have our line graph,

and again this could be by year, so this can be year 1,

year 2, year 3, year 4,

and year 5, and this could be again the number of voters that turned out for an election.

And so each point gets connected by a line segment.

And Cleveland would have suggested that we want to try to get the average angle of

each line segment to be banked to approximately plus or minus 45 degrees.
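
The banking idea can be sketched numerically by searching for the plot shape whose average absolute segment angle is 45 degrees. This is a minimal sketch, assuming the axes span exactly the data range, the x values are strictly increasing, and most segments are not flat. The function names and the turnout numbers are hypothetical, and the bisection search is an illustrative choice, not Cleveland's own procedure.

```python
import math

def mean_abs_angle(xs, ys, aspect):
    """Mean absolute on-screen angle (degrees) for a plot with height/width = aspect."""
    rx = max(xs) - min(xs)
    ry = max(ys) - min(ys)
    angles = []
    for i in range(len(xs) - 1):
        dx = (xs[i + 1] - xs[i]) / rx            # fraction of plot width
        dy = (ys[i + 1] - ys[i]) / ry * aspect   # fraction of width, scaled by aspect
        angles.append(abs(math.degrees(math.atan2(dy, dx))))
    return sum(angles) / len(angles)

def bank_to_45(xs, ys, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale; the mean angle grows as the plot gets taller."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_abs_angle(xs, ys, mid) < 45.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical turnout (percent) for years 1 to 5.
years = [1, 2, 3, 4, 5]
turnout = [62, 59, 58, 57, 50]

aspect = bank_to_45(years, turnout)
print(f"height/width = {aspect:.2f}")
print(f"mean angle   = {mean_abs_angle(years, turnout, aspect):.1f} degrees")
```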

And we can think about incorporating this as well with

an arc length-based aspect ratio to try to summarize across each i and j segment.

So this optimization equation here

is work from Pat Hanrahan's group at Stanford, trying to think about how we can

optimize the line banking to improve our perception of line and bar drawings.
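
The equation on the slide is not reproduced in this transcript, so the following is only a sketch under a stated assumption: that the arc length-based criterion picks the aspect ratio that minimizes the length of the drawn curve while the plot area is held constant. The function names, the search range, and the turnout numbers are hypothetical, and this is not the authors' implementation.

```python
import math

def arc_length(xs, ys, aspect):
    """Length of the drawn polyline in a plot of area 1 with height/width = aspect.

    Assumes the axes span exactly the data range and that neither range is zero.
    """
    rx = max(xs) - min(xs)
    ry = max(ys) - min(ys)
    width = 1.0 / math.sqrt(aspect)   # width * height == 1
    height = math.sqrt(aspect)
    total = 0.0
    for i in range(len(xs) - 1):
        dx = (xs[i + 1] - xs[i]) / rx * width
        dy = (ys[i + 1] - ys[i]) / ry * height
        total += math.hypot(dx, dy)
    return total

def min_arc_length_aspect(xs, ys, lo=1e-3, hi=1e3, iters=200):
    """Ternary search over log(aspect), where the arc length is convex."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3
        m2 = b - (b - a) / 3
        if arc_length(xs, ys, math.exp(m1)) < arc_length(xs, ys, math.exp(m2)):
            b = m2
        else:
            a = m1
    return math.exp((a + b) / 2)

# Hypothetical turnout (percent) for years 1 to 5.
years = [1, 2, 3, 4, 5]
turnout = [62, 59, 58, 57, 50]

best = min_arc_length_aspect(years, turnout)
print(f"height/width = {best:.2f}")
```

For a single straight diagonal line this search returns a square plot, which matches the 45 degree intuition from banking.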

And here you can see different sorts of data sets

where they did the arc length-based aspect ratio perception.

So all of these data sets across my X-Axis here are identical.

This is the same data as used to make this line graph but we see that

there are drastically different pictures depending on how they adjusted the aspect ratio.

And so for each of these, they are trying to show how their arc length method compared to

some of the other methods for banking to

45 degrees or automatically calculating the aspect ratio.

And so again the reason I want to point this out for class is

to help you think clearly about when you're creating your design,

even when you're creating a PowerPoint slide and you're going to put

your pictures on here whether they're bar charts or whether they're a line graph,

you want to think about how wide and how tall you're going to make those and what sort of

story you're going to be presenting with that and how

the aspect ratio can impact how users perceive the data.

And so with this, we've talked about different uses for line charts and bar charts,

different reasons for when we might want to use each of those methods, as well as

how the length and width of the graphical space impacts the user's perceptions.