This course answers two questions: What is data visualization, and what is the power of visualization? It also introduces core concepts such as dataset elements, data warehouses and exploratory querying, and combinations of visual variables for graphic usefulness, as well as the types of statistical graphs, which are essential tools for exploratory data analysis.

From this lesson

Exploratory Querying and Visual Variables Used in Data Exploration and Visualization

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

Introduction to Data Exploration and Visualization, Visual Variables.

In this module, our goal is to define

the properties of what Bertin called visual variables.

Our goal is to define how to map data in a data set to a particular visual variable,

and describe what choices we actually have for these visual variables,

and what they mean in terms of drawing a mark on a piece of paper.

So first, we need to think about a data set.

Inside of a data set,

we have all of our variables.

So, we have things like your name,

your favorite coffee cup size,

your height, your GPA for example.

And each of these data elements,

each of these variables in your data set can be classified into one of four data types.

Nominal is data whose categories have no implied ordering.

So, things like color, like green, orange, blue, yellow,

or things like names,

and names have no implied ordering.

Then ordinal data, data that has a specific order,

but no specified distance metric.

So, coffee cup size,

the distance between Tall and Grande.

If we don't know the number of ounces per cup,

we just know that one's bigger than the other,

we can order them that way.

Interval data is data that has measurable distances like temperature.

And then ratio is the same as interval,

but includes a zero point.

And a better way to think about the difference between interval and ratio

is really this is additive and this is multiplicative.

What I mean by this is if I take one degree Celsius plus two degrees Celsius,

we know we get three degrees.

Now, is three degrees Celsius three times warmer?

Do you feel three times warmer at three degrees

Celsius than at one degree Celsius? Probably not.

So, we don't have the sort of multiplicative property there.

With ratio, is two inches twice as long as one inch?

Is three inches three times as long as one inch?

Is six inches three times as long as two inches?

So, that's where we think about ratio data.

Those are our data types,

and depending on the data columns we have such as name,

coffee cup size, height etc.,

we can map one of those to these data types.
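The four data types can be written down as a small lookup, which makes the additive versus multiplicative distinction concrete. This is only a minimal sketch: the names `Scale`, `ALLOWED`, `COLUMNS`, and `supports` are hypothetical, and the column-to-type assignment is an assumption based on the examples in the lecture.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"    # categories with no implied ordering
    ORDINAL = "ordinal"    # ordered, but no distance metric
    INTERVAL = "interval"  # measurable distances (additive)
    RATIO = "ratio"        # interval plus a zero point (multiplicative)


# Which comparisons are meaningful on each scale.
ALLOWED = {
    Scale.NOMINAL: {"equal"},
    Scale.ORDINAL: {"equal", "order"},
    Scale.INTERVAL: {"equal", "order", "difference"},
    Scale.RATIO: {"equal", "order", "difference", "ratio"},
}

# Hypothetical column names, assigned following the lecture's examples.
COLUMNS = {
    "name": Scale.NOMINAL,
    "coffee_cup_size": Scale.ORDINAL,
    "temperature_c": Scale.INTERVAL,
    "height_in": Scale.RATIO,
}


def supports(column: str, operation: str) -> bool:
    """Return True if `operation` is meaningful for the column's scale."""
    return operation in ALLOWED[COLUMNS[column]]


print(supports("temperature_c", "difference"))  # adding degrees is fine
print(supports("temperature_c", "ratio"))       # "three times warmer" is not
print(supports("height_in", "ratio"))           # "twice as long" is fine
```

Each scale allows everything the previous one does plus one more comparison, which is why ratio data gives us the most freedom later when we pick a visual variable.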

And then we want to think about, well,

how can we start creating a picture or visualization of this data?

And so, we have the visualization pipeline, where essentially we take our raw data.

So, this is our data table.

So, we had name,

coffee cup size, etc.

This is our raw data.

We may perform some data analysis,

and part of this course is going to be data analytics,

learning different statistics, data mining

techniques for preparing data for visualization.

Now, there may be more things than just

preparing the data such as clustering and grouping.

We may need to smooth data, interpolate,

transform data, but eventually,

we do this and we wind up with prepared data.

So, we take our data set,

and we do some sort of maybe transformation to it.

Then we may also filter the data, we may say,

okay, well now, let's just try to pick things that are important.

I may not be able to show all the data.

So now, we have a subset of the data called focused data.

And now, the key step that we're going to be talking about today is really mapping the data.

Mapping the data to geometric primitives and their attributes.

So shape, size, color,

position on an axis.

Once we map each attribute to a primitive then we can render our scene,

and we're left with perhaps some sort of image,

maybe a scatter plot,

maybe a bar chart,

maybe a pie chart.

And we're going to talk about how to pick our different choices,

and how these choices will let us create a multitude of different visualizations.
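One way to picture the pipeline stages (raw data, prepared data, focused data, mapped geometric primitives) is as a chain of small functions. The sketch below is an illustration under stated assumptions, not the course's implementation: the table contents, the `Mark` class, and the function names `prepare`, `focus`, and `map_to_marks` are all made up for this example.

```python
from dataclasses import dataclass

# Hypothetical raw data table; names and values are made up.
RAW = [
    {"name": "Mary", "cup": "Tall", "height_in": 64.0},
    {"name": "Ross", "cup": "Grande", "height_in": 72.0},
    {"name": "Bob", "cup": "Tall", "height_in": None},
    {"name": "Ann", "cup": "Grande", "height_in": 58.0},
]

# Ordinal data: we only know the order, not the distance between sizes.
CUP_ORDER = {"Tall": 0, "Grande": 1}


@dataclass
class Mark:
    label: str     # the row this mark stands for
    x: int         # position: rank of the cup size
    length: float  # size: height in inches


def prepare(rows):
    """Data analysis step: drop incomplete rows and derive a cup-size rank."""
    return [
        {**r, "cup_rank": CUP_ORDER[r["cup"]]}
        for r in rows
        if r["height_in"] is not None
    ]


def focus(rows, min_height):
    """Filtering step: keep only the subset we want to show."""
    return [r for r in rows if r["height_in"] >= min_height]


def map_to_marks(rows):
    """Mapping step: assign each attribute to a geometric primitive attribute."""
    return [Mark(r["name"], r["cup_rank"], r["height_in"]) for r in rows]


marks = map_to_marks(focus(prepare(RAW), min_height=60.0))
for m in marks:
    print(m)
```

Rendering is left out on purpose. Once the marks exist, drawing them as a bar chart or a scatter plot is a separate choice.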

And again, we're trying to map data to an aesthetic attribute or a visual attribute.

In visualization, we often focus on things like the form of the mark, or the surface.

If we're doing things in a computer,

we can even make motion,

so we can have things move,

we can map them to text,

or we could even map things to sound if we want.

So, we could have sonification,

we could have haptics, and other things.

What we're mostly interested in visualization is really the visual aesthetic attributes.

That doesn't mean that we can't have others,

that we're mapping things to.

And now, the aesthetic attributes,

these need to be capable of representing both continuous and categorical variables.

For a continuous variable, an attribute needs

to vary primarily on only one psychophysical dimension.

So, we're not trying to perceive combinations of this.

If we have multiple attributes and multidimensional attributes,

we want to scale them on a single dimension.

Now, the problem is,

humans don't necessarily perceive things in a linear fashion,

so different perceptual abilities may not have a linear scale.

So, we have to be careful about our visual variable choices for different data.

And so, a lot of the skill in graphic design and visualization design is really

knowing what combination of visual attributes that we choose should be avoided,

as much as choosing good choices.

And so, in today's lecture,

we're going to talk about what our different visual variables are,

and how we can try to pick those.

And Bertin's visual variables are concerned primarily

with a mapping to things that we can draw on the paper.

So, he has seven visual variables that we're going to focus on.

Position on a common axis.

So, we have a common scale and we can pick a position there.

And position, we really think of actually as two visual variables

because we can have position on the X and position on the Y.

So, we wind up with our eight visual variables.

Size, the size of the mark.

Value, by which he means different shades of the same color.

Color, he means actual different hues.

We can have a texture.

We can have an orientation and we can have a shape as well.

So, let's start with just defining position.

Position is just a location in a multi-dimensional space.

So, this is often good for continuous variables

that map to densely distributed locations.

So, if I have GPA for example,

and I want to map this along an axis,

and I've got a thousand students in our class,

we can have a whole bunch of people.

Then, these little marks are people in the class,

and they may be distributed in all sorts of

different ways and each mark represents their GPA.

Now, categorical variables, instead of mapping sort of anywhere along a scale,

we're going to wind up with a lattice.

So, this is a person's name in class.

Yeah, this is the name in class.

We have name one, name two, name three.

And remember some people can share the same name.

So if Mary is a common name,

we may have several people with the name Mary.

We may have just one person with the name Ross.

We may have a lot of people with the last name Smith or something.

And so, notice they form this lattice marking together.
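The difference between the two kinds of position mapping can be sketched in a few lines. This is a hedged illustration: `linear_scale` and `band_scale` are hypothetical helper names, and the pixel ranges and GPA values are made up. A continuous variable can land anywhere along the axis, while a categorical variable falls onto a fixed set of slots, which is what produces the lattice.

```python
def linear_scale(value, domain, pixel_range):
    """Map a continuous value to a position anywhere along an axis."""
    d0, d1 = domain
    r0, r1 = pixel_range
    t = (value - d0) / (d1 - d0)
    return r0 + t * (r1 - r0)


def band_scale(categories, pixel_range):
    """Map each distinct category to the center of an evenly spaced slot.

    The alphabetical sort is arbitrary: names are nominal data, so the
    ordering of the slots carries no meaning.
    """
    r0, r1 = pixel_range
    distinct = sorted(set(categories))
    step = (r1 - r0) / len(distinct)
    return {c: r0 + step * (i + 0.5) for i, c in enumerate(distinct)}


# Continuous: GPA on a 0-4 scale mapped to a 400-pixel axis.
gpas = [2.0, 3.0, 4.0]
xs = [linear_scale(g, (0.0, 4.0), (0, 400)) for g in gpas]
print(xs)

# Categorical: people who share a name share a slot.
names = ["Mary", "Ross", "Mary", "Smith"]
slots = band_scale(names, (0, 300))
print([slots[n] for n in names])
```

Both people named Mary get the same position, so the marks stack up in columns rather than spreading along the axis.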

Now, with categorical variables,

ordering may or may not have meaning in terms of what's being measured.

Remember with names, they have no implied ordering.

And so with position,

we have to be careful about some of those things.

So often, continuous variables are represented by position on an axis.

It's the best way to represent quantitative data visually.

And points or line lengths placed adjacent to

a common axis enable judgments with the least bias or error.

So, if I have a scale,

and I'm just placing things on here, that's fine,

but if I can add tick marks with information like zero, one, two, three,

I can help people make judgments about what value this variable has.

We can also do this with size by varying

the length of a particular element or even the area.

Or, in three dimensions,

we can include volume.

So again, we can expand our visual variables.

Area and volume, though,

are often among the worst attributes to use for graphing data.

However, length, if people are just comparing length,

they are quite good at that.

So here, we have different groupings of our data,

and different lengths representing different values,

and so we could have had something like,

oh, let's think about sales at Walmart.

So, let's think about cereal brands.

So, we have cereal one.

Cereal two, cereal three.

And this is going to be the amount of dollars sold for each one.

And we may even break this down further by,

we've got four variables here,

so we could have quarter one,

quarter two, quarter three, and quarter four.

All right. So, let's try this again.

So, we've got cereal one.

We've got quarter one, two, three, four.

All right. And so,

I have the amount of money sold per quarter.

And so, this is cereal one.

In the first quarter, cereal one sold more than cereal two, then cereal three.

In the second quarter, cereal two sold the most.

In the third quarter, cereal one sold the most.

In the fourth quarter, cereal three sold the most.

And so, here I've mapped color to the type of cereal.

So, cereal one was blue.

Cereal two was orange. Cereal three was green.

And then length to the dollar sales amount or quantitative data variable.

And so, in this way we can think about how we can design

our different visualizations by choosing our visual variable.
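The cereal example can be sketched as an explicit encoding: brand goes to color, dollar sales go to length, and quarter goes to the grouping along the axis. The sales figures below are made up; they are chosen only so that the ordering within each quarter matches what was described. The function name `encode` and the text rendering with `#` characters are assumptions for illustration.

```python
# Hypothetical dollar sales per quarter (Q1..Q4); values are invented.
SALES = {
    "cereal 1": [50, 30, 60, 20],
    "cereal 2": [40, 55, 35, 30],
    "cereal 3": [25, 20, 30, 45],
}

# Nominal variable (brand) mapped to color, as in the lecture.
COLORS = {"cereal 1": "blue", "cereal 2": "orange", "cereal 3": "green"}


def encode(sales, colors, max_length=40):
    """Build one bar per (quarter, brand): color from brand, length from sales."""
    top = max(v for series in sales.values() for v in series)
    n_quarters = len(next(iter(sales.values())))
    bars = []
    for q in range(n_quarters):
        for brand, series in sales.items():
            bars.append({
                "quarter": f"Q{q + 1}",
                "brand": brand,
                "color": colors[brand],
                # Quantitative variable mapped to length on a common scale.
                "length": round(series[q] / top * max_length),
            })
    return bars


bars = encode(SALES, COLORS)
for b in bars:
    print(f'{b["quarter"]} {b["brand"]:<9} {b["color"]:<7} ' + "#" * b["length"])
```

Because every bar shares the same scale, comparing lengths within a quarter or across quarters is the kind of judgment people make well.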

And size for lines is usually equivalent to things like thickness.

So, if I'm using lines as a mark,

I may just change the thickness.

The size can also be used to great effect with path as well.

So, trying to think about things like Charles Minard's flow map for example.

Now, if we have objects with rotational symmetry,

so what I mean by size is really the size of our object,

the size of our mark.

Here, we looked at length,

so my mark was a box.

And I modified length for size.

I could have also modified width for size.

And I can choose different visual variables as well,

so I can do things like circles.

So, for objects with rotational symmetry,

we often map size to diameter rather than area.

So, notice this circle.

The diameter here is what we're comparing against.

And the real trick here is,

if I ask you whether this one has 2x the area of this guy,

people are really bad at judging area.

But we're pretty good at judging lengths.

So, by comparing diameter,

we're able to get around some of those problems.

We can think about how to choose our different marks for visual representation.

And if we map data to area or volume,

we may want to use positively skewed data that can

benefit from the perceptual equivalent of a square root transformation.

Why is this a square root transformation?

Well, the area of a circle, for example,

is equal to pi R squared, so area

is a squared function of the radius.

And by positively skewed data,

we mean data that has a distribution that looks like this,

where there are not very many data elements here and there are more on this side.

So, we have some positive skewness,

and we can transform the data.
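The relationship between area and radius can be checked with a little arithmetic. The sketch below compares two ways of sizing a circle from a data value; the function names `radius_for_area` and `radius_for_diameter` and the `scale` parameter are hypothetical. It only illustrates that area goes with the square of the radius, so mapping a value to area means the radius follows the square root of the value.

```python
import math


def circle_area(radius):
    """Area of a circle: pi * r^2."""
    return math.pi * radius * radius


def radius_for_area(value, scale=1.0):
    """Radius chosen so that the circle's AREA is proportional to value."""
    return math.sqrt(scale * value / math.pi)


def radius_for_diameter(value, scale=1.0):
    """Radius chosen so that the circle's DIAMETER is proportional to value."""
    return scale * value / 2.0


# Doubling the data value under each mapping.
r1, r2 = radius_for_area(1.0), radius_for_area(2.0)
print(r2 / r1)                            # radius grows by sqrt(2)
print(circle_area(r2) / circle_area(r1))  # area doubles

d1, d2 = radius_for_diameter(1.0), radius_for_diameter(2.0)
print(d2 / d1)                            # diameter (and radius) doubles
print(circle_area(d2) / circle_area(d1))  # area quadruples
```

So with the diameter mapping, the comparison people make is a length comparison, while with the area mapping large values are compressed in the way a square root transformation would compress them.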

And now, we talked about size.

And size inherently also can go along with shape.

And the shape is the boundary of our mark.

So, shape can vary without affecting size,

rotation, and other attributes.

So for example, symbols on a map.

So, trying to think about how we can modify these,

how we can quickly compare shapes and different transformations?

And then, rotation is, we have our primitive.

So, if we picked a box for our primitive,

then the rotation is how rotated that is.

And so, this doesn't work for all shapes though, right?

If I have a circle and I rotate it,

I can't tell that it's been rotated.

So, things like lines, areas,

and surfaces can only rotate if they're positionally unconstrained.

Next, we have color.

And color is one of the more complicated visual variables.

We can have all sorts of different color maps from rainbow to sequential,

to grayscale, to divergent.

And the choice of color drastically depends on the underlying data.

We're going to have a whole module devoted

specifically to color choices and color scales.

But, we can map multiple variables to color.

So for example, we can even think about mapping two variables to color.

This light color means that we have a low variable one and a low variable two.

What I mean by this is, let's again go back to a data set example.

So, let's say I've got quiz one, quiz two.

We've had two quizzes in class and we have our students, so we have the name.

Let's say, Bob got a one on quiz one and quiz two, so Bob.

And then, let's say Ross got a high score.

Maybe the quizzes are out of 10.

Ross got a 10 out of 10 on quiz one and quiz two.

So, since Bob got low scores on those,

we will give Bob a color in this quadrant.

Since Ross got high scores,

Ross would have a color in this quadrant.

We can have another person, F, and they may have had five and five,

so they have a color here.

Now, what does the color here represent?

This means a high variable one,

which was quiz one, but a low quiz two.

So, that person might have had something like a 10 and a one.

Here for example, so now,

think about what it means to be in this quadrant,

this quadrant, and so forth.
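A two-variable color map like this can be sketched as a blend between four corner colors. Everything specific here is an assumption: the RGB values in `CORNERS` are invented, the function name `bivariate_color` is hypothetical, and scores are assumed to run from 1 to 10. The only thing taken from the lecture is that the low/low corner is the light color.

```python
# Hypothetical corner colors as (R, G, B); keys are (quiz one, quiz two).
CORNERS = {
    ("low", "low"): (232, 232, 232),   # light: low on both quizzes
    ("high", "low"): (100, 172, 190),  # high quiz one, low quiz two
    ("low", "high"): (200, 90, 90),    # low quiz one, high quiz two
    ("high", "high"): (60, 40, 70),    # dark: high on both quizzes
}


def lerp(a, b, t):
    """Linearly interpolate between two RGB colors, t in [0, 1]."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


def bivariate_color(quiz1, quiz2, lo=1, hi=10):
    """Blend the four corner colors according to both scores."""
    u = (quiz1 - lo) / (hi - lo)  # how far along the quiz one axis
    v = (quiz2 - lo) / (hi - lo)  # how far along the quiz two axis
    bottom = lerp(CORNERS[("low", "low")], CORNERS[("high", "low")], u)
    top = lerp(CORNERS[("low", "high")], CORNERS[("high", "high")], u)
    return lerp(bottom, top, v)


print("Bob  (1, 1):  ", bivariate_color(1, 1))
print("Ross (10, 10):", bivariate_color(10, 10))
print("F    (5, 5):  ", bivariate_color(5, 5))
print("     (10, 1): ", bivariate_color(10, 1))
```

Bob lands exactly on the light corner, Ross on the dark one, and a score of 10 and 1 lands on the corner for high quiz one and low quiz two. Someone in the middle gets a blend of all four.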

Now along with color,

we can also have texture.

So, we can basically create a pattern of granularity of different elements,

different forms where we repeat this pattern per unit of area.

So for example, a bunch of Rs for example.

And then, the orientation of the pattern,

the granularity of the pattern can all represent different things.

The trick is that texture alone can be a basis for perception.

So how do we modify the patterns?

Here, we have a different orientation of the R and a different density.

Here, we again have a different density.

Here, we have more different orientations.

And these patterns and orientations can also be mapped to a different visual variable.

And textures can be described in a variety of ways.

We can use different Fourier transforms to decompose a grid

of brightness values into different trigonometric components,

or we can have autocorrelograms,

where we characterize the spatial moments of a texture.

And so, there are lots of complicated ways to try to create these textures,

to create different frequencies and patterns to

help people perceive and understand those.
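To make the Fourier idea a bit more concrete, here is a deliberately simplified sketch. Instead of a full two-dimensional grid of brightness values, it takes a single row of a striped texture and decomposes it with a plain discrete Fourier transform written out by hand. The function name `dft_magnitudes` and the stripe pattern are assumptions for illustration; a real texture analysis would work in two dimensions and use a fast transform.

```python
import cmath


def dft_magnitudes(signal):
    """Magnitude of each frequency component of a 1-D brightness signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


# One row of a striped texture: two bright pixels, two dark, repeated.
# The stripe period is 4 pixels, so 16 pixels hold 4 full cycles.
row = [1, 1, 0, 0] * 4

mags = dft_magnitudes(row)

# Skip k = 0 (the average brightness) and look for the strongest frequency.
dominant = max(range(1, len(row) // 2 + 1), key=lambda k: mags[k])
print("dominant frequency:", dominant, "cycles per row")
print("stripe period:", len(row) // dominant, "pixels")
```

The strongest component sits at four cycles per row, which matches the granularity of the stripes. A coarser or finer pattern would move that peak to a lower or higher frequency.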

And so, again, here I just want to take time

to wrap this up to show the different visual variables we might have.

So again, with size we can think of sizes of marks. If a mark is

a circle, we can change the area or diameter.

If the mark is a line,

the size might represent thickness.

If it's a box, it may represent width and height.

We can have a surface or a solid.

For shapes, we can have different circles, squares, triangles.

For a line, different shape may mean dashes between them.

For area, different polygons.

We have surfaces for shape.

We often don't think too much about solids, although it's possible.

With the rotation, we have rotations unconstrained for points.

For lines, we have sort of an unconstrained band, area, surfaces, solids.

Again, so three of our variables are size, shape, and rotation.

We also get brightness, hue,

and saturation, so three different types of color.

And then, we get texture,

or we get granularity of patterns and orientation.

And so, these visual variables form

the basis for all of the visualizations we can create.

Again, we have our data set.

So, we have things like name,

quiz one, quiz two.

We want to think about how we want to create our visualization.

Do I want to map name to color?

Do I want to map quiz one to length?

Do I want to map this to size?

How do I want to create my visualization,

and what visual variables are my choices?

We can also try other things that we haven't talked about like

blurring elements or making things more transparent.

And so, as we combine these together,

we get more and more complex perceptual phenomena that occur.

And we want to think about what the combination of these are,