Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor, Computer Science and Engineering, School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

Data exploration and visualization, geographic analysis, and data classification.

In this module, our goal is to differentiate types of geographic visualizations,

specifically focusing on choropleth maps and how we can create our different map classes.

As we talked about in previous lectures, we used histogram binning to organize data, sort of aggregating different things into different boxes.

We had lectures on data mining,

and we talked about methods for supervised and

unsupervised learning that may result in different classes.

And all these different methods can be applied to creating our choropleth map.

The underlying choices in a choropleth map are really the choice of the color scale,

which we discussed in our previous module,

and then the choice of the classification method being used.

And so in this lecture,

we're going to discuss

different classification methods that are commonly used for creating a choropleth map.

And, really, this is class interval selection.

We're trying to choose the different ranges where we can

bin different geographic regions in order to create our choropleth map.

And there's a lot of choices for optimizing the class interval selection,

and they're highly dependent on the underlying data distribution.

In previous lectures, we talked about data transformations to help us with these things.

And this is extremely similar to those lectures on histogram binning.

And there's a variety of popular choices for class interval selection,

including equal interval selection,

Jenks natural breaks, and minimum boundary error.

And what we're going to see is actually

equal interval selection is pretty much

identical to some of those early histogram methods that we learned.

So first let's come up with our dataset.

So, again, let's say we have a bunch of different counties.

And let's say we want to measure the number of hospitals in a county.

All right. So we've got county one,

two, three, four, five.

And maybe we've got zero hospitals in county one. Maybe we have 10.

Maybe we have 10 again in county 3.

Maybe we have seven.

Maybe we have 15. So we have some sort of dataset.

We have a whole bunch of counties.

And what we're trying to do is figure out which county belongs to which class.

So in histograms, we had to determine the number of bins and the bin width.

And those were sort of our main things we had to figure out.

And that's the same in choropleth map classification too.

And this formula should look very familiar.

So, the high and the low is the range of our data.

So our high is 15, our low is 0.

So we've got 15 minus 0.

And then we have to decide the number of classes.

And this is really the number of colors in a map.

As we talked about, a cartographic rule of thumb is somewhere between five and seven classes,

so if we do five classes,

they're going to have bins of size three.

And what I mean by this is,

so we're going to have five colors.

One, two. One, two, three, four, five.

Okay, so this might be a light red to dark red.

So we're using a sequential color scheme perhaps.

And so, light red,

we'd have zero to three hospitals.

Three to six, six to nine,

nine to 12, and 12 to 15, for example,

could be our sort of range.

And so then we would determine which county fit into which bin.

So we would have county one would be light red.

County two and three would be sort of this higher red.

So we'd have counties two and three here.

County four would fall into this bin.

And county five would fall into this bin.

And notice we have nothing in this second bin here.
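The equal-interval binning worked through above can be sketched in a few lines of Python, using the hypothetical hospital counts from the example (counties one through five with 0, 10, 10, 7, and 15 hospitals):

```python
# Equal-interval classification: bin width = (high - low) / number of classes.
hospitals = [0, 10, 10, 7, 15]   # counties one through five (hypothetical data)
num_classes = 5

low, high = min(hospitals), max(hospitals)
width = (high - low) / num_classes          # (15 - 0) / 5 = 3

def classify(value):
    # Class index 0..num_classes-1; the maximum value lands in the top class.
    return min(int((value - low) / width), num_classes - 1)

classes = [classify(v) for v in hospitals]
print(classes)   # [0, 3, 3, 2, 4] -- note class 1 (the 3-6 bin) stays empty
```

Just as in the drawn example, no county falls into the second bin, which previews the empty-bin disadvantage discussed next.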

And this leads us to some of the advantages and disadvantages we see.

So the advantage is this is easy to compute if you know how many classes you want to have, but the disadvantage is it fails to consider how the data are distributed.

So we can wind up with sometimes empty bins.

We can wind up with a whole lot of elements in one bin.

We can wind up with sort of strange distributions

that are hard for people to perceive the differences between.

Now, what's interesting though is equal interval

classification works really well if the data is normally distributed.

So that means if I look at my counties,

and if this is the number of hospitals from low to high,

if the number of counties follows some sort of trend like this

where I have very few counties with a small number of hospitals,

very few counties with a large number of hospitals

and sort of the number of counties is distributed here.

So the height here is the number of counties that have that set of hospitals.

If it's normally distributed,

equal interval binning works really, really well.

And we talked in previous lectures about how we can do data transformations to try to

transform data into a nice normal distribution.

And so if we can do this sort of transformation,

equal interval may be a nice thing to compute.

But we also talked about methods for grouping

data that didn't require knowing anything about the underlying distribution.

And one method we went through previously was quantiles.

And so again, we need to know how many bins we want,

how many classes we want on our map.

But the number of colors,

the number of classes,

determines the number of quantiles.

So if I want five colors on the map,

I'm going to have five quantiles.

So I calculate the quantiles and we do it just like we

did back in our module on statistical graphics.

The advantage of a quantile is it's relatively easy to compute.

We went through methods to do this by hand.

There are also methods in R. It's nice because the percentage of observations in

each class should be the same and class assignment is based on a rank order.

The disadvantage is it again fails to consider the data distribution; it instead just tries to fit an equal number of samples per bin.

And what it means is,

dissimilar data might wind up being placed into the same class.

So let's make up a new example for this.

So now we've got counties.

And we've got, oh let's say instead of a number of hospitals,

let's do the number of police departments this time.

So we've got county one, two,

three, four, five, six, seven.

And you know, these can be counties in Arizona or these could be counties in Missouri.

These could have names like Maricopa, et cetera.

So don't be fooled by my one through seven here.

Now for police departments,

let's say county one has three.

County two has seven.

Six. Eleven. Nine. Ten. And then,

let's say instead of being 11 here, let's say 22.

All right. So we have one county with a whole lot of them.

And so, the first step in calculating quantiles is to order the data based on value.

All right. So let's put our numbers in order.

So we've got three, six.

And we're ordering this row because this is the data set.

So we've got three, six,

seven, nine, 10, 11, and 22.

So we ordered the data. Let's say we're going to have five classes now.

So we have five quantiles.

So we have seven samples,

and we're going to have five classes.

So we take seven divided by five.

This is one point something.

Okay. So, basically every second bin is going to be a quantile cut-off.

So we're going to wind up with something,

maybe like this would go one, two, three, four.

So again, we don't wind up with enough data elements there to necessarily fill this in.

We want to think about the different grouping,

want to think about the different organization.

So for fun, what I want you to do is actually go

home and calculate these quantiles correctly,

where these bins go,

and think about what elements get grouped together.

Sometimes it may work out that elements like this get grouped together in quantiles.

So this might be Q 1,

Q 2, Q 3.

Let's say we just had four this time,

four map elements in Q 4.

Is this the best grouping?

Does 11 and 22 really make sense to put into the same color,

when 11 is much closer to 10 and 9?

And so this is one of the disadvantages where

dissimilar data can be placed into the same class.
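As a sketch, here is one simple rank-based quantile assignment applied to the police-department counts above. Note that quantile conventions vary (R's quantile() alone supports nine types), so different tools may cut the classes slightly differently; under this particular convention the exact grouping of the top values depends on where the rank cut-offs fall:

```python
# Quantile classification: roughly equal numbers of observations per class,
# assigned by rank order rather than by value.
departments = [3, 7, 6, 11, 9, 10, 22]   # counties one through seven (hypothetical)
num_classes = 5

ordered = sorted(departments)            # [3, 6, 7, 9, 10, 11, 22]
n = len(ordered)

# Assign the i-th ranked value to class floor(i * num_classes / n).
classes = [min(i * num_classes // n, num_classes - 1) for i in range(n)]
print(list(zip(ordered, classes)))
```

Because assignment is purely by rank, a value like 11 can land next to a far-away value like 22 under some conventions, which is exactly the dissimilar-data problem described above.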

So now, try to think of some other problems with quantiles as well,

and some examples where this might be challenging.

So instead of just looking at quantiles or equal intervals,

People have also said, "Why don't we look at some of

the underlying statistical properties?

Why don't we take a moment with the data?"

Remember our joke about

statistics and moments and how we want to take a moment with the data?

And the moments are the mean, the standard deviation, the skewness, and the kurtosis.

And so what we can do is we can calculate the mean of the data and then

break up our bins based on standard deviation.

So let's assume we have five bins.

Right here in the middle of this bin, this might be the mean.

And then, we may want the width of a bin to be two sigma.

So this is the mean plus sigma.

This is the mean minus sigma, right? Then the mean.

Then this bin is one standard deviation.

This might be three sigma then.

This might be five sigma.

This might be negative three sigma.

This might be minus five sigma.

And so this lets us calculate these classes, formed by adding or subtracting some number of standard deviations from the mean, to look at different data representations.

The advantages are if the data is normally or near normally distributed,

this is going to serve as useful dividing points.

So we can think about a divergent color scheme perhaps.

How far above or below the mean you are.

And the legend will contain no gaps.

The disadvantage is it only works well with data that are normally distributed.

But again, remember we talked about ways where we can do

power transformations to maybe potentially get data to be normally distributed.
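A minimal sketch of this mean-standard-deviation classification, with five classes and breaks at the mean ± 1σ and ± 3σ as described, reusing the hypothetical hospital counts as data:

```python
import statistics
import bisect

values = [0, 10, 10, 7, 15]          # hypothetical county data
mean = statistics.mean(values)       # 8.4
sd = statistics.pstdev(values)       # population standard deviation

# Five classes with the middle class centered on the mean and two sigma wide:
# breaks at mean - 3sd, mean - sd, mean + sd, mean + 3sd.
breaks = [mean - 3 * sd, mean - sd, mean + sd, mean + 3 * sd]

# bisect finds which interval each value falls into (class index 0..4).
classes = [bisect.bisect(breaks, v) for v in values]
print(classes)   # the middle class (index 2) spans mean +/- one sigma
```

This pairs naturally with a divergent color scheme, since each class index says how far above or below the mean a region sits.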

So far, we've talked about equal interval,

quantile, mean and standard deviation.

But remember in our example with quantiles,

we talked about how it might be strange to have this 11 next to this 22.

Well, another option is sort of maximum breaks.

The goal is to consider individual data values and group those that are similar.

So let's take another example.

So we've got counties, and let's say we've got the number of microbreweries. All right.

So I have one microbrewery in this county.

One, three.

And then we might have eight, nine.

One, three, eight, nine, 10,

and then 22, let's say.

And here, what we might do is try to

group these based on what we feel are natural chunks by looking at the data.

So 8, 9, and 10 are all really close together so we might consider this a class.

So I might consider this class 1.

Here, we may say 1 and 3 are close enough together,

so this may be another class.

And then 22, so we may decide we just want to make

a map with three different classes from low to high.

And so we're trying to look at adjacent values and

compute the differences between those and the largest distances,

the breaks are used for class divisions.

And instead of just eyeballing it, with maximum breaks we actually compute this by hand.

So we've already ordered the data.

Okay? So we had 1,

3, 8, 9, 10, 22.

And the distance from here to here is 2;

the distance from here to here 5;

from here to here is 1; from here to here is 1;

from here to here is 12.

How did I calculate this?

So what I'm going to do is I take all of my counties,

I order them based on the value I'm interested in to create this ordered data set.

And then I look pair to pair and I subtract the values.

So 3 minus 1 is 2,

8 minus 3 is 5 and so forth.

And now I'm going to put a break.

So I decide how many classes I want.

So if I decided I wanted three classes,

I'm going to pick the largest three differences.

So the number here is 12,

so I'm going to put a break here.

The next largest is 5,

so I'm going to put a break here.

And the next largest is 2,

so I might put a break here even.

So that I'd have one, two, three, four classes.

So I don't want too many breaks.

So I want three classes.

I do 12, 5, and I've got that.

If I want four classes,

I select the next break is 2,

so then that's how my elements break out.

So this might be my light red to my dark red color for example.

And so that's the maximum break method.
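The maximum-break procedure just walked through can be sketched directly, using the microbrewery counts from the example:

```python
# Maximum breaks: place class divisions at the largest gaps between
# adjacent ordered values.
breweries = [1, 3, 8, 9, 10, 22]     # hypothetical counts, already ordered
num_classes = 3

# Pairwise differences between adjacent values, with the gap's position.
gaps = [(breweries[i + 1] - breweries[i], i) for i in range(len(breweries) - 1)]
# gaps: [(2, 0), (5, 1), (1, 2), (1, 3), (12, 4)]

# Keep the num_classes - 1 largest gaps (12 and 5 here) as break points.
cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:num_classes - 1])

classes, start = [], 0
for i in cuts:
    classes.append(breweries[start:i + 1])
    start = i + 1
classes.append(breweries[start:])
print(classes)   # [[1, 3], [8, 9, 10], [22]]
```

Asking for four classes instead would also take the gap of 2, splitting 1 and 3 apart, just as in the lecture.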

Natural breaks are when we examine it by hand and

try to logically determine breaks within the data.

So our goal is to maybe minimize the differences between data values in the same class and maximize the differences between classes.

So we're trying to take into account natural underlying structures.

This is subjective though.

Since the mapmaker is sort of doing this by hand,

different mapmakers may choose other values.

But you can see that a lot of this, our maximum and natural breaks, is maybe an optimization problem.

Here, we're trying to optimize the distance between two different map elements,

and so we can think about maybe even applying some of

our data mining classification methods

to try to split this into a certain number of classes.

And this optimal classification is similar to

natural breaks but we're trying to minimize an objective function.

So we create some measure of error.

We have Jenks-Caspall, Fisher-Jenks for example.

What we're trying to do is maybe calculate the sum of deviations from the class medians, or try to come up with some other optimization criteria.

And this allows us to then create our own formula for this classification.
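As a sketch of the idea (a brute-force stand-in, not the actual Fisher-Jenks dynamic program), we can try every possible set of breaks on a small dataset and keep the one minimizing the within-class sum of squared deviations:

```python
import itertools
import statistics

def within_class_error(groups):
    # Sum of squared deviations of each value from its class mean.
    return sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)

def optimal_classes(values, k):
    # Exhaustively search all ways to cut the ordered values into k
    # contiguous classes; feasible only for small n.
    ordered = sorted(values)
    n = len(ordered)
    best_groups, best_err = None, float("inf")
    for cuts in itertools.combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        groups = [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]
        err = within_class_error(groups)
        if err < best_err:
            best_groups, best_err = groups, err
    return best_groups

print(optimal_classes([1, 3, 8, 9, 10, 22], 3))   # [[1, 3], [8, 9, 10], [22]]
```

On the microbrewery data this objective recovers the same grouping as maximum breaks, but because the criterion is an explicit formula, we can swap in other error measures, which is the point of the optimal methods.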

And so this is the other major class of classification algorithms, and hopefully you're starting to see the connection between histograms, data mining classification methods, and computer optimization for deciding essentially how we can make a choropleth map that has a nice spread between colors and values, allowing us to perceive different regions and areas within the data.

Advantages of optimal classification include that it is a good empirical method for grouping data that can assist in determining the appropriate number of classes, but often it's hard to explain to novice users, and we still may wind up with gaps in the map legend, which we really don't want to occur.

So these are some of the different map options.

And with optimal mapping,

we can also then even think about adding

different spatial constraints and we can even do this with Jenks natural breaks as well.

So what do I mean by adding different spatial constraints?

Well, let's think about a simple map, maybe something that looks like this, where these are different counties that I've shown here.

And let's say that I've created a three-class legend,

light red, red, and dark red.

Okay? So I could make this dark red, light red, mid red, mid red, and mid red.

Now remember, these are my classes.

So I could have some values here like 1,

3, 4, 9, and 10.

And somehow, my classifier has put the 10 into this bin,

4 and 9 in this bin,

and the 1 and 3 here.

But 4 is really close to 3,

and if this is value 4,

if I had made this light red, I'm going to start seeing a spatial cluster of light red on the map.

But if I don't take into account my spatial constraints,

I could wind up with some classifier that gives me this as my result.

Does that necessarily make sense?

Is that the message I'm trying to show on my map?

And so I can set up different optimizations that also look at neighboring values to

try to determine the role that they might play in the underlying classification.

And really, with choropleth maps,

we also need to consider standardizing the statistics.

So, for example, in the alfalfa crop example, I showed the number of crops being harvested, but I didn't really normalize it by the number of acres.

But the number of acres out West in a county like this,

we have way more acres available.

It's a much bigger county than say a tiny county in Rhode Island.

So should I have been dividing this by population or by area?

It depends on what I'm trying to show.

It depends on if I'm just trying to show magnitude,

if I'm trying to show a percentage so I can say what area of

the country has a higher percentage of alfalfa crop by area.
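A tiny illustration of why the denominator matters, with made-up numbers for two hypothetical counties:

```python
# Raw counts vs. rates: the big county "wins" on magnitude but loses on
# percentage (all names and numbers hypothetical).
harvested_acres = {"big_western_county": 50_000, "small_eastern_county": 4_000}
total_acres     = {"big_western_county": 5_000_000, "small_eastern_county": 40_000}

# Percentage of each county's land harvested.
pct_harvested = {c: 100 * harvested_acres[c] / total_acres[c]
                 for c in harvested_acres}
print(pct_harvested)
# {'big_western_county': 1.0, 'small_eastern_county': 10.0}
```

Mapped by raw acres the big western county dominates, but mapped by percentage the small eastern county does, so the choice of denominator changes the map's message.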

Here, I'm looking at the percent population residing in urbanized areas.

So if I only showed the population,

then regions like Chicago,

New York, and LA can become dominant.

But here, if I show the percent of the population residing in urbanized areas,

then we can start seeing other sorts of trends and regions on the map,

other critical areas where we've seen high population density in urban areas.

And we also need to consider what we call the modifiable areal unit problem which is

a source of statistical bias occurring when data is aggregated into districts.

So we need to be really careful about thinking about what we're showing on the map,

should we have any underlying denominator,

and what sort of message we're trying to show with our map regarding different datasets.

We also have the ecological fallacy to contend with,

where inferences about individuals are based solely upon

aggregate statistics collected from the group to which those individuals belong.

So the ecological fallacy is assuming that individual members of

a group have the average characteristics of the group at large.

So when we're showing things like average values,

we may be misrepresenting what's going on in the underlying data.

And group characteristics do not necessarily apply to individuals within that group.

Imagine that we have some data distribution like this and I've interviewed the group

and part of the group is very poor and part of the group is very

wealthy and I'm showing median household income.

Well, the median household income in that area would look like this.

It would be somewhere in the middle between those two groups, in the mid-average region.

This area might be very interesting because of this split in distributions.

And so the group characteristics do not necessarily apply to individuals.
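A quick numerical version of this split distribution, with hypothetical incomes:

```python
import statistics

# Half the households are poor, half are wealthy (hypothetical values).
incomes = [20_000] * 5 + [200_000] * 5

median = statistics.median(incomes)
print(median)   # 110000.0 -- an income no household in the area actually has
```

The summary statistic sits between the two clusters and describes no individual household, which is the ecological fallacy in miniature.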

So we have to be really careful in thinking about how we create these choropleth maps.

What are our color scheme choices?

Do we want to show some difference from zero?

So in this example, we have our divergent color scheme.

Should it be sequential?

Should it be qualitative?

And what sort of classification do we want to use?

Here, we talked about equal interval, quantile, maximum breaks, and optimal breaks, and we can even think of other classification methods we've learned from data mining to apply to these as well.

And there are tons of different map classifications, and work on which classifications are easiest for users to perceive.

And a lot of times, we're able to use some of

the simpler classifications like equal interval if we have

some underlying normal data distribution

that we may be able to get by transforming the data.

So hopefully, from this module,

you're able to understand how we can apply some of the techniques we've been