Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor in the School of Computing, Informatics & Decision Systems Engineering at Arizona State University, and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor of Computer Science and Engineering, School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

In this module, we're going to talk about what we call spatial scan statistics.

So, in previous modules,

we talked about spatial auto-correlation for identifying clustering,

and whether the regions might be auto-correlated.

But we also want to be able to determine whether spatial events within a data set could have occurred in this pattern purely by chance.

And so spatial scan statistics are a method for determining whether the measured events occurred in a given geographic distribution by accident or not.

What I mean by that is really,

how likely was it that this pattern could have occurred at random in nature?

So what I mean by this is,

imagine here that I have a map on my screen.

So I've got roads,

and counties, and buildings,

and things, and each point here is a measurement.

It could be a measurement by county,

but in this case, let's think of this as illness.

So some people got the flu and went to the doctor.

Some people thought they got the flu and went to the doctor,

but they didn't have it; they just had a bad cold.

So the triangles are people with the flu,

for example, and the control

are other people that the doctor saw that didn't have the flu.

So this is just our population of people that went to the doctor.

And each point is plotted at the home address where that person lives.

And what I want to determine is,

whether this cluster of cases could have occurred by chance or not.

How likely is this pattern to have occurred in reality?

We could think about this as crime.

What if I have a crime spree?

And how do we determine whether this crime spree is occurring by chance or not?

And scan statistics are going to essentially,

for each point, draw a window,

bigger and bigger every time,

capturing all the possible windows in the data set,

and see if the patterns of

distribution of points in those windows could have occurred by chance or not.

And so, here we start with just one of our points, one of our cases.

We don't do this for the controls.

We only care about how the cases are distributed.

So, for each case, I draw a window,

the circle window, that gets bigger and bigger.

And for each circle, for each window,

I compute this likelihood function.

For the Bernoulli model, the likelihood function is defined in terms of the number of cases and the total population, both inside and outside the window.

So this is the cases in the window.

So if it's little c,

it's cases in the window.

If it's big C, it's cases total in the world.

And if it's little n, this is the population in the window,

and if it's big N, this is the population in the world.

So let's walk through this example quickly.

So in my first circle,

my little c is three,

because I have three triangles in there.

And let's say my little n is four because I have four things in there.

I've got one square and three triangles,

and then I can count everything else.

So I've got one, two, three,

four, five, six, seven, eight,

nine, 10, 11, 12, 13,

14, 15, 16, 17, 18.

So big N is 18.

And Big C is one,

two, three, four, five, six, seven.

So big C is seven.

Seven cases, three within this window,

a population of four in this window,

and a total population 18.

So I can fill in my Bernoulli likelihood function,

and I get a value, and that gives me my likelihood value L_0.

And this L_0 is the likelihood of this configuration occurring under the Bernoulli distribution.
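As a sketch of that computation, here is one common form of the Bernoulli likelihood used in spatial scan statistics (Kulldorff's formulation, working in log space); the exact expression on the lecture slide may differ, so treat this as an illustration using the lecture's numbers (little c = 3, little n = 4, big C = 7, big N = 18):

```python
import math

def bernoulli_loglik(c, n, C, N):
    """Log-likelihood of a window with c cases among n points,
    given C cases among N points overall (Kulldorff's Bernoulli model)."""
    def term(k, m):
        # k * log(k/m), with the convention 0 * log(0) = 0
        return k * math.log(k / m) if k > 0 else 0.0
    inside = term(c, n) + term(n - c, n)                            # inside the window
    outside = term(C - c, N - n) + term((N - n) - (C - c), N - n)   # outside the window
    return inside + outside

# Lecture example: 3 cases among 4 points in the window,
# 7 cases among 18 points overall
L0 = math.exp(bernoulli_loglik(3, 4, 7, 18))
```

On its own, L_0 is not interpretable; that's exactly why the next step compares it against likelihoods computed from random redistributions of the data.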

And then what I do is,

I take all the data, and I randomly redistribute it.

So now you'll notice my cases and controls have moved around.

So I got my original distribution,

and I take all the data,

and I randomly throw it back on the screen,

and I keep my window in the same spot though.

And now for that same window,

I calculate a new likelihood function for all the data that fell inside it.

And I take my data and throw it back on the screen again,

my window didn't move, and I calculate a new likelihood function.

And I do this a whole bunch of times.

And then what I do is, I sort the list from low to high,

and I see where my original likelihood value fell.

That sorted position in the list, divided by the length of the list,

gives me the probability of this occurring by chance or not.

And so, by doing this, I can find a p-value of how likely this distribution,

my original distribution was, compared to chance.
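The whole redistribute-and-rank procedure can be sketched in a few lines. This is a minimal illustration, not SaTScan's implementation: it fixes one window, reshuffles the case/control labels, and ranks the observed likelihood among the permuted ones. The toy labels, window indices, and the one-sided restriction to elevated-rate windows are all assumptions made for the example:

```python
import math
import random

def loglik(c, n, C, N):
    # Kulldorff-style Bernoulli log-likelihood (0 * log 0 treated as 0)
    t = lambda k, m: k * math.log(k / m) if k > 0 else 0.0
    return t(c, n) + t(n - c, n) + t(C - c, N - n) + t((N - n) - (C - c), N - n)

def monte_carlo_pvalue(labels, window, num_perms=999, seed=7):
    """labels: 1 = case, 0 = control, one entry per point.
    window: indices of the points inside the fixed scanning window."""
    rng = random.Random(seed)
    N, C, n = len(labels), sum(labels), len(window)

    def stat(lab):
        c = sum(lab[i] for i in window)
        # one-sided scan: only windows with an elevated case rate count
        if c / n <= C / N:
            return float("-inf")
        return loglik(c, n, C, N)

    observed = stat(labels)
    lab = labels[:]
    exceed = 0
    for _ in range(num_perms):
        rng.shuffle(lab)           # randomly redistribute cases and controls
        if stat(lab) >= observed:
            exceed += 1
    # rank of the observed value among all num_perms + 1 statistics
    return (exceed + 1) / (num_perms + 1)

# Toy data echoing the lecture: 18 points, 7 cases; the window holds
# points 0-3, of which three are cases
labels = [1, 1, 1, 0] + [1] * 4 + [0] * 10
p = monte_carlo_pvalue(labels, window=[0, 1, 2, 3], num_perms=999)
```

For this toy layout the p-value lands somewhere well above 0.05: with 7 cases among only 18 points, finding three cases in a four-point window is not especially surprising.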

The problem is, this can be really expensive to compute.

I have to randomly redistribute all of my data over and over.

The nice thing is, every calculation is completely independent.

So if I draw all my windows at once,

and randomly redistribute all of my values,

I can calculate these in parallel really quickly.
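Because every replication is independent, they parallelize trivially. Here is a minimal sketch using Python's standard `concurrent.futures` pool (thread-based just to keep the example self-contained; a process pool would sidestep the GIL for heavy numeric work). The per-replication statistic is a simplified stand-in for the full likelihood calculation:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def one_replication(seed):
    """One independent replication: reshuffle 7 cases among 18 points
    and count how many land in a fixed 4-point window."""
    rng = random.Random(seed)
    labels = [1] * 7 + [0] * 11
    rng.shuffle(labels)
    return sum(labels[:4])   # the window is the first four positions

# each replication gets its own seed, so they can run in any order
with ThreadPoolExecutor(max_workers=4) as pool:
    stats = list(pool.map(one_replication, range(999)))
```

Giving each replication its own seed is what makes the work order-independent, which is the property that lets SaTScan-style software farm the replications out across cores.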

You can actually get software to do this,

called SaTScan developed by Martin Kulldorff at Harvard,

and it'll run this analysis for you, taking advantage of the parallel processing on your computer.

And basically, we're keeping the window in the same location,

randomly redistributing the cases and controls,

we calculate the likelihood function for a new distribution,

and we repeat a whole bunch of times.

Typically, 999 or 9,999 times, or some similar all-nines number.

The reason we want a number like 9,999 is because we have our initial case, L_0,

plus the 9,999 random redistributions, and remember, we divide by the length of the list.

So that gives us a nice 10,000 to divide by on the bottom.

Similar work, the Geographical Analysis Machine, uses a similar process as well, if you're interested in looking at that.

And we can also extend these scan statistics not only over space,

but we can extend this over time.

So imagine that I have my Space-Time cube,

and I have my points per day.

I can draw my cylinder window,

and I can do a similar extension for over time as well.
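As a concrete picture of that cylinder window: it is just a spatial disk paired with a time interval, and a point falls inside only when it satisfies both conditions. A small sketch (the point coordinates, radius, and time bounds here are made up for illustration):

```python
from math import hypot

def in_cylinder(points, center, radius, t_start, t_end):
    """Count space-time points (x, y, t) that fall inside a cylinder:
    within `radius` of `center` in space AND within [t_start, t_end] in time."""
    cx, cy = center
    return sum(1 for (x, y, t) in points
               if hypot(x - cx, y - cy) <= radius and t_start <= t <= t_end)

pts = [(0, 0, 1), (1, 0, 2), (3, 3, 2), (0.5, 0.5, 9)]
n = in_cylinder(pts, center=(0, 0), radius=2.0, t_start=0, t_end=5)
# the first two points fall inside; the third is too far away, the fourth too late
```

The scan then grows both the radius and the time interval, computing a likelihood for each cylinder, which is why the space-time version multiplies the amount of work so quickly.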

This becomes really computationally expensive, but the SaTScan software can help us calculate this as well.

But if you have a large number of points and a large number of data sets,

this probably would be intractable for your local desktop.

But this is just another method of taking a different data set and trying to look for any patterns that may be appearing that didn't occur by chance.

Because again, we don't want people to necessarily look at their data and say,

"Oh, I think this is an interesting cluster."

We can help confirm that by running statistics and saying,

"Yes, this is in fact interesting, so why don't we give you some sort of probability value for it?"

or "No, this is not in fact interesting; we checked all the different possible distributions, and we found this occurs by chance pretty often."

So these things can again give us more insight and ways to

help people detect the expected, discover the unexpected,

helps give us ideas on where we can show different clusters,

and information in the data,

and we can start thinking about how we can combine these all together into

some sort of visual analytics package which combines machine learning,

statistics, interaction, and visualization to help people