这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 伊利诺伊大学香槟分校 的课程

Pattern Discovery in Data Mining

119 评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 3

Module 3 consists of two lessons: Lessons 5 and 6. In Lesson 5, we discuss mining sequential patterns. We will learn several popular and efficient sequential pattern mining methods, including an Apriori-based sequential pattern mining method, GSP; a vertical data format-based sequential pattern method, SPADE; and a pattern-growth-based sequential pattern mining method, PrefixSpan. We will also learn how to directly mine closed sequential patterns. In Lesson 6, we will study concepts and methods for mining spatiotemporal and trajectory patterns as one kind of pattern mining applications. We will introduce a few popular kinds of patterns and their mining methods, including mining spatial associations, mining spatial colocation patterns, mining and aggregating patterns over multiple trajectories, mining semantics-rich movement patterns, and mining periodic movement patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

We will discuss Mining Spatial Colocation Patterns.

What is spatial colocation pattern?

Usually, a group of spatial features of events

can be frequently colocated in the same region.

For example, West Nile Virus may often occur in

the regions with poor mosquito control and the presence of birds.

So if we can find such patterns,

we may have better way to control the spread of the virus.

Let's look at the example on the right.

So we have A, B, C, D as four types of objects.

You can think A could be West Nile Virus, spread in region.

B could be some mosquito region.

Or C could be in some cities, D could be the presence of birds, something.

Then we use the address, means they are colocated.

For example, 3, 6, 17 are co-located, we used edge to link them together.

And a 4, 7, 10, 16 are co-located together.

We use the edge, linking them together.

Okay, like 2, 9 could be collocated, but on the other hand a 2,

14, 18, 11, and 15, these are collocated regions, okay?

Then we may be able to find rowset.

The rule set is this, okay?

If we want to find A, B, C, D, how many things are,

how many objects are colocated together as A, B, C, D for types.

When I be able to find four, seven, 10, 16 are colocated.

And 2, 11, 14, 15 are co-located or 8, 11, 14, 15 are co-located.

That means A, B, C, D we may find two,

actually three instances they are co-located.

That's they are all set.

On the other hand, A, B may be a little more.

For example, 5,13 is co-located, 7,

10 is co-located, so we may find more instances.

Then we may get a colocation rule like this.

If pattern A is colocating, whether pattern B may also colocating.

We may find this conditional probability of the colocation patterns.

That means if we want to find AB may imply CD, that means if AB are colocated,

likely CD will be colocated with certain probability.

Then, we need to compute the pattern A, at what condition,

in how many cases, when A is there, and AB is also there.

That means Actually AB, the cases is usually is the case.

You can find A and in the row AB set, rowset.

Okay, essentially what we need to calculate is A, B, C, D is a rowset.

You are probably going to see there are three such cases A, B, C, D together.

But there are four cases A,B together.

So the condition, if you find the A, B then you find the C, D, colocated.

The condition equals 3 over 4 is 75%.

The conditional probability for this rule is 75%.

Then the interesting thing is,

can we find such colocation patterns and rules automatically and efficiently?

Okay.

So let's see how we can derive an efficient algorithm to find them.

To find efficient algorithms,

we first introduce another concept called participation ratio.

That means that the feature f is participating in this pattern c.

What's the probability?

Okay.

So for example A participating in A, B, C, D, you're broken in C.

A is the red one.

We have one, two, three, four, five.

Five cases you'll find A.

But every time we find A We can find a, b, c, d.

There are only two such cases like 7 and 14.

That's why we get a 2 over 5 is the participation ratio.

Similarly, if you want to find D participating in A,

B, C, D, d It actually is the blue circle.

You can see there are only two blue circles, but

both of them are completely partitioned, participating in this ABCD pattern.

This simply says you get cases two over two is 100 percent.

Then we may derive a interesting property

called monotonicity property of participation ratio, okay?

That means suppose we have c and c prime as two core location patterns.

But c prime is a subset of c.

Then for every feature f, if feature

is participating in C prime, okay?

Then the probability will be higher than participating in a superset.

This is quite easily, can be an either and or

[INAUDIBLE] because you can think about this, if A participating in A, B then

the chance is higher than A participating into, in A, B, C, D.

Because every time when A, B occurs.

ABCD also will be.

When A, B, C D occurs, a, b will also be there.

That's why when f participating in c for sure the probability or

the participation ratio will be higher then participating supersect.

So based on the monotonicity, we can work out an Apriori-like algorithm

to efficient mine colocation patterns.

Let's give such an example.

Suppose minimum feature support is riveting a sigma,

and a minimum participation ratio is rewritten as roe.

Then we start from a set of single feature patterns.

If they are frequent, that means if their support is no less than sigma.

Okay.

Then, we can keep using upper array like a principle to find their pairs.

That mean from the single feature pattern try to find two feature patterns,

three feature patterns grow up to size k.

Anytime we can stop if the pattern is frequent to anyone

Because once it's not frequent, it's super patterned will not be frequent.

This is simply based on the Apriori principle.

Then based on these participation in ratio monotonicy we will be able to find

all the super patterns of a single feature or the multiple features.

Suppose we got a feature Is p,

we want to find it's super feature, okay nor super-pattern.

And we know we can get a little

bigger super-pattern to see whether it's greater in row.

Anytime if it's not greater in row, it's super-pattern would not need to be

examined because they cannot be bigger than row anymore.

So based on these Apriori principle, we can easily work out efficient algorithms.

So that's the trick or

that's the interestingness if we can find a monotonicity pattern,

we will be able to work out, Apriori-like efficient processing algorithms.