这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

Pattern Discovery in Data Mining

133 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kind of interesting measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and even the popularly used lift and chi-square measures may not be good under certain situations. We introduce the concept of null-invariance and introduce a new null-invariant measure for pattern evaluation. In Lesson 4, we examine the issues on mining a diverse spectrum of patterns. We learn the concepts of and mining methods for multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

[SOUND] Now we are going to discuss another interesting

problem, mining compressed patterns.

We know frequent pattern mining may often generate many many patterns.

But in many of such patterns may share some similarity.

Maybe there is scattered, but not so

meaningful if you generate all those patterns.

Let's look at such an example.

Suppose we finally get five pattern IDs.

So you perceive these five patterns are like P1 and P2.

They are very similar and their supports are also very similar.

But P2 and P3, they are similar in item sets but

their supports are rather different.

Can we compress them?

When we first examine closed patterns, actually for

closed pattern, there's nothing you can compress.

The reason is closed pattern require the supports are identical,

then you can compress them.

But none of them have identical support counts.

So nothing can be compressed.

What about max pattern?

Of course we can use max pattern, that's P3, to represent all the patterns.

But on the other hand you probably see P3, this support,

is rather different from all the others.

To use P3 you may lose a lot of information on the support of

other patterns.

So that desired output actually is P2 P3 and P4.

The reason you probably can see is P1 and P2, they share a similar item sets and

also they share very similar support counts.

So we may need a good measure to see what things can be compressed.

A good measure could be pattern distance measure.

We use this definition to define the pattern distance.

You probably can look at a P1 and P2 when we look at this.

P1 and P2, their transaction, intersection of their transaction ID.

What should be the count?

The number of transactions they can intersect, actually it's a smaller one

here, because every transaction containing P2 must also contain P1.

So their support counts intersection should be this number, okay?

What about the unit?

The unit actually says all those transactions, because we know all

the transactions, P1 must also containing the transactions of P2.

So their unit number should be this number, okay?

In that case,we perceive these two numbers very close to one.

Then the distance 1 minus this very close to 1 number,

you get something very close to 0.

That means their pattern distance is rather small.

Based on this, we probably can see we may be able

to define a theta clustering or theta cover.

That means if we can see the pattern P,

the theta cover of pattern P is finding those patterns,

all the pattern can be expressed by this, the distance is within delta.

Actually, you probably see, P2 is a good one, because P2 essentially can cover P1,

because P1 just have a little less item sets, but their support is so close.

To that extent we may say they are within this theta cover.

So for data clustering,

we will be able to cluster P1 P2 together using P2 to represent the pattern.

So that means if we do this data clustering,

then all the patterns in the cluster can be represented by one pattern, P.

So the problem becomes whether we should mine all the patterns then compress them,

or should directly mine these compressed patterns.

Actually there's a efficient method which can directly mine those compressed

patterns.

I'm not going to get into the detail but you may refer to this interesting paper.

Okay, then another interesting thinking is Redundancy-Aware Top-k Patterns.

That means we want to get a desired pattern which is similar to the compressed

one, because want to get high significance and low redundancy.

These kind of of set up patterns, okay.

Let's look at this a, b, c, d, four different kind of compression.

Actually a is a set of original patterns.

There are cluster shields, their pattern distance, and the color

the darker shows is more significant, the lighter shows is less significant.

Okay, in that case you probably can see in this bigger cluster,

there are three patterns.

They are quite significant.

If you just do the top-k pattern mining, that means you take it as a support count,

or other significant measure, you would only find these three patterns.

Suppose we wanted only find top three, then all the remaining

patterns like here in the other cluster is completely missing.

But if you say I just do the summarization, try to find no clusters.

And within each cluster try to find their centers.

Then you'll pretty well find those less significant patterns, so

this may not be a good balance.

Actually better balance is you take care of both significance and the redundancy.

Simply says, you look at this one, there is something very significant.

And that they are also in the cluster center

you may want to show these patterns.

In the meantime, suppose you can only show three,

you may show these are significant and less redundant.

This one is significant and also it represents this cluster.

So the problem becomes how to develop efficient and

effective method finding such redundancy aware top-k patterns.

There's an interesting study which uses the max marginal

significance to measure the combined significance of a pattern and

develop efficient methods to mine such patterns.

We are not going to get into detail of this method.

Interested readers we made read the paper we pointed out.

Thank you.

[MUSIC]