这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

Pattern Discovery in Data Mining

163 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kind of interesting measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and even the popularly used lift and chi-square measures may not be good under certain situations. We introduce the concept of null-invariance and introduce a new null-invariant measure for pattern evaluation. In Lesson 4, we examine the issues on mining a diverse spectrum of patterns. We learn the concepts of and mining methods for multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

[MUSIC]

Now, we study another interesting issue called mining negative correlations.

So we first need to distinguish rare patterns and negative patterns.

What is a rare pattern?

Rare pattern usually means there are some rare occurring items,

they have very low support but they are interesting.

We want to catch such patterns.

For example, buying Rolex watches.

How to mine such patterns?

We previously already discussed this.

For different item sets like for those rare items, we should be able to

set some individualized group based and minimum supports threshold.

That means for rare patterns for just those items,

we should set a rather low minimum support threshold,

then we'll be able to capture such patterns.

But negative patterns could be another very different one.

Negative patterns is those patterns that are negatively correlated.

That means they are unlikely happen together.

So for example, if you find some customer, the same customer,

who buys Ford Expedition, which is a SUV car, and

also a Ford Fusion, a hybrid car, together.

So they are unlikely to happen together, so

we called these patterns negative correlated patterns.

The problem becomes how to define such patterns?

We may have one support-based definition like this.

We say, if the itemsets A and B getting together their

support is far less than sup(A) x sup(B),

that means a chance to get together is far less than random, okay?

Then we can say A and B are negatively correlated.

Is this a good definition?

Actually, this definition may remind us the definition of lift.

Then we may see whether they work well for large transaction data sets.

Let's look at one example.

Suppose a store sold two needle packages, A and B 100 times each,

but only one transaction containing both A and B.

Then we will see these two needle packages A and B are likely negatively correlated.

But when there are in total only 200 transactions in your datasets,

you may see s(A U B) getting together,

because they got only one time, so 1 over 200, you get this number.

This is pretty small number.

But then he look at s(A) which is 100 over 200 transaction so it's 0.5.

Same as s(B).

So their product should be 0.25.

So this number is far bigger than this.

That means s(A U B) getting together is far less than s(A) x s(B).

So we can easily say A and B are negatively correlated,

they are negatively correlated patterns.

Okay, but when this store, so in total 10 to the power of 5,

that means 100,000 transactions.

Then suppose all the others does not contain package of A nor B.

Then the situation could be completely different,

because s(A U B) together is 1 over 10 to the power of 5.

But s(A) now is 100 over 100,000, so you get 1 over 1000.

s(B) is also 1 over 1000,

when they time together you get 1 over 10 to the power of 6.

This number is even smaller than A and B getting together.

You may say, A and B getting together is very frequent or

it's passive correlated, actually it's not.

What's the problem?

The problem actually is null transactions.

Because there are so many transactions that contain neither A nor

B, they are null transactions.

So we probably can see a good definition of negative correlation

should take care of the null invariance problem.

That means, when two itemsets A and B are identical related,

they should not be influenced.

Okay, whether they are an negative correlated or not,

they should not be influenced by the number of null transactions.

Okay, now we give you another interesting definition,

which is a Kulczynzki measure-based definition.

That means if we want to say A and B whether they are negative correlated,

what we need to see is A and B are frequent.

But the condition the probability of A under condition of B and

the probability of B under condition of A, their average should be less than epsilon.

Where epsilon is a small negative pattern support threshold.

Then we probably can see A and B negative correlated can be justified for

our needle package problem.

We can see no matter there are in total 200 transactions, or 100,000 transactions,

if we say epsilon is 0.01, we probably can see this

Kulczynski measure based judgement, we can easily see the average of the conditional

probability should be less than epsilon, so they are negative correlated.

So this seems to be very interesting and a good definition.

And how to mine them, actually these are the method similar to

our previously discussed pattern mining method.

We will not discuss it further.

[MUSIC]