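In this course, we will learn the basic concepts of data mining along with its fundamental methods and applications, and then dive into one subfield of data mining, pattern discovery, to study its concepts, methods, and applications in depth. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. The course gives you the opportunity to learn skills and put them into practice: applying scalable pattern discovery methods to massive transactional data, discussing pattern evaluation measures, and learning methods for mining diverse kinds of patterns, sequential patterns, and subgraph patterns.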

A course from University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining

155 ratings

From this lesson

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kind of interesting measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and even the popularly used lift and chi-square measures may not be good under certain situations. We introduce the concept of null-invariance and introduce a new null-invariant measure for pattern evaluation. In Lesson 4, we examine the issues on mining a diverse spectrum of patterns. We learn the concepts of and mining methods for multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei Han, Abel Bliss Professor

Department of Computer Science

We have learned support and confidence.

These two measures are not sufficient to describe association.

So the question becomes: what additional interestingness measures are good enough to describe the relationships between items?

That is why we want to examine measures such as lift and chi-square, to see whether they are good enough to serve as additional interestingness measures.

Lift has been popularly used in statistics as well.

We look at the same table as before, where B means playing basketball and C means eating cereal.

So we have exactly the same distribution.

For this contingency table, we compute the lift.

Lift is defined as follows, where B and C are two itemsets.

For the rule B implies C, the lift is the rule's confidence divided by the support of C.

Equivalently, it is the support of B and C together divided by the product of the support of B and the support of C.
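Written out as a formula, the two equivalent definitions just stated look like this. The symbols s(·) for support and c(·) for confidence are notation chosen here for illustration, not taken from the lecture.

$$
\mathrm{lift}(B, C) \;=\; \frac{c(B \rightarrow C)}{s(C)} \;=\; \frac{s(B \cup C)}{s(B)\, s(C)}
$$

The two forms agree because the confidence of B implies C is s(B ∪ C) / s(B).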

The general rule for lift is: if the lift is one, the two items are independent.

If it is greater than one, they are positively correlated.

If it is less than one, they are negatively correlated.

For our example data set, we calculate the lift of B and C and the lift of B and not C, and we derive 0.89 and 1.33.

Then, from these values and the rules, we can see that B and C are negatively correlated because the lift is less than one.

B and not C are positively correlated because the lift is greater than one.

This actually fixes our problem, because we know B and C should be negatively correlated and B and not C should be positively correlated.

So this looks very nice.
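The two lift values can be reproduced with a short sketch. The counts below are assumptions: 600 basketball players of whom 400 eat cereal come from the lecture, while the total of 1000 students and the three-to-one split of C versus not C (750 and 250) are filled in so that the numbers are consistent. The function name `lift` and the variable names are illustrative only.

```python
# Assumed 2x2 contingency counts (hypothetical, chosen to be consistent
# with the figures quoted in the lecture).
N = 1000                 # total students (assumption)
count_B = 600            # play basketball
count_C = 750            # eat cereal (assumption: C : not C = 3 : 1)
count_not_C = 250        # do not eat cereal
count_B_and_C = 400      # play basketball and eat cereal
count_B_and_not_C = 200  # play basketball, no cereal


def lift(count_xy, count_x, count_y, total):
    """Return s(X u Y) / (s(X) * s(Y)), with supports taken as fractions."""
    s_xy = count_xy / total
    s_x = count_x / total
    s_y = count_y / total
    return s_xy / (s_x * s_y)


lift_B_C = lift(count_B_and_C, count_B, count_C, N)
lift_B_notC = lift(count_B_and_not_C, count_B, count_not_C, N)

print(f"lift(B, C)     = {lift_B_C:.2f}")
print(f"lift(B, not C) = {lift_B_notC:.2f}")
```

A value below one for B and C and above one for B and not C matches the reading given in the lecture.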

Let's look at another measure popularly used in statistics as well, called chi-square.

By the definition of chi-square, we need to calculate the expected value.
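For reference, the standard chi-square statistic sums, over all cells of the contingency table, the squared difference between observed and expected counts, scaled by the expected count. The notation below is the usual textbook form and is not quoted from the lecture.

$$
\chi^2 \;=\; \sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
$$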

How do we calculate the expected value?

As we can see, this 400 is a real value; it is the observed value.

But the expected value is based only on the distribution.

For example, the distribution of C versus not C is 750 over 250.

This is three to one, so splitting the 600 students in B three to one gives 450 versus 150.
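The expected counts follow from the row and column totals alone: each cell's expected value is its row total times its column total, divided by the grand total. A minimal sketch, assuming a total of 1000 students, a B row of 600, and a three-to-one column split of 750 to 250 (the dictionary names are illustrative):

```python
# Hypothetical margins consistent with the lecture's figures.
row_totals = {"B": 600, "not B": 400}
col_totals = {"C": 750, "not C": 250}
total = 1000  # assumption

# Expected count under independence: row total * column total / grand total.
expected = {
    (row, col): row_totals[row] * col_totals[col] / total
    for row in row_totals
    for col in col_totals
}

for cell, value in expected.items():
    print(cell, value)
```

For the B row this gives the 450 versus 150 split described above.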

In that case, as you can probably see, we can still use the popular rules: if chi-square is zero, they are independent.

If it is greater than zero, they are correlated, either positively or negatively.

So we need an additional test to see whether they are positively or negatively correlated.

Now, for our example, we can easily calculate that chi-square is almost 56.

So B and C should be correlated.

Further, we can say they are negatively correlated, because the expected value is 450 but the observed value is only 400, which is less.
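The chi-square computation and the follow-up direction test can be sketched as below. The B row (400 and 200) follows from the lecture's figures; the not-B row (350 and 50) is an assumption derived from a total of 1000 students and a 750 to 250 column split, so the printed statistic depends on that assumption. The helper name `chi_square` is illustrative.

```python
# Observed counts; rows are B / not B, columns are C / not C.
observed = [
    [400, 200],  # B
    [350, 50],   # not B (assumed, see lead-in)
]


def chi_square(table):
    """Return (statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    expected = []
    for i, row in enumerate(table):
        exp_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            exp_row.append(exp)
            stat += (obs - exp) ** 2 / exp
        expected.append(exp_row)
    return stat, expected


stat, expected = chi_square(observed)

# Chi-square alone gives no sign, so compare observed with expected for (B, C).
direction = "negative" if observed[0][0] < expected[0][0] else "positive"

print(f"chi-square = {stat:.2f}")
print(f"(B, C): observed {observed[0][0]}, expected {expected[0][0]:.0f}")
print(f"direction of correlation: {direction}")
```

With these assumed counts the statistic comes out near 55.6, and the observed 400 falls below the expected 450, which is the negative correlation described above.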

So this seems to solve the problem as well.

But the question becomes whether lift and chi-square are good in all cases.

Let's examine an interesting case.

In this case, you can probably see that the count for not B and not C is quite big.

There are 100,000 of them.

These are called null transactions, because they contain neither B nor C.

If we look just at the relationship between B and C, we see that B and C should be negatively correlated, because it is not easy to get B and C together.

B and not C is far bigger.

C and not B is also far bigger.

But if we compute the lift of B and C, we get 8.44, which is far bigger than one.

That would indicate that B and C are strongly positively correlated.

This does not seem right.

Alternatively, we can try chi-square on this same contingency table.

We add the expected values.

We do the computation.

We find that chi-square is bigger than zero.

At the same time, the observed value is far bigger than the expected value.

So we would also have to say that B and C are strongly positively correlated.

This seems to be wrong.
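The same effect can be checked numerically. Only the 100,000 null transactions are given in the lecture; the other three counts below (100 for B and C, 1000 for each of the other two cells) are hypothetical values chosen because they reproduce the quoted lift of 8.44. The helper name `chi_square_2x2` is illustrative.

```python
# Rows are B / not B, columns are C / not C.
# Only the 100000 null transactions come from the lecture; the rest is assumed.
observed = [
    [100, 1000],      # B
    [1000, 100000],   # not B
]


def chi_square_2x2(table):
    """Return (statistic, expected count for the top-left cell)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            exp = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - exp) ** 2 / exp
    expected_bc = row_totals[0] * col_totals[0] / n
    return stat, expected_bc


stat, expected_bc = chi_square_2x2(observed)

print(f"chi-square = {stat:.1f}")
print(f"(B, C): observed {observed[0][0]}, expected {expected_bc:.2f}")
```

Under these assumed counts the observed 100 co-occurrences sit far above the expected count, so chi-square points to a positive correlation even though B and C rarely appear together relative to B alone or C alone.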

What's the problem?

Actually, there are too many null transactions.

That may make things distorted.

We need to fix it.
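One way to see the distortion is to hold the B and C counts fixed and vary only the number of null transactions. In this sketch the three non-null counts (100, 1000, 1000) are hypothetical, chosen so that 100,000 null transactions reproduce the lift of 8.44 quoted in the lecture; the function name `lift_from_counts` is illustrative.

```python
def lift_from_counts(n_bc, n_b_only, n_c_only, n_null):
    """Lift of B and C computed from the four cells of a 2x2 table."""
    total = n_bc + n_b_only + n_c_only + n_null
    s_bc = n_bc / total
    s_b = (n_bc + n_b_only) / total
    s_c = (n_bc + n_c_only) / total
    return s_bc / (s_b * s_c)


# Same B/C counts every time (assumed); only the null count changes.
results = {
    n_null: lift_from_counts(100, 1000, 1000, n_null)
    for n_null in (0, 10_000, 100_000)
}

for n_null, value in results.items():
    print(f"null transactions = {n_null:>7}: lift = {value:.2f}")
```

The relationship between B and C is identical in all three runs, yet the lift moves from well below one, through one, to well above one as null transactions are added.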