这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

Pattern Discovery in Data Mining

192 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 1

Module 1 consists of two lessons. Lesson 1 covers the general concepts of pattern discovery. This includes the basic concepts of frequent patterns, closed patterns, max-patterns, and association rules. Lesson 2 covers three major approaches for mining frequent patterns. We will learn the downward closure (or Apriori) property of frequent patterns and three major categories of methods for mining frequent patterns: the Apriori algorithm, the method that explores vertical data format, and the pattern-growth approach. We will also discuss how to directly mine the set of closed patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

Now, we are going to introduce another set of important concepts,

closed patterns and max-patterns.

Actually, in association rule mining or frequent pattern mining,

there is one challenge is,

in many cases, we may generate too many frequent patterns.

Usually, a long pattern will contain a combinatorial number of sub-patterns.

Just give you one simple example.

Suppose we get a transaction database.

It contains only two transactions.

T_sub_1 and T_sub_2.

For T_sub_1, it contains only 50 items, a_sub_1 to a_sub_50.

For transaction t_2, it contains 100 itemsets,

a_sub_1 up to a_sub_100.

Suppose we set the minimum support,

the absolute minimum support is only one.

That means everything is frequent.

Then, how many frequent items it contains? Let's have a try.

So, for frequent 1-itemsets,

we have a_1 occurs twice, a_2 occurs twice,

up to a_50 occurs twice,

then a_51 occurs only once up to a_100 occurs only once.

Then what about 2-itemsets?

Actually, 2-itemsets is all the possible combinations of a_1 to a_2,

a_1 to a_3, up to a_99 to a_100.

You probably can see some are support two,

some support is only one.

OK. Then we can go on and on up to 99-itemsets,

we again get a_1 to a_sub_99,

the support is one.

We can chop one by one,

so finally you get a_2,

a_3 to a_sub_100, support is one.

Finally, we're get a 100-itemset,

where support is only one.

If we add all these together,

that means 100 choose 1, 100 choose 2,

up to 100 choose 100,

we would get a huge number,

2^100 -1, that many sub-patterns.

This is a huge set.

For any computer, it's too huge to compute or store.

Then, how can we handle such a challenge?

One interesting proposal is introduced a concept of close patterns.

What is close pattern?

We say X is closed if X is frequent,

and there exists no super-pattern Y with the same support as X.

Just give you a simpler example.

We look at the same transaction database,

like T_1 and T_2.

We still say minimum support is one.

So how many close pattern does this transaction contain?

So instead of 2^100 - 1,

we actually find only two,

a_1 to a_50, support is two;

a_1 to a_100, support is one.

You may say, wait, you may lose something. Let's look at it.

Do you know a_1 to a_49?

Actually, it's contained here;

it must be support of two.

Do you know a_1 to a_51?

It contains here, support must be one.

So you can see you do not lose anything because you

got the concrete patterns which are the maximum close to one.

In the meantime, you do not lose the support information for any sub-patterns.

So to that extent,

we can say closed pattern is a lossless compression of frequent patterns.

It does reduce the number of patents but does not lose the support information.

For example, you will still be able to say a_sub_2 to a_sub_40,

the support is two.

a_5, a_51, this itemset support is one.

Then, let's look at another possible compression called max-patterns.

Max-pattern, the definition is very similar except it does not care the support anymore.

That means X is a max-pattern if X is frequent

and there exists no frequent super-pattern Y which is a super-pattern of X,

and we don't care support anymore.

So the difference from the closed pattern is we do not care

the real support of sub-patterns of a max-pattern.

Let's look at the transaction database like this.

OK. We still say minimum support is one,

then how many max-patterns does this transaction database contain?

So we will find only one.

The reason is, even a_sub_1 to a_sub_50,

the support is two, but remember,

we do not care support anymore;

for this a_1 to a_sub_50,

it does exist, a super-pattern like this,

OK, which is also frequent.

To that extent, this is a further compression,

so you only get a one pattern,

but it is a lossy compression because we only know a_sub_1 to a_sub_40 is frequent,

but we will not know the real support of a_sub_1 to a_sub_40 and many more.

So, you do lose a support information of many frequent itemsets.

OK. To that extent, in many applications,

mining closed patterns is more desirable than mining max-patterns.

We're going to see this in our future discussions.

So for this lecture,

we mainly give you four recommended readings.

The first one, actually,

is the first time introduced,

frequent patterns and association rules.

The second one is the first time introduced, max-patterns.

The third one is the first time introduced, the closed patterns.

And the fourth one actually is an overview,

give you overall, you know, this field,

how the frequent pattern mining came and how it evolves up to this stage. Thank you.