这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

Pattern Discovery in Data Mining

155 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 1

Module 1 consists of two lessons. Lesson 1 covers the general concepts of pattern discovery. This includes the basic concepts of frequent patterns, closed patterns, max-patterns, and association rules. Lesson 2 covers three major approaches for mining frequent patterns. We will learn the downward closure (or Apriori) property of frequent patterns and three major categories of methods for mining frequent patterns: the Apriori algorithm, the method that explores vertical data format, and the pattern-growth approach. We will also discuss how to directly mine the set of closed patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

[SOUND]

Hi, I'm going to introduce you another

interesting pattern mining approach called pattern growth approach.

This approach is represented by interesting algorithm called FPGrowth.

Let's look at how this algorithm works.

The general idea is first we find the frequent single items and

then we partition the database based on each such item.

Then we recursively grow frequent patterns by doing the above iteratively or

recursively for each partitioned database, also called conditional database.

To facilitate efficient processing, we use a new data structure called FP-tree.

The whole mining can be summarized as follows.

We recursively construct and a mine conditional FP-trees.

Until the result FP-tree is empty or

until it contains only one path, the single path will generate

all the combinations of its sub-paths each of which is a frequent pattern.

Let's look at a simple example.

For this transaction database, it has only five transactions.

Each contains a set of items.

So we can scan the database once.

Find the single of item frequent pattern like the following.

Suppose minimum support is three.

We will be able to find the following frequent single item.

Then we can sort the frequent items based on it's

support frequency in descending order like this.

Base on this, we can construct a tree by first construct the header, following this

order, then scan the database, we can construct this tree as follows.

For example, okay, if you look at the first transaction, it's f, c,

a, m, p, so we construct it f, c, a, m, p with only support as 1.

Then we get f, c, a, b, m, we constructed this f, c, a, b, m.

And this power support becomes 2, this maintains to be 1.

You can go one by one working on the following tree.

Then to mine this tree, we can use the divide and conquer and do it recursively.

We can do it like this, okay, for each tree we can construct this pattern base.

We call it p's conditional pattern base, m's conditional pattern base.

Let's look at how p's conditional pattern base is constructed.

So if you can see, p is the leaf of the node.

You only look up.

You get f, c, a, m that supports 2 because

p occurs together with this branch only twice.

That's why you get f, c, a, m 2.

Then for the same reason, you look at the other p,

it's cb happens only once together with p, that's why we get cb1, okay.

Then if you want to look at m's conditional database,

the philosophy you were thinking the patterns having m but not having

p because of p already being taken care of by the p's conditional pattern base.

So you look at m's conditional database, actually you also look up and

you get f, c, a, so for this branch they occur twice, that's why you get f, c, a 2.

For the same reason, you look at this m and you get an f, c, a, b 1.

So you get f, c, a, b 1.

You can do this on and on.

And with this, you can get

into transform the prefix paths of p to get the following.

Then, the remaining task which is, mind this, conditional pattern base.

For this conditional pattern base, we mine single item patterns,

we construct the FP-tree and mine it recursively.

For example, for p's conditional pattern base, you will get only c: 3,

the reason is f, a, m, and b they never occur

more than three times, so they are out, they are not frequent so you only get 3.

But for m-conditional pattern-base, you per can see f, c, a really

occurs three times, the b only occurred once, then you'll get a f, c, a 3.

And b-conditional pattern-base gives you where you can see none of them,

actually pass the support threshold of 3, so it's empty.

So let's just look at of course, you'll get a and

c's pattern-base like f: 3 and fc:3.

Actually for the single branch, you can dump all the possible combinations there,

support our tree.

But we just look in the recursive way how we can mine this pattern-base.

For this pattern-base, you probably can see, you can say for

a's conditional pattern-base you'll get this.

You'll get fc: 3, and for c you get f: 3, you can see that's the same thing.

Then for this particular am conditional FP-tree you might,

you just take a c, you say c's that means a, m, c or c, a,

m their conditional pattern-base is only f: 3, you get this.

Then you can dump all the patterns like this.

For single branch essentially that's the same thing, you will dump

all the patterns, like all the possible combinations, they are all three, okay.

You can see if you get a single prefix-path,

you actually can partition this into two parts like this path you can mine it,

and this path you can mine it and then you can just concatenate the path and results.

What about if tree cannot fit in the main memory?

If you cannot fit in the memory, we can use database projection.

That means, we project the database based on the patterns,

based on single items set, okay.

So, then we can construct and mine this tree for each projected database.

So we can have parallel projection or

partition projection, two different methods.

The parallel projection means for this one,

you will get f4 predict database and f3 predict database.

For example, for

this first stream you will see gh suppose is non frequent to only f.

F's a frequent.

You will get an f4 predict database, you'll get an f2, f3.

Okay, then, f3's predicted base, you get f2.

That's one,

and every one of these projected databases is independent of the others.

So, you can mine them in parallel.

But then, you also can have partition projection.

Partition projection the general philosophy is for each string.

You only put into one place.

For example, this one country is therefore you only put f2, f3,

but you don't put this f2 to f3 projected database.

When you're finished mine this f4, when you do this projection then you

think this one migrates to f3, you put this one into f3, okay.

So that's just a different choices for

different partition of parallel projections,

it's just a different way of implemented, thank you.

[MUSIC]