In this course, we will learn the general concepts of data mining together with its basic methods and applications, and then dive into a subfield of data mining, pattern discovery, to study its concepts, methods, and applications in depth. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. This course gives you the opportunity to learn skills and get hands-on practice: applying scalable pattern discovery methods to massive transaction data, discussing pattern evaluation measures, and studying methods for mining diverse kinds of patterns, sequential patterns, and subgraph patterns.


A course from the University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining

119 ratings


From the lesson

Module 1

Module 1 consists of two lessons. Lesson 1 covers the general concepts of pattern discovery. This includes the basic concepts of frequent patterns, closed patterns, max-patterns, and association rules. Lesson 2 covers three major approaches for mining frequent patterns. We will learn the downward closure (or Apriori) property of frequent patterns and three major categories of methods for mining frequent patterns: the Apriori algorithm, the method that explores vertical data format, and the pattern-growth approach. We will also discuss how to directly mine the set of closed patterns.

- Jiawei Han, Abel Bliss Professor

Department of Computer Science

Let's first introduce some basic concepts: frequent patterns and association rules.

Let's first look at this simple transaction example. There are five transactions; 10 to 50 are the transaction IDs, and next to each ID is the set of items the customer bought. For example, transaction 10 contains beer, nuts, and diaper, which together form an itemset, because it is a set of items. For this particular line, it is a 3-itemset, because it contains three items.

For each itemset, we have the concept of support. Support means how many times the itemset appears in this transaction dataset. In this particular case, beer occurs in three transactions, so the support count of beer is three.

You can also use relative support, which is the fraction of transactions containing the itemset. For example, there are five transactions in total and three of them contain beer, so the relative support is 3/5, or 60 percent.
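As a concrete sketch, support can be computed with one pass over the transactions. Note the dataset below is an assumption: only transaction 10 is spelled out in the lecture, so the other four transactions are filled in for illustration.

```python
# Hypothetical reconstruction of the slide's five transactions:
# only transaction 10 is given explicitly in the lecture.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, transactions):
    """Absolute support: the number of transactions containing every item."""
    return sum(1 for items in transactions.values() if itemset <= items)

abs_sup = support_count({"Beer"}, transactions)
rel_sup = abs_sup / len(transactions)
print(abs_sup, rel_sup)  # 3 0.6
```

The subset test `itemset <= items` is exactly the "transaction contains the itemset" condition from the lecture.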

We say an itemset X is frequent if the support of X passes a minimum support threshold. For example, if we set the minimum support threshold to 50 percent, then in this dataset you will find there are four frequent 1-itemsets. Take beer: there are three occurrences, so its absolute support is three and its relative support is 3/5, which is 60 percent.

But for frequent 2-itemsets, if you check, there is only one: beer and diaper, which occur together in 3/5 of the transactions. None of the other 2-itemsets pass the 50 percent threshold, so there is only one.
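This check can be sketched by brute-force enumeration. The dataset is again an assumed reconstruction of the slide (only transaction 10 is explicit in the lecture), and `frequent_itemsets` is a hypothetical helper, not a library function:

```python
from itertools import combinations

# Assumed reconstruction of the slide's five transactions.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def frequent_itemsets(transactions, k, min_rel_sup):
    """Enumerate every k-itemset whose relative support meets the threshold."""
    items = sorted(set().union(*transactions.values()))
    n = len(transactions)
    found = {}
    for combo in combinations(items, k):
        count = sum(1 for t in transactions.values() if set(combo) <= t)
        if count / n >= min_rel_sup:
            found[frozenset(combo)] = count
    return found

print(frequent_itemsets(transactions, 1, 0.5))  # four frequent 1-itemsets
print(frequent_itemsets(transactions, 2, 0.5))  # only {Beer, Diaper}
```

Exhaustive enumeration like this is fine for five transactions; the Apriori property covered later in the module is what makes the search scale.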

From the frequent itemsets, we can introduce an interesting kind of rule: the association rule. A rule X implies Y simply asks: if people buy X, with what support and confidence will they also buy the itemset Y? Here s is the support, which is the probability that X and Y are contained together in a transaction. And c is the confidence, which is a conditional probability: given that a transaction contains X, what is the probability that it also contains Y? For this probabilistic computation, you can use support(X ∪ Y) divided by support(X); that is, take the support of the whole rule and divide it by the support of the left-hand side.
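In code, that confidence formula is just a ratio of two support counts (the absolute counts cancel the 1/n factor). The dataset is again an assumed reconstruction of the slide; only transaction 10 is explicit in the lecture.

```python
# Assumed reconstruction of the slide's five transactions.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

def confidence(X, Y):
    """conf(X => Y) = sup(X union Y) / sup(X); '|' is Python's set union."""
    return support_count(X | Y) / support_count(X)

print(confidence({"Beer"}, {"Diaper"}))   # 1.0
print(confidence({"Diaper"}, {"Beer"}))   # 0.75
```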

You may notice the notation X ∪ Y. This is a set union. If you look at the Venn diagram of events, the transactions containing X, say beer, are one region; the transactions containing Y, say diaper, are another; and the transactions containing both are their intersection, the intersection of the events. But from the itemset point of view, we count the transactions that contain both X and Y, both beer and diaper. That is why we write not X ∩ Y but X ∪ Y: if X is beer and Y is diaper, then as itemsets X ∩ Y is empty, while X ∪ Y is the itemset that contains both.

Association rule mining, then, tries to find all such rules that pass both a minimum support threshold and a minimum confidence threshold. We already know that if we set the minimum support to 0.5, these are the frequent 1-itemsets and this is the frequent 2-itemset. From here, if we set the minimum confidence to 50 percent, we can derive two association rules.

For these two rules, using this computation: beer and diaper occur together three times, so the rule support is 3/5, or 60 percent. Moreover, every time beer occurs, diaper also occurs; that is why the confidence of beer implies diaper is 100 percent. But for diaper implies beer, the support of diaper is four while beer and diaper occur together only three times, so there is only a 75 percent probability that a customer buying a diaper will also buy beer.

Are there more rules? If you check, since this is the only frequent 2-itemset, these two are the only association rules that can be generated from this transaction data.
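Putting the whole walk-through together, a minimal rule-mining sketch over this example confirms that exactly these two rules survive both thresholds. As before, the dataset is an assumed reconstruction of the slide (only transaction 10 is explicit in the lecture).

```python
from itertools import combinations

# Assumed reconstruction of the slide's five transactions.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
MIN_SUP, MIN_CONF = 0.5, 0.5
n = len(transactions)

def count(itemset):
    """Absolute support of an itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

rules = []
items = sorted(set().union(*transactions.values()))
for a, b in combinations(items, 2):            # all candidate 2-itemsets
    pair_count = count({a, b})
    if pair_count / n >= MIN_SUP:              # the pair must be frequent
        for lhs, rhs in (({a}, {b}), ({b}, {a})):
            conf = pair_count / count(lhs)     # sup(X u Y) / sup(X)
            if conf >= MIN_CONF:
                rules.append((lhs, rhs, pair_count / n, conf))

for lhs, rhs, s, c in rules:
    print(f"{lhs} => {rhs}  support={s:.0%}, confidence={c:.0%}")
```

Running this yields beer implies diaper (support 60%, confidence 100%) and diaper implies beer (support 60%, confidence 75%), matching the lecture's conclusion.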