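In this course, we will learn the basic concepts of data mining along with its fundamental methods and applications, then dive into a subfield of data mining, pattern discovery, and study its in-depth concepts, methods, and applications. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. This course gives you the opportunity to learn skills and practice applying scalable pattern discovery methods to massive transactional data, discuss pattern evaluation measures, and study methods for mining diverse kinds of patterns, sequential patterns, and subgraph patterns.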


A course from University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining

141 ratings


From this lesson

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kind of interestingness measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and even the popularly used lift and chi-square measures may not be good under certain situations. We introduce the concept of null-invariance and introduce a new null-invariant measure for pattern evaluation. In Lesson 4, we examine the issues in mining a diverse spectrum of patterns. We learn the concepts of and mining methods for multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei Han, Abel Bliss Professor

Department of Computer Science

[SOUND] Now, we come down to compare

these null-invariant measures.

So, which one is better?

We can sense that not all of these null-invariant measures are created

equal, so we want to see which one is better across all the cases.

Let's examine the two-variable contingency table of

the transactions containing milk and coffee.

Let's look at the cases in these data sets.

First, look at D1 and D2.

You can see that the only difference is the number of null transactions.

But you can also see that, very likely, milk and

coffee go together; they should be positively associated.

In that sense, you can see that all five

null-invariant measures give equal values on D1 and D2.

No matter how many null transactions there are,

they do not change their value.

And their values are also very close to one, in the sense that

milk and coffee are likely bought together.
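The null-invariance just described can be checked numerically. The following is a minimal sketch, assuming illustrative counts for D1 and D2 (the slide's table is not reproduced in this transcript, so the numbers below are assumptions) and hypothetical helper names `kulczynski` and `lift`. Lift is included only as a contrast, since it depends on the total number of transactions.

```python
def kulczynski(mc, m_only, c_only):
    """Average of the two conditional probabilities P(c|m) and P(m|c)."""
    sup_m = mc + m_only  # transactions containing milk
    sup_c = mc + c_only  # transactions containing coffee
    return 0.5 * (mc / sup_m + mc / sup_c)


def lift(mc, m_only, c_only, null):
    """P(m, c) / (P(m) * P(c)); this depends on the null-transaction count."""
    n = mc + m_only + c_only + null
    return (mc / n) / (((mc + m_only) / n) * ((mc + c_only) / n))


# Assumed counts: milk and coffee are bought together often, alone rarely.
# Only the number of null transactions differs between the two data sets.
d1 = dict(mc=10_000, m_only=1_000, c_only=1_000, null=100_000)
d2 = dict(mc=10_000, m_only=1_000, c_only=1_000, null=100)

for name, d in (("D1", d1), ("D2", d2)):
    k = kulczynski(d["mc"], d["m_only"], d["c_only"])
    lf = lift(**d)
    print(f"{name}: Kulczynski={k:.2f}  lift={lf:.2f}")
```

Under these assumed counts, Kulczynski is identical on both data sets because the null count never enters its formula, while lift changes substantially when only the null count changes.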

Now look at D3. In D3, milk and coffee appearing together is quite

rare, because each appears alone much more frequently.

In that sense, all the measures' values are very close to zero.

Then look at D4. In D4, milk and coffee together, milk alone, and coffee alone

each occur in about 1,000 cases.

No matter how many null transactions there are,

the measures land right in the middle, at 0.5.

Only Jaccard gives 0.33; for Jaccard,

that value is what right in the middle looks like.

Then we look at the remaining cases, D5 and D6.

In D5, the counts are 1,000, 100, and 10,000 cases.

What you can see is that, from the coffee point of view,

a coffee guy may say milk and coffee are likely to go together,

because there are only 100 cases buying coffee but not milk,

versus 1,000 cases buying both coffee and milk.

But a milk guy would probably say they are very unlikely to go together,

because there are 10,000 cases buying milk but not coffee,

but only 1,000 cases buying both milk and coffee.

So, in that case, look at the different measures.

It's interesting to see that All Confidence and

Jaccard both say the value is close to zero: unlikely to go together.

But Max Confidence says it's close to one: very likely to go together.

Then Kulczynski says, I'm right in the middle,

because the tug of war on each side is ten to one.

And Cosine says, I lean a little toward unlikely to go together.

Now, we make this even more extreme in D6.

Here it is 1,000 to 10 on one side, and 1,000 to 100,000 on the other.

The coffee guy says they are very, very likely to go together.

But the milk guy says they are very unlikely to go together.

In this case, you can see that All Confidence and

Jaccard drop down to 0.01, and even Cosine drops down to 0.1.

But Max Confidence says, I am very confident: its value is very close to one,

because they are very likely to go together.

But Kulczynski says, I'm still neutral, because one side is 100 to one,

the other is one to 100, and they have equal ratios.
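To make the comparison on D4 through D6 concrete, here is a minimal sketch that computes the five null-invariant measures from the counts quoted in the lecture (both items together, milk only, coffee only). The function name `null_invariant_measures` and the dictionary layout are assumptions made for illustration; the formulas are the standard definitions in terms of the conditional probabilities P(c|m) and P(m|c).

```python
from math import sqrt


def null_invariant_measures(mc, m_only, c_only):
    """Five null-invariant measures for a milk/coffee contingency table.

    The null-transaction count is deliberately not an argument:
    none of these measures uses it.
    """
    sup_m = mc + m_only          # transactions containing milk
    sup_c = mc + c_only          # transactions containing coffee
    p_c_given_m = mc / sup_m
    p_m_given_c = mc / sup_c
    return {
        "all_conf": min(p_c_given_m, p_m_given_c),
        "max_conf": max(p_c_given_m, p_m_given_c),
        "kulczynski": 0.5 * (p_c_given_m + p_m_given_c),
        "cosine": mc / sqrt(sup_m * sup_c),
        "jaccard": mc / (sup_m + sup_c - mc),
    }


# Counts as quoted in the lecture: (both, milk only, coffee only).
datasets = {
    "D4": (1_000, 1_000, 1_000),
    "D5": (1_000, 10_000, 100),
    "D6": (1_000, 100_000, 10),
}

for name, counts in datasets.items():
    values = null_invariant_measures(*counts)
    row = "  ".join(f"{k}={v:.2f}" for k, v in values.items())
    print(f"{name}: {row}")
```

With these counts, Kulczynski stays at 0.5 on all three data sets, while Max Confidence climbs toward one and All Confidence and Jaccard fall toward zero, in line with the values quoted in the lecture.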

So, which one do you like?

Looking at D4 through D6,

these are the cases that really differentiate the five null-invariant measures.

And we can see that the Kulczynski measure

holds firm in these very imbalanced cases.

Because the ratios are balanced on both sides, it holds firm at 0.5.

That looks interesting.

But on the other hand, since we know some of these cases are very imbalanced,

we may want to introduce another measure, called the imbalance ratio.

The imbalance ratio is defined so that

the support of itemset A and the support of itemset B,

and in particular their difference, play an important role in its computation.

Then, for the same last three cases,

the Kulczynski measure holds firm at 0.5.

But the imbalance ratio for D4 is zero,

because the two sides are already balanced.

For D5 it becomes 0.89, so they are rather imbalanced.

And D6 is very imbalanced.

So the imbalance ratio can really show you how balanced the two sides are.

So we feel that Kulczynski plus the imbalance ratio, these two together,

present a clear picture of all three data sets, D4 through D6:

D4 is neutral and balanced, D5 is neutral but

imbalanced, and D6 is neutral but very imbalanced.
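The imbalance ratio values for D4 through D6 can be reproduced with a short sketch. The lecture only says that the difference between the two supports drives the measure; the formula used below, |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B together)), is the commonly cited definition and is stated here as an assumption about what the slide shows. The function name `imbalance_ratio` is likewise hypothetical.

```python
def imbalance_ratio(mc, m_only, c_only):
    """Imbalance ratio of milk and coffee from contingency-table counts.

    Assumed definition: |sup(m) - sup(c)| divided by the number of
    transactions that contain milk or coffee (or both).
    """
    sup_m = mc + m_only
    sup_c = mc + c_only
    sup_either = sup_m + sup_c - mc
    return abs(sup_m - sup_c) / sup_either


# Counts as quoted in the lecture: (both, milk only, coffee only).
for name, counts in {
    "D4": (1_000, 1_000, 1_000),
    "D5": (1_000, 10_000, 100),
    "D6": (1_000, 100_000, 10),
}.items():
    print(f"{name}: IR={imbalance_ratio(*counts):.2f}")
# D4: IR=0.00
# D5: IR=0.89
# D6: IR=0.99
```

Reading the imbalance ratio next to a Kulczynski value of 0.5 separates a neutral, balanced case (D4) from neutral cases with one dominant item (D5, D6).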

Finally, we're going to show you a real data set, the DBLP data set,

where we want to look at co-author relationships.

So, let's look at this table.

We obtained this table around 2007.

We studied the recent database conferences and looked at the authors

who published and co-authored papers in those conferences.

An interesting thing you can see is, for

example, the pair Hans-Peter Kriegel and Martin Pfeifle.

Martin Pfeifle has 18 papers, and all of them are with Hans-Peter Kriegel.

Hans-Peter Kriegel has 146 papers, of which 18 are with Martin Pfeifle.

In that case, Kulczynski shows a pretty strong value, which

simply says these two authors are closely tied together in some way.

But they are imbalanced as well.

Looking at the imbalance ratio,

you can calculate that it is really high.

So in that case, you can easily judge that Hans-Peter

Kriegel is likely to be the advisor of Martin Pfeifle.

So using Kulczynski and the imbalance ratio,

we can easily see advisor-advisee relationships and close collaborators.
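As a rough illustration of the co-author example, the sketch below applies both measures to the paper counts quoted in the lecture (146 papers, 18 papers, 18 joint). The function names are hypothetical, the imbalance ratio uses the commonly cited definition as an assumption, and the numbers on the lecture's slide may differ slightly if it was computed from other counts.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of P(b|a) and P(a|b), computed from support counts."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """|sup(a) - sup(b)| over the number of papers by either author."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


# Paper counts quoted in the lecture.
kriegel_papers = 146
pfeifle_papers = 18
joint_papers = 18

kulc = kulczynski(kriegel_papers, pfeifle_papers, joint_papers)
ir = imbalance_ratio(kriegel_papers, pfeifle_papers, joint_papers)
print(f"Kulczynski={kulc:.2f}  imbalance ratio={ir:.2f}")
# Kulczynski=0.56  imbalance ratio=0.88
```

A Kulczynski value above 0.5 together with a high imbalance ratio is the pattern the lecture associates with an advisor-advisee pair: one author's papers are almost all joint, while the other has many more papers overall.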

In one research paper on finding advisor-advisee pairs,

we actually used those measures to find them with reasonably high accuracy.

So finally, we will show you a list of papers.

These papers are quite representative of how to judge correlation

relationships with different measures, including the interestingness measures,

the null-invariant ones, and the Kulczynski measure we discussed.

Thank you.

[MUSIC]