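In this course, we will learn the basic concepts of data mining and its basic methods and applications, then go deeper into a subfield of data mining, pattern discovery, and learn its in-depth concepts, methods, and applications. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. The course gives you the opportunity to learn skills and to practice applying scalable pattern discovery methods to massive transactional data, to discuss pattern evaluation measures, and to study methods for mining diverse kinds of patterns, sequential patterns, and subgraph patterns.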

A course from the University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining

156 ratings

From this lesson

Module 2

Module 2 covers two lessons: Lessons 3 and 4. In Lesson 3, we discuss pattern evaluation and learn what kind of interesting measures should be used in pattern analysis. We show that the support-confidence framework is inadequate for pattern evaluation, and even the popularly used lift and chi-square measures may not be good under certain situations. We introduce the concept of null-invariance and introduce a new null-invariant measure for pattern evaluation. In Lesson 4, we examine the issues on mining a diverse spectrum of patterns. We learn the concepts of and mining methods for multiple-level associations, multi-dimensional associations, quantitative associations, negative correlations, compressed patterns, and redundancy-aware patterns.

- Jiawei Han, Abel Bliss Professor

Department of Computer Science

[SOUND] Now we come to

another interesting issue: mining quantitative associations.

What is a quantitative association?

A quantitative association is one in which some attributes

are numerical, like age and salary.

So how do we mine such rules?

One way is static discretization.

The reason we need to discretize is that if you do not,

you are treating every possible age and salary value as a separate item.

You will not be able to find any interesting or

sensible rules with sufficient support.

But if we partition age into ten-year intervals,

or income into $10,000 intervals, using some predefined

concept hierarchy, we can construct a data cube and

generate some interesting associations.
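
As a minimal sketch of static discretization, the Python below bins age into ten-year intervals and income into $10,000 intervals, then keeps the interval pairs that reach a minimum support. The records, the helper names `bin_label` and `frequent_interval_pairs`, and the 0.25 support threshold are all assumptions made for illustration; none of them come from the lecture.

```python
from collections import Counter

# Hypothetical (age, income) records; the values are made up for illustration.
records = [
    (23, 18_000), (27, 24_000), (25, 21_000), (34, 41_000),
    (38, 47_000), (31, 43_000), (36, 45_000), (52, 83_000),
]

AGE_WIDTH = 10          # partition age every ten years
INCOME_WIDTH = 10_000   # partition income every $10,000


def bin_label(value, width):
    """Return the fixed-width interval [low, high) that contains value."""
    low = (value // width) * width
    return (low, low + width)


def frequent_interval_pairs(rows, min_support):
    """Count (age interval, income interval) pairs and keep the frequent ones."""
    counts = Counter(
        (bin_label(age, AGE_WIDTH), bin_label(income, INCOME_WIDTH))
        for age, income in rows
    )
    total = len(rows)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# min_support=0.25 is an assumed threshold, not a value from the lecture.
frequent = frequent_interval_pairs(records, min_support=0.25)
for (age_bin, income_bin), support in sorted(frequent.items()):
    print(f"age in {age_bin}, income in {income_bin}: support = {support:.3f}")
```

In this toy data every raw (age, income) pair occurs only once, so no raw pair could reach the threshold; only after binning do some interval pairs become frequent.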

But this fixed,

predefined concept hierarchy may not fit your data distribution.

For example, at a university,

you would likely want to partition age finely for students:

18 to 20, 20 to 22, or something like that.

For income, you may make each $10,000 one partition, or just use low and high.

But at a hospital,

you may prefer to describe the age distribution as young, middle-aged, or old.

So another way is to cluster based on the data.

That means we take each dimension, study the distribution of age and

income, run a clustering algorithm to generate a few clusters,

and then find the frequent patterns over pairs of such clusters.
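
A rough sketch of this clustering-based approach, under stated assumptions: the lecture does not name a clustering algorithm, so the code below swaps in a plain one-dimensional k-means (Lloyd's algorithm) per attribute, with k = 3 and a deterministic start. The records and the helper names `kmeans_1d` and `nearest_center` are hypothetical.

```python
from collections import Counter

# Hypothetical hospital records as (age, yearly income); values are made up.
records = [
    (4, 0), (6, 0), (9, 0),
    (35, 52_000), (41, 58_000), (44, 61_000), (47, 55_000),
    (71, 21_000), (75, 19_000), (78, 23_000), (80, 18_000), (83, 20_000),
]


def nearest_center(value, centers):
    """Index of the closest center, used as the discretized attribute value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values, k, iterations=100):
    """Lloyd's k-means on one numeric attribute; returns sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest_center(v, centers)].append(v)
        # Keep the old center if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


# Cluster each dimension separately (k=3 is an assumed choice).
age_centers = kmeans_1d([age for age, _ in records], k=3)
income_centers = kmeans_1d([income for _, income in records], k=3)

# Count how often each (age cluster, income cluster) pair occurs.
pair_counts = Counter(
    (nearest_center(age, age_centers), nearest_center(income, income_centers))
    for age, income in records
)
for (age_cluster, income_cluster), n in sorted(pair_counts.items()):
    print(
        f"age ~ {age_centers[age_cluster]:.1f}, "
        f"income ~ {income_centers[income_cluster]:.0f}: "
        f"support = {n / len(records):.3f}"
    )
```

Unlike fixed ten-year bins, the interval boundaries here follow the data, so a different population (say, university students) would produce different age groups from the same code.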

Finally, another popular way is deviation analysis.

That means instead of using fixed intervals,

we select a subset by some condition, like gender is female.

We compute its mean or median or some other statistical measure, and

if the mean wage of that subset deviates substantially from

the overall mean, then this could be an interesting rule.

Let's go a little further to see how to find such deviations,

which we also call extraordinary or interesting phenomena.

Usually, the left-hand side of such a rule is a subset of the population and

the right-hand side is some kind of extraordinary behavior, expressed

using a statistical measure that deviates from the overall population.

To decide whether it is a true rule or just a chance exception,

we need to do a statistical test, like a Z-test,

to confirm that the rule holds with high confidence.
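
To make the Z-test step concrete, here is a small sketch for a rule of the form "gender = female => mean wage = m". It assumes a two-sample Z statistic that compares the subset's mean with the mean of the rest of the population, and a 0.05 significance level; the lecture fixes neither choice, the wage values are made up, and the helper name `z_test` is hypothetical. A Z-test is a large-sample test, so the tiny sample is for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical (gender, hourly wage) records; values are made up.
records = [
    ("F", 7.0), ("F", 7.5), ("F", 8.0), ("F", 7.2),
    ("F", 7.8), ("F", 7.4), ("F", 7.9), ("F", 7.1),
    ("M", 9.0), ("M", 9.5), ("M", 8.8), ("M", 9.2),
    ("M", 9.9), ("M", 9.1), ("M", 8.7), ("M", 9.4),
]


def z_test(subset, rest):
    """Two-sample Z statistic and two-sided p-value for a difference in means."""
    standard_error = sqrt(variance(subset) / len(subset) + variance(rest) / len(rest))
    z = (mean(subset) - mean(rest)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Left-hand side of the rule: the subset where gender is female.
female = [wage for gender, wage in records if gender == "F"]
others = [wage for gender, wage in records if gender != "F"]
overall_mean = mean([wage for _, wage in records])

z, p_value = z_test(female, others)

ALPHA = 0.05  # assumed significance level
if p_value < ALPHA:
    print(
        f"gender = female => mean wage = {mean(female):.2f} "
        f"(overall mean = {overall_mean:.2f}, z = {z:.2f})"
    )
else:
    print("deviation is not statistically significant; no rule reported")
```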

Further, in many cases,

you may want to go deeper and take a subset of that subset: for

example, not only gender is female, but also location is south.

You may find that the wage deviates further from the overall mean,

or even from the broader rule, gender is female.

Such a subrule becomes an extraordinary subrule

associated with its super-rule.
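
One way to sketch the subrule check is to test the narrower subset against the remainder of its super-rule population rather than against everyone. The code below does that for "gender = female and location = south" inside "gender = female". The data, the helper name `z_test`, the choice of comparison group, and the 0.05 threshold are assumptions for illustration, not details given in the lecture.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical (gender, location, hourly wage) records; values are made up.
records = [
    ("F", "south", 6.1), ("F", "south", 6.4), ("F", "south", 6.0), ("F", "south", 6.3),
    ("F", "north", 7.9), ("F", "north", 8.2), ("F", "north", 7.8), ("F", "north", 8.1),
    ("M", "south", 9.0), ("M", "south", 9.3), ("M", "north", 9.6), ("M", "north", 9.2),
]


def z_test(subset, rest):
    """Two-sample Z statistic and two-sided p-value for a difference in means."""
    standard_error = sqrt(variance(subset) / len(subset) + variance(rest) / len(rest))
    z = (mean(subset) - mean(rest)) / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Super-rule population: gender is female.
super_rule = [w for g, _, w in records if g == "F"]
# Subrule population: gender is female AND location is south.
sub_rule = [w for g, loc, w in records if g == "F" and loc == "south"]
# Comparison group: the rest of the super-rule population.
rest_of_super = [w for g, loc, w in records if g == "F" and loc != "south"]

z, p_value = z_test(sub_rule, rest_of_super)

if p_value < 0.05:  # assumed significance level
    print(
        f"gender = female AND location = south => mean wage = {mean(sub_rule):.2f} "
        f"(super-rule mean = {mean(super_rule):.2f}, z = {z:.2f})"
    )
```

Comparing against the super-rule keeps the subrule from being reported merely because its parent rule already deviates from the overall mean.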

Sometimes the left-hand side is not a categorical subset of the population,

but numerical data that you group into certain intervals or clusters.

For example, the left-hand side could be

education of many years, like 14 to 18 years, and

you find that the mean wage is substantially higher than the overall mean,

so that may form another interesting quantitative rule.

Efficient methods have been developed to mine such

extraordinary rules.

For example, a research paper published by Aumann and

Lindell at KDD'99 presents an interesting algorithm and case study.

[MUSIC]