这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 伊利诺伊大学香槟分校 的课程

Pattern Discovery in Data Mining

119 评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，学习模式发现深入的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Module 3

Module 3 consists of two lessons: Lessons 5 and 6. In Lesson 5, we discuss mining sequential patterns. We will learn several popular and efficient sequential pattern mining methods, including an Apriori-based sequential pattern mining method, GSP; a vertical data format-based sequential pattern method, SPADE; and a pattern-growth-based sequential pattern mining method, PrefixSpan. We will also learn how to directly mine closed sequential patterns. In Lesson 6, we will study concepts and methods for mining spatiotemporal and trajectory patterns as one kind of pattern mining applications. We will introduce a few popular kinds of patterns and their mining methods, including mining spatial associations, mining spatial colocation patterns, mining and aggregating patterns over multiple trajectories, mining semantics-rich movement patterns, and mining periodic movement patterns.

- Jiawei HanAbel Bliss Professor

Department of Computer Science

[SOUND] Now, I'm going to introduce

an interesting algorithm called GSP,

that's Apriori-Based Sequential Pattern Mining mass search.

This mass search is pretty simple.

You start from a sequence database.

Then you first try to get the singleton sequences like the first one appears.

That means you scan database once, you count every single

item that occurred in every sequence.

Then you can see a occurs three times, occurs five times.

So if you said min_sup = 2,

likely g and h will be gone because their support is only 1.

Then you find frequent lens one subsequent a, b, c, d, e, f.

With this you can combine them as candidate

sub-sequences, you may have aa, ab, ac, ba, bb,

bc remember, aa is still important means, you first buy aa then, buy another a.

Then, for the, for the shopping basket, you may get ab together.

That's why you may have ab is one event, one element, ac is another element.

You may get these kind of subsequences.

In total, you look at six by six, then you get it no six by five divided by two,

you will get this number of candidate size two sequence.

But without Apriori, without this pruning, you'll get much more.

So even this minimal pruning still can reduce search space substantially.

So in general, we can work out the algorithm like this.

You put this one into a loop.

At the very beginning, you can scan to find length k, which is length one,

frequent sequence.

Then, based on the Apriori you can generate the length two or

length k plus one, candidate sequence.

then you can go back to check and find the real frequent sequence,

then you go back to the Apriori based candidate generation.

So generation test, you can go into the loop until no frequent sequence or

not candidates can be found.

So if you look at the execution sequence for this particular dataset,

you will find, at the very beginning you get a, b, c, d, e, f.

And g, h, will be gone.

They are painted in blue.

So then you can, from here you can go up,

get a frequent length-2 sub-sequences.

Then you can generate lengths three, length four, length five,

and here, you cannot go along anymore.

So, this is the GSP algorithm

[MUSIC]