这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

数据可视化

574 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Week 3: Visualization of Non-Numerical Data

In this week's module, you will learn how to visualize graphs that depict relationships between data items. You'll also plot data using coordinates that are not specifically provided by the data set.

- John C. HartProfessor of Computer Science

Department of Computer Science

[MUSIC]

So principal component analysis is based on linear algebra,

on eigenvectors and eigenvalues of matrices.

You don't necessarily have to understand those details in order to use

Principle component analysis.

If you have a package that will perform principle component analysis then it will

just convert the data from a high dimensional form into a low dimensional

form that you can display.

So as we lay out data, and look at ways of laying out data based on relationships.

It's useful to look at some other techniques for laying out data.

Even if the data does have variables, or

attributes, or coordinates associated with it.

So, one of these techniques is called principal component analysis.

If I have a data set, like this one,

each data value has an x coordinate and a y coordinate.

But you can really see the main variance of the data is happening along

this diagonal direction, and

the data isn't varying much perpendicularly to that direction.

So the x coordinate doesn't entirely capture the data, and

the y coordinate doesn't entirely capture the variance of the data.

the variance is actually happening in a different direction.

So we can look at the variance of the data as being

the actual dimension where things are changing along the data.

And it may not be exactly the x direction or the y direction.

It may be a combination of them.

And as we look at higher dimensional data,

we may find that the data is varying in some direction, that's a combination of X,

Y, Z, W...however many dimensions you have.

Principle component analysis helps us find these directions where the data is varying

a lot.

And also the directions that the data is varying a little bit.

And then we can reorient it so

that we can just display the dimensions where the data's varying a lot and

ignore the directions that the data is varying a little bit.

And then align those dimension with the displayed dimensions.

So in order to perform principle component analysis, we have to analyze the data.

First we have to take the data and then we have to subtract the mean.

So for each of these data items we're going to add them up, divide by the total

and that'll give us the, by the total number and then that'll give us the mean.

We subtract that from all the data.

That centers the data about the origin,

and then we're going to compute this covariance matrix.

So it's a big matrix, if our data elements are four dimensions,

X, Y, Z and W then the covariance matrix would be a four by four matrix.

It'll have this many dimensions by this many dimensions.

And it's just the data items, after you've subtracted their mean,

multiplied by the other coordinates of the data items and then averaged.

And that covariance matrix will yield a symmetric matrix because, for

example, yi times xi is going to be equal to xi time yi.

And so you can think of a matrix, a large matrix, as

something that transforms a sphere in high dimensions into some kind of ellipsoid.

It's going to stretch it out in a certain direction.

And when you square that matrix, multiply it times its transpose,

it's going to create a square symmetric matrix similar to that covariance matrix.

When we pull out the Eigenvectors and the Eigenvalues of that matrix,

these are directions that, when you multiply the matrix times that direction,

it's the same thing as just multiplying it by a scalar value.

So these are directions where if a matrix is stretching a sphere

into an ellipsoid, so the directions of the principle axis of that ellipsoid.

And so if you have a lot of variance, you'll pull out an eigenvector in this

direction that's stretching the sphere out into a long ellipsoid.

There may be two principal directions in which case,

that sphere gets stretched out into this cookie shape.

Principal component analysis finds those eigenvectors, the direction where

the data is varying the most and then, it ignores

the small eigenvectors, the directions where the data is varying the least and

that way we can display the two or three most varying

directions of data without displaying all of the dimensions of the data.

So here's an example.

I worked with a group of people a few years ago.

On a method for computing the geometry

of a scaffolding to support bone growth in damaged mandibles.

And so we needed some geometry for

a mandible in order to figure out healthy mandible, and a damaged

mandible to be able to do a subtraction operation to find this damaged piece.

So if we don't have an undamaged mandible, how could we find a mandible that

was the shape of the damaged mandible before the damage happened?

And the way we did this was we scanned in a bunch of these mandibles and

we specifically measured the shape of these five data points.

And so we had this big database of 40 some mandibles, and

each data point, each of those 40 data points,

was 15-dimensional because each of these three data points was three-dimensional.

So five three dimensional data points gave me 15 dimensions.

So I could describe the shape of a mandible as a 15 dimensional point.

We then perform principal component analysis to find three main variations.

Three of the directions in this 15 dimensional space

where these mandibles were changing the most.

And they were kind of this direction here that pulls the jaw down,

this direction that causes the jaw to kind of extrude in the front, and

then a direction that caused the jaw to move to the side.

And so based on that, we can plot all of our data points and

this is an example of the mandible, all the different data sets of our mandibles.

Plotted as a single, just overlaid on top of each other.

Each one of these corresponds to a position in this space of mandibles.

And the three dimensions here that I'm visualizing these mandibles in

corresponds to a direction, a variation

that I found from that principal component analysis, each of those three dimensions.

So each of these points corresponds to a point in a 15 dimensional space, but

I'm plotting them in a 3 dimensional space according to

the direction of their highest variance,

as represented by the eigenvector of that covariance matrix.

And so that performs a form of dimensionality reduction,

where we can take high dimensional data and plotted in a low-dimensional space for

visualization by basically focusing on the dimensions where we're seeing variation

and ignoring the directions where we're not seeing variation in the data.

So in order to implement multidimensional scaling, you need to do an optimization.

You don't necessarily have to code up your own optimization to do that.

For example when I computed the multidimensional scaling of the cities

in my example, they were computed in Excel using an optimization package I enabled.

So you don't have to write an optimization algorithm from scratch.

You can just use an existing optimization program.

[MUSIC]