这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

数据可视化

555 个评分

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Week 3: Visualization of Non-Numerical Data

In this week's module, you will learn how to visualize graphs that depict relationships between data items. You'll also plot data using coordinates that are not specifically provided by the data set.

- John C. HartProfessor of Computer Science

Department of Computer Science

[SOUND] So

a second method for dimensionality reduction is multidimensional scaling.

It focuses on the distance between related items,

as opposed to their actual positions.

So multidimensional scaling is a form of dimensionality reduction.

We previously looked at principle component analysis as a method for

dimensionality reduction.

We have high dimensional data, and

we want to display it on a low dimensional display.

On just using maybe two or three axes in order to be able to better visually

perceive the data, even if the data is high dimensional.

And so we've looked at this in terms of visualizing graphs.

Graphs encode relationships between data points where the length of the edge is

irrelevant.

But if we want to look at a graph where the length is relevant,

we might want to look at ways of plotting the nodes of a graph or data,

based on the relationship between data points, based on distance.

So that we can try to preserve the distance in, for example,

a low dimensional, a two dimensional visualization of high dimensional data.

We may want the distance between data points in two dimensions to somehow

indicate the distance between the data points in higher dimensions.

And so that sets up a metric MDS,

metric multidimensional scaling where you try to preserve the distance between

data points even though you're displaying them at a lower dimensional space.

So here's a nice example.

We're not doing dimensionality reduction because we'll go from a two dimensional

space to a two dimensional space.

But it illustrates how we would compute a multidimensional scaling to try to

preserve the distance between points.

So here's distances between cities in the US, for example.

So we've got Boston, NYC, DC, Miami, Chicago, Seattle, SF, LA, and Denver.

And so the distance between, for example, Chicago and

NYC would be 803 miles in this example.

So the lower triangle of this data is showing the distance between

the city on the left and the city in the row, and the city in the column.

And so there'd be 0 distance between a city and itself.

And this will be symmetric, so

I haven't bothered to fill in the upper triangle of this matrix.

So given those distances, dij, between data points i and j,

between city i and city j, can we recover the positions of those cities?

So I'm not taking the coordinates of those cities,

I'm just taking the distances between those cities.

Basically taking a graph where I'm specifying the length of the edges between

the nodes.

And then I want to find the positions of the nodes that satisfies

those edge lengths.

So we can do that by minimizing this function.

We basically take the distance between the two data points,

and then we want to subtract their actual distance, and we want to minimize that.

So if the distance between two data points equals the desired distance

between those two data points, then this should be close to zero.

And we square things here so that we keep everything positive so

that negative values aren't messing up our minimization.

So we can solve this with any number of non-linear optimization methods.

In the example I'm going to describe, I solved it just using

the Excel spreadsheet program with an optimizer plug-in.

So here's the results that I got,

I minimized that distance relationship between the actual

distance from my optimized variables and the desired distance.

And as you can see, things are obviously in the wrong place.

I've got the East Coast on the left and West Coast on the east, but

that should still preserve distance.

And there's a few other differences,

but basically I've got this general east to west trend.

Seattle is above SF and LA, Denver is here, Chicago is where it's at.

So I've preserved the distance of my

points even though I haven't preserved the exact geometry of the points.

And so this can be a useful layout method, especially for high dimensional data when

you're trying to preserve the distances between the points without necessarily

representing accurately their spatial configuration, their original coordinates.

So we've thrown away the original coordinates of these points.

And just using their distance relationship,

tried to preserve that in a reprojection, in this case in two dimensions.

So there's all sorts of ways you can use MDS, multidimensional scaling.

You can visualize affinities between data points,

areas of collaboration based on co-authorship of papers.

If you're visualizing papers or the number of any other attributes two data

points might have in common, anything you can describe as a distance.

It doesn't have to be coordinate distance, it can be other similarities.

It's used in human-computer interaction for

user interface design to basically lay out buttons.

If you have a dashboard with a lot of buttons, you can lay out the buttons by

organizing them based on how similar the buttons are to a given task.

And then,

let multidimensional scaling compute the coordinates of where the buttons should be

located so that buttons that are used together are close to each other.

And it's also used in marketing to create what are called perceptual maps

that are based on survey data of what people think of different products,

what attributes people assign to different products.

And then you can figure out what products are similar to other products and

create a map of products that way.

So, multidimensional scaling is enabled by optimization.

You don't have to write your own optimization program,

you can use pre-existing optimization packages.

You just need something that's going to minimize a function, and so

you need some form of non-linear minimization in an optimization package.

I found one that was enabled in Excel in order to do

the example I described in the slides.

[MUSIC]