这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

Loading...

来自 University of Illinois at Urbana-Champaign 的课程

数据可视化

687 个评分

At Coursera, you will find the best lectures in the world. Here are some of our personalized recommendations for you

这这一课程中，我们将学习数据挖掘的基本概念及其基础的方法和应用，然后深入到数据挖掘的子领域——模式发现中，深入学习模式发现的概念、方法，及应用。我们也将介绍基于模式进行分类的方法以及一些模式发现有趣的应用。这一课程将给你提供学习技能和实践的机会，将可扩展的模式发现方法应用在在大体量交易数据上，讨论模式评估指标，以及学习用于挖掘各类不同的模式、序列模式，以及子图模式的方法。

从本节课中

Week 2: Visualization of Numerical Data

In this week's module, you will start to think about how to visualize data effectively. This will include assigning data to appropriate chart elements, using glyphs, parallel coordinates, and streamgraphs, as well as implementing principles of design and color to make your visualizations more engaging and effective.

- John C. HartProfessor of Computer Science

Department of Computer Science

[SOUND]

So, if we create stacked graph layouts,

stacked bar charts, and stacked line graphs of a lot of dependent variables,

then you get some interesting effects as displayed here.

You start to see kind of a stream effect, and you can see the same

data variable evolving as it moves across the horizontal axis.

But you also see some other problems with this representation.

One is that the variants in the lower variables

in the stack can influence the shape of the higher variables.

And it can be harder to see when things get sheared towards the outsides,

whether they're representing the variables the same value or if it's increasing or

decreasing.

And also you can see that by the time you get to the very top of the graph,

it's changing quite violently because it's the sum of all the changes that have

happened below it.

We can analyze this by using a few variables.

If we let each variable at a given horizontal position on the horizontal

axis, each variable be represented by say, f1 for the first variable,

f2 for the second variable And so on, f1, f2, f3.

Then we can define, basically, ground zero, g0, to be the baseline level.

And in this case, we're just setting g0 to be 0, just at the horizontal axis.

Then g i is the position of the top of the plot of the i'th variable.

And the top of the plot of the i'th variable g i is just equal to g0

plus that variable and all the variables below it in the stack.

And we can start to do some analysis.

And because of this analysis, there is an alternative layout called ThemeRiver.

And ThemeRiver basically centered the vertical plot,

the stack of variables along the horizontal axis.

Set it set g0 to be one-half of the total height of the stack of variables.

And by doing that, it basically said that the way this thing is varying at the top

will be a mirror image of the way this thing is varying at the bottom.

And you get more of a, the appearance of a river that data is kind of streaming by

and evolving and it reduced the amount of shearing that was happening,

but it didn't eliminate it completely.

You can see for example, this region right here, is creating a large

shift in the data around it that you'd like to be able to minimize.

So just by centering the vertical stack of data around the horizontal axis

minimizes the heighth of the chart around

that horizontal axis and it also minimizes the slope at the top and the bottom.

There's a Streamgraph layout that does an even better job of this and

it does this just by changing where the position g0 is.

Streamgraph sets g0 equal to the result of this formula.

So you just evaluate this formula based on your data values f i,

and we won't go through the derivation of this formula.

But by just changing where the base of this stacked bar chart or

stacked line chart occurs and then stacking based to that new baseline,

we get an even smoother appearance and it makes it even easier for

us to make comparisons as we move horizontally across this chart,

to determine relative changes to each of these dependent variables.

It minimizes the deviation and the wiggle, the deviation being how far

a variable's plot moves from its previous position on the horizontal axis,

and the wiggle is minimizing the slope, basically the sheer effect

that you get from the wiggle.

You can also improve the appearance by changing the order in which you

add variables.

So, if variables are zero until a certain position on the horizontal axis,

and then they change from zero to some value,

you can change where they appear in the stack of variables.

And so in this case, they're just stacked in some fixed ordering,

some arbitrary ordering.

And you can see kind of a stream in the coloring.

Coloring each variable differently helps perceive that stream.

In this case they're colored based on when the variable

starts to take on a value other than zero.

But you can also add new variables

when the variables take on a value other than zero.

You can always add them to the outside of the graph.

And what that does, is it takes variables that start out

at a certain point on the horizontal axis, as soon as they take on the nonzero value,

you add them but you add them to the outside, so

their initial surge in value doesn't disturb the other variables.

It's happening on the outside of the graph.

And then as it's waning, other variables

are taking non-zero values and being added to the outside of the graph.

And so you get this nice, flowing appearance where you can see the relative

change in values of these variables as you move right on the horizontal axis,

even though we're looking at 10 or 20 variables in a given stack, vertically.

So we learned how to make a stream graph, and

how something as simple as stacking a bar chart can become

a point of further investigation in order to how to display it effectively.

We also learned that, for example,

a pie chart can become misleading especially when shown in three dimensions.

[MUSIC]