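In this course, we will learn the basic concepts of data mining and its fundamental methods and applications, and then go deeper into a subfield of data mining, pattern discovery, studying its concepts, methods, and applications in depth. We will also introduce pattern-based classification methods and some interesting applications of pattern discovery. The course gives you the opportunity to learn skills and practice them by applying scalable pattern discovery methods to large volumes of transactional data, discussing pattern evaluation measures, and learning methods for mining diverse kinds of patterns, sequential patterns, and sub-graph patterns.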


A course from the University of Illinois at Urbana-Champaign

Data Visualization


From this lesson

Week 4: The Visualization Dashboard

In this week's module, you will start to put together everything you've learned by designing your own visualization system for large datasets and dashboards. You'll create and interpret the visualization you created from your data set, and you'll also apply techniques from user-interface design to create an effective visualization system.

- John C. Hart, Professor of Computer Science

Department of Computer Science

So, now that we understand information visualization and information visualization systems, we can start to use them to visualize data held in databases, and we can use them in concert with data mining.

Often, when we visualize data, the data is going to come from a database. During an interactive visualization session, when we're trying to investigate the data for our own purposes and gain insight into it, it is quite useful and effective if we can connect the visualization tools to queries against that database, so that it's easier to dig deeper into the data directly.

So, there are databases that support this. Modern databases often support OLAP, Online Analytical Processing, which basically allows them to be accessed over the web using standard protocols.

Conceptually, a good mental model for the data is the data cube metaphor. If you have, for example, a sales database that records the amount of sales on a given date, of a given product, at a given location, then you can think of the keys to that dataset (the date, the product, and the location) as the dimensions of a cube. Each row of the data corresponds to a cell in the cube. So if the date is the horizontal axis, the product is the vertical axis, and the location is the depth axis, then you've got three coordinates, and they give you a particular cell in the cube. That cell represents the amount of sales on that date, of that product, at that location.
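As a rough illustration of the cube metaphor, the sketch below stores each row of a sales table as one cell addressed by its three keys. The dates, products, locations, amounts, and the helper name `cell` are all made up for the example; they are not from the lecture.

```python
# Hypothetical sales rows: (date, product, location, amount).
# Every name and number here is invented for illustration.
rows = [
    ("2023-Q1", "tea", "Chicago", 120.0),
    ("2023-Q1", "coffee", "Chicago", 300.0),
    ("2023-Q1", "tea", "Urbana", 80.0),
    ("2023-Q2", "coffee", "Urbana", 150.0),
]

# The three keys act as coordinates; the measure (sales) is the cell's value.
cube = {(date, product, location): amount
        for date, product, location, amount in rows}


def cell(cube, date, product, location):
    """Return the sales in one cell, or 0.0 if no row fills that cell."""
    return cube.get((date, product, location), 0.0)


print(cell(cube, "2023-Q1", "tea", "Urbana"))
```

A dictionary keyed by coordinate tuples is only one possible representation; it is convenient here because most cells of a real cube are empty.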

The challenge is going to be taking this dataset and visualizing it. I'm using an example of a cube with three dimensions, and we're used to perceiving three dimensions, but we perceive three dimensions by projecting them into two dimensions and looking at that. In real-world applications you're going to have many more keys, so this data cube is actually a data hypercube with many more than three dimensions. The challenge is going to be taking this high-dimensional data space, this data hypercube, and finding the appropriate two dimensions in which to investigate it. You can then always add more dimensions as glyphs or with other visualization tools.

So, the way we're going to reduce the dimensions of the dataset is largely with data aggregation, and a large number of database visualization tools are based on it. Common ones like Tableau, SAP's Lumira, or even the pivot tables available in Microsoft Excel are based on data aggregation as a way of simplifying a dataset and reducing the dimensions of variation by providing summaries of the data.

So, for example, here's some data. We're plotting some quantitative value vertically as we change some other quantitative value horizontally, so you've got an independent variable here and a dependent variable here. This is a two-dimensional plot. We could always project it onto a one-dimensional axis, and we can also summarize the data: we can collapse these values, these measures, changing across one axis into a single value by representing them by the mean value, the maximum value, the minimum value, or the sum of the values. The mean is just the sum divided by the number of values along the dimension.

So, we have various operators: sum, mean, median, minimum, maximum. There are other statistical operators, like standard deviation or variance, and other characteristics you can use that will simplify the data and basically remove one dimension. In this case, we're reducing the change in value over one dimension to a single value over no dimensions.
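A minimal sketch of these operators, using only Python's standard library: each one collapses a list of measure values along one dimension into a single number. The sample values are invented, and the choice of the population standard deviation is an assumption made for the example.

```python
import statistics

# Hypothetical measure values sampled along one dimension.
values = [4.0, 8.0, 6.0, 2.0, 10.0]

# Each operator reduces the whole list to one number.
summary = {
    "sum": sum(values),
    "mean": sum(values) / len(values),   # sum divided by the number of values
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
    "stdev": statistics.pstdev(values),  # population standard deviation
}

for name, value in summary.items():
    print(name, value)
```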

We can also convert continuous quantitative data into discrete quantitative data, or into ordinal or nominal data, and the count function does that. For example, if we have these blue, red, and green data points, there are three categories here, and they're plotted over a two-dimensional area that represents continuous quantitative data horizontally and vertically. We can then remove those axes and just have a single axis representing the count: the number of blue items, the number of red items, and the number of green items. That reduces the dimensionality of the dataset from varying over two dimensions with a third, categorical dimension to just a count per category over essentially one dimension.
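The count aggregation can be sketched as follows, assuming a handful of made-up points that each carry two continuous coordinates and a color category. The coordinates are discarded and only the number of items per category is kept.

```python
from collections import Counter

# Hypothetical points: (x, y, category). Values are invented.
points = [
    (0.2, 1.5, "blue"), (0.9, 0.3, "red"), (1.4, 2.2, "blue"),
    (2.1, 0.8, "green"), (2.5, 1.9, "red"), (3.0, 0.1, "blue"),
]

# Drop the two continuous axes and count the items in each category.
counts = Counter(category for _x, _y, category in points)

print(counts)
```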

Next, there's binning. Binning is a discretization method that converts continuous quantitative data into discrete quantitative data, or into ordinal or nominal data, depending on how you use it. For example, we may have a continuous function, with values varying over a single dimension. Instead of representing these values continuously over a continuous dimension, we can discretize the dimension. Then our quantitative variable is sampled over four regions and represented, in this case, by the mean value over each of those four bins. So, we're doing a projection in this case over four areas instead of over a single area. But again, that does reduce the dimensionality of the dataset.
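Here is one way the four-bin example might look in code. The function name `bin_means`, the equal-width bins, and the sample points are assumptions for illustration; the lecture does not specify how the bin boundaries are chosen.

```python
def bin_means(samples, lo, hi, n_bins):
    """Average the y values of (x, y) samples in n_bins equal-width bins on [lo, hi)."""
    width = (hi - lo) / n_bins
    buckets = [[] for _ in range(n_bins)]
    for x, y in samples:
        if lo <= x < hi:
            # Clamp to guard against floating-point rounding at the upper edge.
            index = min(int((x - lo) / width), n_bins - 1)
            buckets[index].append(y)
    # An empty bin has no mean, so it is reported as None.
    return [sum(b) / len(b) if b else None for b in buckets]


# Hypothetical samples of a quantity varying over one continuous dimension.
samples = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.0, 8.0),
           (4.5, 1.0), (6.5, 3.0), (7.0, 5.0)]

means = bin_means(samples, lo=0.0, hi=8.0, n_bins=4)
print(means)
```

Swapping the mean for a sum, minimum, or maximum inside the final list comprehension gives the other per-bin summaries mentioned earlier.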

Finally, we can use this to create a histogram. A histogram is binning in the vertical direction, in this case, instead of the horizontal direction. So, instead of having bins over the dimension, you have bins over the value that you're plotting over that dimension. In this case, we've separated the values into A, B, C, D, and E ranges. Then we count all the values in the A range, all the values in the B range, all the values in the C range, and all the values in the D range, and just plot those counts. There are no values in the E range, so the E range would be zero. With a histogram, how you choose these buckets, the size of these bins, gives you other characteristics of the data, and it can reduce the dimensionality of the data, or at least reduce a continuous variable into discrete categories.
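A small sketch of a histogram in the usual sense, where the bins are placed over the plotted values and each bar records how many values fall in its range. The range labels A to E follow the lecture; the bin edges and the values themselves are invented so that range E comes out empty.

```python
def histogram(values, edges, labels):
    """Count how many values fall in each half-open range [edges[i], edges[i+1])."""
    counts = {label: 0 for label in labels}
    for v in values:
        for label, lo, hi in zip(labels, edges, edges[1:]):
            if lo <= v < hi:
                counts[label] += 1
                break
    return counts


# Hypothetical measured values and bin edges.
values = [1.0, 3.5, 2.2, 7.9, 4.1, 5.0, 0.4, 6.3, 3.9]
edges = [0, 2, 4, 6, 8, 10]
labels = ["A", "B", "C", "D", "E"]

counts = histogram(values, edges, labels)
print(counts)
```

Changing `edges` changes the picture of the data, which is the point about bucket size made above.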

So, our data cubes can use these aggregation operations to simplify a dataset, and that provides a useful tool for investigating it. For example, take our data cube, which in this case has location horizontally, products vertically, and time in depth. If we want to look at all of our sales of tea, coffee, espresso, or other products at different times, but we don't care about the location, then we can project this data cube onto a square, a two-dimensional region. This two-dimensional projection sums up all of the sales of a given product at a given time, regardless of location. It sums them over all the locations, and that projection gives us something we can then visualize.

We can summarize further by projecting this two-dimensional data cube into a one-dimensional data cube, summing, in this case, over product. If we don't care about the differences between the products, we can aggregate along the product axis by summing over it. Now we have a one-dimensional cube, basically just a list of numbers that tells us the amount of sales for the first quarter, second quarter, third quarter, and fourth quarter. If we don't care about the time either, then we can just look at the total sales over a given span of time, over a given set of products, over a given set of locations. Adding all those up gives us a zero-dimensional data cube.
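The chain of projections from three dimensions down to zero can be sketched with a helper that sums out one key position at a time. The cube contents and the helper name `sum_out` are hypothetical, and the key order (quarter, product, location) is chosen only for this example.

```python
from collections import defaultdict

# Hypothetical cube: (quarter, product, location) -> sales.
cube = {
    ("Q1", "tea", "north"): 10, ("Q1", "tea", "south"): 5,
    ("Q1", "coffee", "north"): 20, ("Q2", "tea", "north"): 7,
    ("Q2", "coffee", "south"): 12, ("Q3", "espresso", "north"): 9,
}


def sum_out(cube, axis):
    """Sum over one key position, returning a cube with one fewer dimension."""
    result = defaultdict(int)
    for key, value in cube.items():
        reduced = key[:axis] + key[axis + 1:]  # drop the aggregated coordinate
        result[reduced] += value
    return dict(result)


by_time_product = sum_out(cube, axis=2)      # 3-D -> 2-D: ignore location
by_time = sum_out(by_time_product, axis=1)   # 2-D -> 1-D: ignore product
total = sum_out(by_time, axis=0)             # 1-D -> 0-D: ignore time

print(by_time_product)
print(by_time)
print(total)
```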

So, here's an example using Tableau to show how aggregations are used for data cube operations. Here we have some dimensions of our data. The data is collected over ten years, so we have separate data by year, and over a variety of different countries. What I'm showing here is our standard plot of population, logarithmically, on the horizontal axis against life expectancy on the vertical axis. You'll see that we're averaging the data and we have a single data point. We basically have zero-dimensional data: it's all being projected to a single point because we're averaging over all these dimensions, specifically over the time dimension, all the years for which we have data, and over all the countries for which we have data.

If we want to disaggregate this data, we can do that by dragging a dimension, for example country, into the Marks area. Now the data is spread out. Each of these data points represents a different country, so the averages are taken only over the years in which the data was collected, and not over the country or the region. We can further disaggregate over the year, and we get this plot, which shows how each country changes over the years. If we want to see the correspondence between points for the same country, I can drag the country onto color so that each country gets its own color. Then you can use the color to follow a country over the years once the data has been disaggregated. We can keep adding more and more dimensions from our data cube into this visualization.
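The effect of disaggregating can be imitated outside Tableau with a plain group-by. The sketch below is not Tableau's API; it is a Python analogy with invented country names and life-expectancy figures, in which the list of key fields plays the role of the dimensions dragged into the view.

```python
from collections import defaultdict

# Hypothetical records: (country, year, life_expectancy).
records = [
    ("A", 2000, 70.0), ("A", 2001, 72.0),
    ("B", 2000, 60.0), ("B", 2001, 64.0),
]

FIELD_INDEX = {"country": 0, "year": 1}


def average_by(records, key_fields):
    """Average life expectancy, keeping only the dimensions named in key_fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[FIELD_INDEX[name]] for name in key_fields)
        groups[key].append(record[2])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


overall = average_by(records, [])                            # a single point
per_country = average_by(records, ["country"])               # one point per country
per_country_year = average_by(records, ["country", "year"])  # one point per record

print(overall)
print(per_country)
print(per_country_year)
```

Each field added to `key_fields` removes one dimension from the averaging and adds it to the plot.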

So, when you're looking at an individual cell of these different forms of data cubes, you can work from a less detailed view to a more detailed view. A single cell, when it's aggregated, could represent a range of products, a range of locations, and a range of times, or it could represent the value at a specific location at a specific time. In its disaggregated form, it represents an actual data point. In its aggregated form, it represents an aggregate of the values over these ranges. So, this could be the total sales of all products, over all markets, over all time.

Then you can start to drill down into the details by focusing, for example, on a particular instance of time, or on a particular product, or on a particular type. Each time you do that, you're replacing an aggregated dimension, where a single value represents a range along that dimension, with a specific data point, a specific coordinate, along that dimension. So this data cube approach allows you, while you're performing your visualization, to drill down into the details and then to back out into more summary views.
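Drilling down and backing out can be sketched as a single query function in which each dimension is either fixed to a coordinate or left aggregated. The function name `query`, the use of `None` to mean "aggregate over this dimension", and the cube contents are all assumptions made for the example.

```python
def query(cube, date=None, product=None, location=None):
    """Sum the cells matching the fixed coordinates; None aggregates that dimension."""
    wanted = (date, product, location)
    return sum(
        value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )


# Hypothetical cube: (date, product, location) -> sales.
cube = {
    ("Q1", "tea", "north"): 10, ("Q1", "tea", "south"): 5,
    ("Q1", "coffee", "north"): 20, ("Q2", "tea", "north"): 7,
}

print(query(cube))                               # everything aggregated
print(query(cube, product="tea"))                # drill down on product
print(query(cube, date="Q1", product="tea"))     # drill down further on time
print(query(cube, "Q1", "tea", "south"))         # a single cell
```

Fixing one more argument is a drill-down step; setting an argument back to `None` backs out to a more summarized view.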