0:08

So now that we understand information visualization and

information visualization systems, we can start to use them to visualize

data held in databases, and we can use them in concert with data-mining.

So often, when we visualize data, the data is going to come from a database,

and it'll be quite useful and effective if, during that

interactive visualization session, when we're trying to investigate data for

our own purposes and gain insight into the data, we can connect the tools for

visualization into queries to that database, so

that it's easier to dig deeper into the data directly.

So there are databases that support this.

Often, modern databases support OLAP, online analytical processing,

which basically allows them to be accessed over the web using standard protocols.

And, conceptually, a good mental model for the data is this data cube metaphor.

And if you have, for example, a sales database that represents the amount of

sales on a given day of a given product at a given location,

then you can think of the keys to that data set, the date, the product, and

location as dimensions of, for example, a cube, here.

So each row of this data corresponds to a cell in this data cube, and so

if the date is the horizontal axis, the product is the vertical axis, and

the location is the depth axis in this cube, then you've got three coordinates,

and that gives you a particular cell in this cube, and that cell will represent

the amount of sales on that date, of that product, at that location.
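
As a minimal sketch of this mental model, we can treat the three keys as coordinates into a dictionary. The rows, place names, and amounts below are invented for illustration, and `cell` is a hypothetical helper, not part of any database API.

```python
# Hypothetical sales rows: (date, product, location, amount). Values are invented.
rows = [
    ("2024-01-01", "tea", "Chicago", 120.0),
    ("2024-01-01", "coffee", "Chicago", 300.0),
    ("2024-01-01", "tea", "Boston", 80.0),
    ("2024-01-02", "coffee", "Boston", 250.0),
]

# The three keys act as the cube's coordinates; the measure is the cell value.
cube = {(date, product, location): amount
        for date, product, location, amount in rows}


def cell(date, product, location):
    """Return the sales in one cell, or 0.0 if no row exists for that cell."""
    return cube.get((date, product, location), 0.0)


print(cell("2024-01-01", "tea", "Boston"))
```

Each row fills exactly one cell, and any combination of coordinates with no row is simply an empty cell, which this sketch treats as zero sales.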

The challenge is going to be taking this data set and visualizing it.

I'm using an example of a cube with three dimensions, and

we're used to perceiving three dimensions, but we perceive three dimensions by projecting

them into two dimensions and looking at that.

In real world applications, you're going to have many more keys, and

this data cube is actually a data hypercube, and

you'll have many more than just three dimensions of data.

And the challenge is going to be taking this high dimensional data space,

this data hypercube, and

then finding the appropriate two dimensions in which to investigate, and

then you can always add more dimensions as glyphs or other visualization tools.

2:45

So the way we're going to reduce the dimensions of the data set is going to be

largely with data aggregation, and there's a large number of

database visualization tools based on data aggregation.

The common ones like Tableau, or SAP's Lumira, or

even the pivot tables that are available in Microsoft Excel,

are based on data aggregation as a way of simplifying a data set and

reducing the dimensions of variation by providing summaries of the data.

So, for example, here is some data.

We're plotting some quantitative value vertically as

we change some other quantitative value horizontally.

We've got an independent variable here and a dependent variable here.

This is a two-dimensional plot.

We could always project this onto a one-dimensional axis.

We can also summarize this data.

We can reduce these values, these measures changing across one axis,

into a single value by representing them by the mean value, or

the maximum value or the minimum value or the sum of these values.

The mean is just going to be the sum divided by the number of values

along that dimension.

And so we have various operators, sum, mean, median, minimum, maximum.

There are other statistical operators, like standard deviation, or variance, or

other characteristics you can use, that will simplify data and

basically remove one dimension.

In this case, we're reducing the change in value over one dimension

to a single value over no dimension.
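
A rough sketch of these operators, using only Python's standard library, might look like the following. The sample values are invented, and the choice of population variance and standard deviation (rather than the sample versions) is an assumption made here for simplicity.

```python
import statistics

# Hypothetical measure sampled along one dimension (values are invented).
values = [4.0, 8.0, 6.0, 2.0, 10.0]

# Each operator collapses the whole list into a single number.
aggregates = {
    "sum": sum(values),
    "mean": sum(values) / len(values),  # sum divided by the number of values
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
    "variance": statistics.pvariance(values),  # population variance
    "stdev": statistics.pstdev(values),        # population standard deviation
}

for name, value in aggregates.items():
    print(name, value)
```

Whichever operator we pick, the result is the same kind of thing: one value standing in for everything that varied along that dimension.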

We can also convert quantitative-continuous data into

quantitative-discrete data, or ordinal or nominal data.

And the count function does that.

For example, if we have these blue, red, and

green data points, there's three categories here.

They're being plotted over a two dimensional area that's going

to represent continuous quantitative data horizontally and vertically.

We can then remove those axes and just have a single axis representing the count,

the number of blue items, the number of red items, and the number of green items.

And that reduces the dimensionality of this data set,

from varying over two dimensions with a third category,

into just a category over essentially one dimension.
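
To make the count function concrete, here is a small sketch using `collections.Counter` from Python's standard library. The point coordinates are invented; only the three color categories come from the example above.

```python
from collections import Counter

# Hypothetical points: (x, y, category). Coordinates are invented.
points = [
    (0.5, 1.2, "blue"), (1.1, 0.7, "red"), (2.3, 2.9, "blue"),
    (3.0, 1.5, "green"), (0.9, 2.2, "red"), (1.8, 0.4, "blue"),
]

# Drop both continuous axes and keep only how many points fall in each category.
counts = Counter(category for _x, _y, category in points)

for category, n in counts.items():
    print(category, n)
```

The x and y positions are discarded entirely, so what remains is one number per category.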

And finally, there's binning, and binning is a discretization method that converts

quantitative continuous data into quantitative discrete data or

ordinal or nominal data, depending on how you use it.

For example, we may have our continuous function.

These are values that are varying over a single dimension.

6:01

And finally, we can use this to create a histogram.

And a histogram is binning, in this case,

in the vertical direction instead of the horizontal direction.

So instead of having bins over the dimension,

you're having bins over the value that you're plotting over that dimension.

In this case, we've separated the values into an A, B, C, D,

and E range and then we're counting up all the values in the A range,

all the values in the B range, all the values in the C range, and

all the values in the D range, and just plotting those counts.

And there's no values in the E range, so it would be zero for the E range.

And so with this histogram, when you choose these buckets,

the size of these bins gives you other characteristics of the data that

can reduce the dimensionality of the data, or at least reduce a continuous

variable into discrete categories of a continuous variable.
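
One possible sketch of this kind of histogram is below. It counts how many values land in each range, which is the usual histogram definition. The values, the equal bin width, and the starting point of range A are all assumptions made for illustration.

```python
# Hypothetical values and five equal-width ranges labelled A to E (all invented).
values = [0.5, 1.5, 1.7, 2.2, 2.4, 2.9, 3.1]
labels = ["A", "B", "C", "D", "E"]
low, width = 0.0, 1.0  # range A covers [0, 1), B covers [1, 2), and so on

histogram = {label: 0 for label in labels}
for v in values:
    index = int((v - low) // width)
    index = min(max(index, 0), len(labels) - 1)  # clamp out-of-range values
    histogram[labels[index]] += 1

for label in labels:
    print(label, histogram[label])
```

Changing `width` changes how coarse the summary is: wider bins give fewer, taller bars, and narrower bins keep more of the original detail.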

7:00

And so our data cubes can use these aggregation

operations to simplify a data set, and

that provides a useful tool for investigating the data set.

So, for example, we can take our data cube.

In this case, it has location horizontally, products vertically,

and time in depth.

If, for example, we want to look at all of our sales of tea,

coffee, espresso, and other products at different times,

but we don't care about the location, then we can project this data cube into,

7:41

basically, this square region, this two dimensional region.

So this is a two dimensional projection that's summing up all of

the sales at any given time of a given product regardless of location,

so it's summing them up over all the locations.

And that projection gives us something we can then visualize.

We can further summarize that by projecting this two-dimensional data cube into

a one-dimensional data cube by summing, in this case, over product.

We don't care about the differences between the products.

So if we don't want to differentiate the product, we can aggregate

the product axis, sum over that, and now we have a one dimensional,

basically just a list of numbers, that tells us the amount of sales for

the first quarter, second quarter, third quarter, or fourth quarter.

And if we don't care about the time,

then we can just look at the total sales over a given amount of time,

over a given set of products, over a given set of locations, add all of those up, and

that gives us a zero-dimensional data cube here.
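
The chain of projections from three dimensions down to zero can be sketched as repeated summing over one key at a time. The cell amounts and location names are invented, and `sum_over` is a hypothetical helper written for this sketch, not a function from any OLAP tool.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (location, product, quarter); amounts invented.
cube = {
    ("Chicago", "tea", "Q1"): 10,
    ("Chicago", "coffee", "Q1"): 30,
    ("Boston", "tea", "Q1"): 5,
    ("Boston", "coffee", "Q2"): 20,
    ("Chicago", "espresso", "Q2"): 15,
}


def sum_over(cells, axis):
    """Sum out one key position, returning a cube with one fewer dimension."""
    result = defaultdict(int)
    for key, amount in cells.items():
        reduced = key[:axis] + key[axis + 1:]  # drop the aggregated coordinate
        result[reduced] += amount
    return dict(result)


by_product_time = sum_over(cube, 0)     # 2-D: keyed by (product, quarter)
by_time = sum_over(by_product_time, 0)  # 1-D: keyed by (quarter,)
total = sum_over(by_time, 0)            # 0-D: a single value under the key ()

print(by_product_time)
print(by_time)
print(total)
```

Each call removes one dimension, so the same total is preserved at every level while the amount of detail shrinks.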

So here's an example using Tableau to show how aggregations are used for

data cube operations.

Here we have some dimensions of our data.

The data is collected over ten years,

so we have separate data by year, and over a variety of different countries.

9:17

And you'll see that we're averaging the data and we have a single data point.

We basically have zero-dimensional data.

It's all being projected to a single point, because we're averaging over all

these dimensions, specifically the time dimension over all the years that we

have data and over all the countries that we have data.

If we want to disaggregate this data,

then we can do that by basically dragging into this marks area.

For example, country.

And now we spread out the data.

Each one of these data points represents a different country, so

that these averages are just averages over the years that

the data was collected, and not over the country and the region.

We can further disaggregate over the year, and we get this plot,

which shows how each country is changing over the years.

If we want to see correspondences between countries,

then I can drag the country into color so

that each country gets its own color, so you can use the color to

follow each country across the years over which it's been disaggregated.

10:33

And we can start adding more and

more dimensions from our data cube into this visualization.
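
What dragging a dimension into the view does can be approximated in plain Python as a grouped average. This is only a stand-in for the idea, not Tableau's own mechanism or API. The records are invented, and `average_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (country, year, value). All values are invented.
records = [
    ("Kenya", 2001, 2.0), ("Kenya", 2002, 4.0),
    ("Peru", 2001, 6.0), ("Peru", 2002, 8.0),
]

POSITION = {"country": 0, "year": 1}


def average_by(rows, *fields):
    """Average the value, keeping only the named fields disaggregated."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[POSITION[f]] for f in fields)
        groups[key].append(row[2])
    return {key: mean(vals) for key, vals in groups.items()}


overall = average_by(records)                  # one point: everything averaged
per_country = average_by(records, "country")   # one point per country
per_country_year = average_by(records, "country", "year")  # fully disaggregated

print(overall)
print(per_country)
print(per_country_year)
```

With no fields named, everything collapses to a single point. Each field we add splits that point into more marks, the same way the plot spreads out as dimensions are dragged in.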

So when you're looking at an individual cell of these different forms

of data cubes, you can work from a less detailed view to a more detailed view.

A single cell, when it's aggregated, could represent a range of products,

a range of locations, a range of times.

Or it could represent the value at a specific location at a specific time.

In its disaggregated form, it's representing an actual data point.

In its aggregated form,

it's representing an aggregate of the values over these ranges.

So these could be the total sales of all products, over all markets, over all time.

11:24

And then you can start to drill down to these details by basically focusing,

for example, on a particular instance of time, or

on a particular product, or on a particular type.

And each time you do that, you're replacing an aggregated dimension

where you have a single value representing a range of one of these

dimensions with a specific product or specific data point