0:00

In this video, we discuss data reduction and

unsupervised learning, which are two essential concepts in cluster analysis.

A dataset is essentially a table where the variables, which are also called

features or attributes, are in the columns and the observations are in the rows.

This means that all the data values are in the body of the table.

The process of reducing the number of variables is known as dimensionality

reduction, while grouping observations is a form of data reduction.

Isolating the key variables in a dataset is important in order to build robust,

predictive models.

It turns out that often,

there is some degree of redundancy among the variables in a dataset.

And this is why it is possible to reduce the number of dimensions without losing

critical information.

Redundancy occurs when different attributes respond in similar ways to

some common underlying factor.

For instance, let's assume the human resource department of a company

creates an instrument to measure job satisfaction.

Employees are asked to rate seven statements using a scale from

one to seven, where one means that they strongly disagree with the statement and

seven means that they strongly agree.

Let's also assume that the statements in this survey are:

1. My supervisor treats me with consideration.

2. My supervisor consults me concerning important decisions that affect my work.

3. My supervisor gives me recognition when I do a good job.

4. My supervisor gives me the support I need to do my job well.

5. My pay is fair.

6. My pay is appropriate, given the amount of responsibility that comes with my job.

7. My pay is comparable to the pay earned by other employees whose jobs are similar to mine.

Let's suppose that the HR department wants to use the responses as seven separate

variables to predict intention to quit.

The problem with conducting the study as it is currently set up is

the redundancy among the predictor variables.

The seven items in the questionnaire are not really measuring seven different

constructs.

More likely, items one to four are measuring a single construct that could

reasonably be labeled satisfaction with supervision,

while items five to seven are measuring a different construct that could be

labeled satisfaction with pay.

2:27

These constructs could be identified with a technique called

principal component analysis or PCA for short.

This technique creates new variables as linear

combinations of the original variables.

These new variables are called principal components.

In our job satisfaction example,

a principal component analysis would identify two components.

PCA would transform the original seven values into two scores,

one for each component.

We don't show how these scores are calculated.

But, for example, in this case, the employee with ID 102274

seems to be more satisfied with supervision than with pay.
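To make the idea concrete, here is a small sketch of how PCA scores can be computed with NumPy. The ratings matrix is made up for illustration, and the component directions come from the singular value decomposition of the centered data; this shows the general technique, not the actual calculation behind the scores in the video.

```python
import numpy as np

# Hypothetical ratings from five employees on the seven survey items
# (columns 1-4: supervision items, columns 5-7: pay items).
X = np.array([
    [6, 7, 6, 7, 2, 1, 2],
    [5, 6, 6, 5, 3, 2, 2],
    [2, 1, 2, 2, 6, 7, 6],
    [3, 2, 2, 3, 5, 6, 7],
    [7, 6, 7, 6, 1, 2, 1],
], dtype=float)

# Center each column, then take the SVD; the rows of Vt are the
# principal component directions (linear combinations of the items).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the seven ratings onto the first two components, giving
# each employee two scores instead of seven values.
scores = Xc @ Vt[:2].T
print(scores.shape)  # (5, 2)
```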

Cluster analysis, on the other hand, is a data reduction technique

in the sense that it can take a large number of observations and

reduce them into a small number of identifiable groups.

Each of these groups can be interpreted more easily and

is represented by a centroid.

The scatter plot shows four clusters for the scores in the job satisfaction survey.

The stars represent the centroid of each cluster and

can be used to characterize all the observations in the group.

For instance, the gray cluster consists of employees with low job satisfaction and

is represented by average scores close to two.

Cluster analysis can achieve very significant data reductions

by transforming thousands or

even hundreds of thousands of observations into interpretable groups.
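The grouping step can be sketched with a plain implementation of k-means (Lloyd's algorithm), a clustering method built directly around centroids. The two-dimensional scores below are simulated around four made-up group centers; real scores would come from the survey or the PCA step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-dimensional scores (supervision, pay) for 40
# employees, scattered around four invented group centers.
centers = np.array([[2.0, 2.0], [6.0, 2.0], [2.0, 6.0], [6.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.4, size=(10, 2)) for c in centers])

def kmeans(X, k, n_iter=20, seed=1):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # Move each centroid to the mean of its assigned observations
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

centroids, labels = kmeans(points, k=4)
```

Each row of `centroids` plays the role of a star in the scatter plot: it summarizes every observation assigned to that group.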

4:30

The critical feature of this historical data is

that the classification of the observations is known, and

it is used to learn how to classify future observations.

Because this piece of information is available,

the process is known as supervised learning.

For instance, this table shows ten answers to the job satisfaction survey and

it also indicates whether or not the employee quit.

The two employees who quit had low ratings for the salary questions (five,

six, and seven) and mixed ratings for the supervisor questions (one through four).

A prediction model built on this data will fall in the category of

supervised learning, because the outcome that

the model is trying to predict is known in historical data.
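As a minimal illustration of supervised learning on labeled history, the sketch below uses a 1-nearest-neighbour rule: a new employee receives the quit/stay label of the most similar past employee. The ratings and labels are invented for the example, not the actual survey data.

```python
import numpy as np

# Hypothetical labeled history: seven ratings per employee plus a
# "quit" label (1 = quit, 0 = stayed), mimicking the table in the video.
X = np.array([
    [6, 6, 7, 6, 6, 7, 6],
    [5, 6, 5, 6, 6, 6, 5],
    [4, 3, 5, 4, 2, 1, 2],  # mixed supervision, low pay -> quit
    [6, 7, 6, 6, 5, 6, 6],
    [3, 4, 3, 5, 1, 2, 1],  # mixed supervision, low pay -> quit
], dtype=float)
y = np.array([0, 0, 1, 0, 1])

def predict_quit(new_ratings, X, y):
    """1-nearest-neighbour: copy the label of the closest past employee."""
    dists = np.linalg.norm(X - new_ratings, axis=1)
    return y[np.argmin(dists)]

# A new employee with low pay ratings resembles the employees who quit.
print(predict_quit(np.array([4, 4, 4, 4, 2, 1, 2]), X, y))  # 1
```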

In unsupervised learning, the observations in the historical data are not labeled.

That is, we don't know if an observation belongs to one group or another.

This means that we don't know how many different groups there are in

the population from which the dataset originated.

Discovering the number of groups is therefore

one of the main outcomes of the analysis.

For example, in a previous video,

we described how the market intelligence firm, Information Resources Incorporated,

conducted a cluster analysis of survey data to establish that the market of

natural and organic products consisted of seven distinct segments,

a number that was not known prior to the completion of the analysis.

Cluster analysis can also be applied to historical data that is labeled

with the purpose of finding new labels.

For example, in one study,

cluster analysis was used to categorize mutual funds based on

their financial characteristics instead of their investment objectives.

The historical data for

the study consisted of 904 different funds that fund managers had

classified into seven categories according to the investment objectives.

That is, the fund managers assigned a label to each fund and

decided there were seven possible labels.

However, a cluster analysis on financial variables related to the funds

concluded that there were only three distinct fund categories.

The reduction in the number of categories has significant benefits to

investors seeking to diversify their portfolios.

The study determined that the consolidated categories

were more informative about performance and

risk than the original seven categories created by the fund managers.

In terms of the data used, the analysts initially considered 28 financial

variables that were related to risk and return.

However, after applying principal component analysis, they found that 16 out

of the 28 variables were able to explain 98% of the variation in the dataset.

Therefore, they used only 16 variables for clustering, which, as we already mentioned,

resulted in three fund categories.

This example shows that dimensionality reduction and

data reduction complement each other.

As a matter of fact, it is a common practice to apply dimensionality

reduction techniques such as PCA before clustering.
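A minimal sketch of that practice, using simulated data in place of the 28 financial variables from the study: PCA picks the smallest number of components covering 98% of the variance, and those component scores would then feed the clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 28 risk/return variables: the columns
# are noisy mixtures of three underlying factors, so they are redundant.
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 28))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 28))

# PCA via SVD on the centered data.
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
explained = S**2 / (S**2).sum()

# Keep the smallest number of components covering 98% of the variance;
# the scores on these components would then be clustered.
k = int(np.searchsorted(np.cumsum(explained), 0.98)) + 1
print(k)  # 3, since this simulated data was built from three factors
```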