0:00
In the last video we show how the problem of dividing a set of observations
into clusters can be formulated as an optimization model.
Finding an optimum solution to the problem is complicated and
requires quite a bit of computational effort.
Think about this, in the example that we introduced in the previous video,
we had data on 300 customers of an online retailer.
The full optimization problem would amount to searching for
the best way of partitioning 300 objects into 3 none-empty subsets.
The number of choices is astronomical.
So, instead of solving that problem,
we simplify it by reducing the number of decisions to three.
In this reviews model we just needed to search for
three customers that were used as the centers of the clusters.
Even with this simplification the number of choices is over 4.4 million.
0:54
This is quite large for a problem with relatively few observations.
The bottom line is that the optimization approach has some severe limitations for
large datasets.
And this is why approximation methods such as hierarchical clustering and
K-means have become the most common procedures for cluster analysis.
XLMiner includes both of these methods.
If you are working toward the completion of this Introduction to
Business Analytics Specialization,
you have already used XLMiner in the second course and
you should be fairly familiar with either the Windows or the cloud version.
If for some reason you have not used XLMiner then you should first watch
the video Introduction to the Analytic Solver Platform.
In that video you will learn how to access the cloud version of the ASP,
that's the Analytic Solver Platform, or
to download the Excel version onto your computer.
So comeback here when you're ready to learn how to use XLMiner for clustering.
Let's start by opening the Excel file Market Segmentation- XLMiner.
The file contains the same data that we used in the previous video.
Recall that the data consists of demographic and purchasing metrics for
300 clients for an online retailer.
The retailer wants to use cluster analysis to identify three market segments.
Highlight the data table which is in the range from A3 to E303,
click on the XLMiner tab and go to the Data Analysis group of tools.
Click on Cluster and then choose K-Means Clustering.
The data range box should show the range A3 to E303.
The number of rows should be 300 and the number of columns should be 5.
Note that since we highlighted row three which contains the names of the variables,
the first row contains header box should be checked.
The variables in the input data box should list all the variables in our dataset.
Let's highlight all the variables except customer and
click on the arrow in the middle of the window.
This action transfers the variables to the selected variables box.
We now click on Next.
XLMiner offers a few choices to execute the K-mean algorithm.
First we can select whether or not to use normalized values.
We have already discussed the advantage of normalized values
in order to interpret results.
So we will check the Normalize Input data box.
3:28
Next we need to choose the number of clusters.
In this case, we want to find a solution with 3 clusters.
The number of iterations relates to
the number of times that the procedure is going to adjust this entrance.
We will use the default value of 10.
In the options area, XLMiner allows you to restart the K-Means procedure multiple
times, and we can also adjust the seed for the random number generator.
We will leave the default values,
that is we will use a Fixed start and a seed of 12345.
4:04
Click on Next and move to the Output Options.
We will leave both boxes checked and click on Finish.
XLMiner creates two worksheets, one named KMC_Output and
another one named KMC_Clusters.
Let us start by examining the output worksheet.
We scroll down to the cluster center section.
The original coordinates and the normalized coordinate tables,
show the centroids of the three clusters that the K-means procedure found.
To compare to what we found in the previous video
let's take a look at the normalized centroids.
We can see that the first cluster seems to be the one that we previously identified
as families, with normalized values around 0 except for the Number of Dependents.
The second cluster is the one that we identified as professionals.
The group members have fairly high income, they purchase frequently, and
they spend more money than the average.
The last cluster is the one that we labeled as students, because the group
is at least one standard should below the average in all attributes.
The results obtained by running the K-Means method on XLMiner match
what we generated by solving the optimization model using Solver.
The clusters are not exactly the same, but they lead to the same conclusions.
The KMC_Cluster worksheet contains the assignment of customers to clusters,
and the distances from customers to each cluster.
This information could be used to compare the membership of each cluster
between the solution that we found with the Solver in the previous video and
the solution that we just found with the XLMiner.
If we run the hierarchical clustering tool, we obtain similar results.
You might want to try this on your own.
However, I must warn you that the hierarchical clustering tool
in the XLMiner does not report the centroids of the clusters.
They had to be calculated from the cluster assignments.
So comparing the K-Means solutions with the hierarchical clustering solution
requires a little bit of extra work.
6:11
As you can see finding clusters using the XLMiner is straight forward.
However, before we go we need to address the question of how many clusters to
use in a cluster analysis.
There is basically no theory about how to find the right number of clusters.
In fact in some settings it might not be completely clear what the right number
of cluster means.
We know that the two streams are to either put every observation
in its own cluster or to put all observations in a single cluster.
The first option we have no predictive power,
because we will not have a cluster where to put a new observation.
The second option results in a trivial amount
that tells us nothing about new observations.
Most analysts would that agree with applying parsimony to cluster analysis.
Under this principle, we choose the smallest number of clusters
that generalize best to a new observation.
There are a few ways in which generalization could be measured.
For instance, we could measure how much the centroids will move if we
re-run the clustering method with the addition of new data.
We could also measure how much the cluster assignments
would change in the presence of new data.
Or, we could also check how much the total sum of squares would
change when assigning new data to the existing cluster.
In addition to these measures, we can also use a procedure where
we start with a small number of clusters and then we add one cluster at a time.
After each addition, we decide whether the new clusters have a more
meaningful interpretation than the clusters in the previous iteration.
Let's try this approach on our online retailer example.
At the moment we have a solution with three clusters.
We have interpreted three clusters and determined that there
are three identifiable markets, namely families, professionals and students.
If we use XLMiner to re-run K-means to find a solution with four clusters,
we find the normalized centroids shown in this table.
The first cluster is the one corresponding to families,
the second cluster corresponds to what we labeled as professionals.
So far nothing new has emerged compared to our solution with three clusters.
In fact, the centers for clusters 1 and 2 in this solution
are identical to the centers of cluster 1 and 2 in the solution with 3 clusters.
Now we examine clusters 3 and
4 and we observe that they are nearly identical to each other.
We compare the data summary tables for the solution with three clusters and
the solution with four clusters.
We can now clearly see what has happened.
Because we want a solution with four clusters,
we forced the third cluster in the solution with three clusters to be split.
The 100 observations in this cluster were split into 49 and
51 when we asked K-means to produce a solution with four cluster.
For this analysis it seems clear that the data
do not support the hypothesis of the existence of a fourth market.
The right number of clusters in this case seems to be three.
The goal of this module was to provide you with that good understanding of what
cluster analysis is, how this analysis is used in business and
how the analysis is performed.
You're now ready to apply this knowledge and the tools that we have discussed.
The assignment at the end of this module is a good as starting point to put cluster
analysis into action.