0:16

Also, the approach of using Solver for

the cluster analysis is only practical for datasets that are relatively small.

So we just want to show that it is possible to use Excel to approach

cluster analysis from the point of view of an optimization problem.

To make it more interesting we're going to show how to use Excel for

cluster analysis using an example.

As we mention before, companies are often interested in segmenting their customers

to better target a specific product offerings to meet specific customer needs.

Let’s assume that an online retailer has collected data

on three contents of her customers.

This data is in the Excel file Market Segmentation- Excel.

You might want to pause the video to locate the file and

open it so you can follow along or

just continue to watch the video and play with the model later, it's up to you.

Let us start in the original Data tab to examine what data was collected for

each of the three candy customers.

You can see that for each customer, there are two demographic variables,

income level and number of dependence as well as two buying behavior variables,

number of purchases made last year, and the average value of each purchase.

Let's assume that the online retailers wants to know

if the data supports the hypothesis that there are three segments in her market.

So we're going to use closer analysis to help her answer this question.

The first step in our analysis is to normalize the data.

Recall that normalization is a transformation of each variable

in the dataset.

In our table, the variables are in the columns label Income,

Average Purchase, Last year purchases and the Number of Dependents.

The normalization consists of subtracting the mean and

dividing by the standard deviation.

Click on the normalized Data tab to see the normalized values.

The normalized values have a mean of 0 and a standard deviation of 1.

Now, click on the Clusters tab to look at the model.

The model is based on the idea of choosing three customers to represent the centroids

of the three clusters.

Cells H5, H6, and H7 have the three customers that haven't chosen a centroids.

The arbitrary solution that is shown on the spreadsheet consists of customers 10,

20, and 30 as the centroids of the three clusters.

The center values in the table correspond to the normalized values for

the selected customer.

For example, if we change 10 to 15, we see that the central values change.

In this other table, the model uses Euclidean

distances to assign the customers to the closest cluster.

For example, we see that the first six customers are assigned to cluster 2.

Customer 7 is assigned to cluster 3 and customer 10 is assigned to cluster 1.

The model also calculates the total squared distance,

which is the sum of all the distances in column E.

Now we're ready to optimize, which is nothing more than searching for

the set of three customers that are the best centroids.

Clearly, the best set is the one that minimizes the total sum of a squares.

We don't want to do this by manually changing the values because believe it or

not, there are more than 4.4 million ways of choosing three customers

from a set of 300.

So we're going to use an Excel tool called Solver.

To access this tool, we got to the Analyze Group in the Data tab.

If Solver is not there, you need to first load it.

If you Google adding solver to Excel, you will find instructions on how

to add Solver to the different versions of Excel for Windows and for the Mac.

You can pause this video and come back here once you have loaded Solver.

Okay, if your Solver is ready to go, then click on it and

you will see that the model is all ready there.

This is a simple model for the set of objective,

SJ9 that is the objective function is the total sum of squares.

We want to minimize this value so Min is chosen.

The changing cells are the values that the solver can manipulate to minimize your

objective function.

In this case, there are only three cells, H5 to H7, which correspond

to the three customers that are going to be selected as centroids.

And then, there are three constraints related to the changing cells.

The values in the changing cells must be integer.

They must be greater than or equal to 1, and they must be less than or

equal to 300.

For reasons that go beyond the scope of this course for

this particular problem, we must choose the Evolutionary Solver.

You can now click on the Solve button.

The solution process starts, and

search progress is shown on the status bar at the bottom of the spreadsheet.

The solution time depends on the computer speed.

So if it is taking too long,

you can always press this K key to stop the search.

The Evolutionary Solver is a so called Metaheuristic.

What this means is that it cannot guarantee that when it stops,

it has found the very best solution.

Also called the optimal solution.

When the Solver stops, it simply reports the best solution that it could find.

It is states that the Solver can not improve the current solution.

We're going to press the Escape key and assert the solver solution.

Since we may end up with different solutions,

we're going to change the solution that we have to one that I found, and

that I would like to discuss with you.

The solution has customer 103, as the first Centroid,

customer 112 as the second, and customer 170 as the third.

Let's enter those values in the corresponding cells and

examine the normalized centroid.

6:04

The normalized values are very helpful in interpreting each

cluster because they are centered at 0.

So negative values are below average, and positive values are above average.

Let's interpret the first cluster.

The one that has customer 103 as the centroid.

The normalized values for this group are all negative.

The group is about one standard deviation below the average in all attributes.

Perhaps these are students given that their income is below average, they don't

buy very frequently, and they don't spend a lot of money when they do buy.

Also they don't seem to have dependents.

Moving to the second cluster,

represented by customer 112 we see that all the normalized values except for

dependent at at least one standard deviation above the average.

This group could be one of middle aged professionals with above average income

and average number of dependents.

They make frequent online purchases and

they spend above average in the items they buy.

Finally, the third cluster represented by customer 170 has an average income and

also makes purchases with a value that is close to the average.

However, the number of dependents that the members of this group

has is almost one standard deviation above average.

This group could very well represent families.

Therefore, a reasonable conclusion in this analysis

is that the data seems to indicate that the market for this online retailer

consists of three segments the students, families, and professionals.

We have shown how cluster analysis can be approached as an optimization problem.

We have mold the problem in a way that the search for the best set of clusters is

equivalent to searching for the best set of customers to represent each cluster.

In our model a customer that represent a cluster

becomes the centroid of the cluster.

To illustrate the process we use a market segmentation example.

This is the same example that we will use in the next video to illustrate how to do

cluster analysis with Excel minor.