0:22
Let's start with hierarchical clustering.
This technique is conceptually simple.
The process begins with placing each object in its own cluster.
For example, suppose that we have five objects labeled A, B, C, D and E.
In the first step, each of these objects is placed in its own cluster.
In subsequent steps, the method calculates the distance between
each pair of clusters and then merges the two that are closest to each other.
Let's suppose that, of all the possible pairs,
objects A and B are the ones that are closest to each other.
These objects are merged into a single cluster.
In the third step, the procedure merges the AB cluster with C.
And finally, the procedure stops after merging D and E.
The final result shows two clusters.
One with A, B, and C and the other one with D and E.
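To make this concrete, here is a minimal sketch in Python using SciPy's agglomerative clustering; the two-dimensional coordinates for A through E are invented for illustration and chosen so that the merges happen in the order just described.

```python
# A minimal sketch of the merging steps above, using SciPy's agglomerative
# (hierarchical) clustering. The 2-D coordinates for A-E are invented.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([
    [1.0, 1.0],   # A
    [1.2, 1.1],   # B  (closest to A, so A and B merge first)
    [2.0, 1.5],   # C  (joins the AB cluster next)
    [6.0, 6.0],   # D
    [7.0, 7.2],   # E  (D and E merge last)
])

# Each row of Z records one merge: the two clusters joined, the distance
# at which they merged, and the size of the new cluster.
Z = linkage(points, method="single")
print(Z)
```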
The procedure is simple, but it needs to know how to measure
the distance from one cluster to another.
It also needs to know when to stop merging clusters; otherwise,
it will always end with a single cluster containing all the objects.
You might recall that, in the previous video,
we introduced five measures to calculate the distance between a pair of clusters.
These measures were single linkage, complete linkage,
average linkage, average group linkage, and Ward's method.
Any of these measures can be used in hierarchical clustering.
Software packages allow you to choose which measure to use.
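As a sketch of what that choice looks like in software, the Python snippet below runs SciPy's hierarchical clustering with several linkage criteria on randomly generated data; mapping "average group linkage" to SciPy's "centroid" method is an assumption on my part.

```python
# Sketch: clustering the same made-up data under different linkage criteria.
# SciPy names single, complete, average, and Ward's method directly;
# "average group linkage" is assumed here to correspond to "centroid".
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 2))          # 20 made-up 2-D observations

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(data, method=method)
    print(f"{method:9s} height of final merge: {Z[-1, 2]:.3f}")
```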
In terms of when to stop, this is determined by the analyst.
Cluster analysis requires experimentation, and
often there is no single correct answer.
Experimenting with several distance measures and with different numbers of
clusters is common practice in order to find meaningful groupings in the data.
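One way to experiment, sketched below under the assumption that SciPy is available, is to build the hierarchy once and then cut the tree at several candidate numbers of clusters.

```python
# Sketch: build the hierarchy once, then cut it at several candidate
# numbers of clusters and compare the resulting groupings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
data = rng.normal(size=(20, 2))          # made-up observations

Z = linkage(data, method="ward")
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")   # k-cluster solution
    print(k, "clusters:", labels)
```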
We will address some of these issues in the last video of this module.
Hierarchical clustering may be represented by a two-dimensional diagram
called a dendrogram.
This diagram has the form of an inverted tree.
The horizontal axis represents the observations and
the vertical axis represents distance.
The lengths of the branches, represented by the vertical lines,
indicate the distance between the two clusters that are merging.
The diagram is read from the bottom to the top
by sweeping a horizontal line across the entire tree.
If we start at a distance of zero and move up,
the first merger that we find is the one of observations A and B.
At this point, we have a solution with six clusters:
one with A and B in it, and the other five each containing
a single observation, that is, C, D, E, F, and G.
The next merger, of the AB cluster with observation C,
occurs at a distance of 1.5.
Then observations D and E merge at a distance of 2,
and observations F and G merge at a distance of 2.5.
At this level, the dendrogram shows a solution with three clusters.
One with A, B, and C, another one with D and E,
and a third one with F and G.
A two-cluster solution is found at a distance of three.
Since the dendrogram contains all possible solutions,
it is only practical to examine it for datasets with relatively few observations.
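Here is a sketch of how such a dendrogram could be drawn with SciPy and matplotlib; the coordinates for observations A through G, and therefore the exact merge heights, are invented and will not match the distances quoted above.

```python
# Sketch: drawing a dendrogram for seven made-up observations labeled A-G.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([
    [0.0, 0.0], [0.5, 0.2],   # A, B
    [1.2, 0.8],               # C
    [4.0, 4.0], [4.8, 4.6],   # D, E
    [8.0, 1.0], [9.0, 1.8],   # F, G
])

Z = linkage(points, method="average")
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F", "G"])
plt.ylabel("distance")        # vertical axis: distance at which clusters merge
plt.xlabel("observations")    # horizontal axis: the observations
plt.show()
```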
Now let's talk about k-means.
The k-means method is very well known and widely used.
It is scalable, meaning that it can deal with very large datasets.
The overall idea is to assign objects to the nearest cluster,
where distance is measured from the object to the centroid of the cluster.
Recall that the centroid is the cluster average.
The value of k indicates the number of clusters.
The process starts with the selection of k observations as the centroids
of the initial clusters.
This selection is somewhat arbitrary.
But the procedure works better if the initial centroids are as far apart
as possible.
Then, the rest of the observations are assigned to the closest centroid.
Once the assignment is completed, the cluster averages are recalculated.
This recalculation could change the position of the centroids and
cause a reassignment of the observations.
These two steps are repeated until the centroids do not change.
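The loop just described can be written in a few lines; the sketch below is a bare-bones NumPy version and does not handle edge cases such as empty clusters.

```python
# A bare-bones sketch of the k-means loop: choose k seeds, assign every
# observation to its nearest centroid, recompute the centroids, and repeat
# until the centroids stop moving. Empty clusters are not handled here.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial seeds
    for _ in range(max_iter):
        # distance from every observation to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                 # nearest-centroid assignment
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):     # centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids
```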
So let's take a look at a simple example.
We want to divide seven individuals into two groups based on their age and
annual income.
To initialize the procedure, we select David and
Clara as the initial members of the two clusters.
Sometimes these initial objects are referred to as seeds.
Since there is only one member in each cluster, the centroids of these initial
clusters are simply the values corresponding to the age and income of David and Clara.
Since income and age have values on very different scales,
we use the normalized values to calculate Euclidean distances.
According to these calculations, we assign Ann and
George to David to form the red cluster.
The second cluster, colored in blue, has Clara, Bob, Erin and Frank.
The cluster centers are then recalculated, and they move to new positions.
The new centroid positions put Bob a little bit closer to the red centroid
than to the blue centroid, and therefore Bob is reassigned to the red cluster.
The centroids are recalculated once again and
this time no reassignments are possible.
The final clusters seem to be characterized by age,
with the red cluster consisting of all the people under 40 and
the blue cluster consisting of all the people over 40 years old.
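A sketch of this example in Python might look as follows; the ages and incomes are made up, since the exact numbers from the lecture are not listed here, but the normalization step and the two-cluster k-means call are the essential parts.

```python
# Sketch of the seven-person example with made-up ages and incomes.
# Standardizing puts age and income on comparable scales before the
# Euclidean distances to the centroids are computed.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

names = ["Ann", "Bob", "Clara", "David", "Erin", "Frank", "George"]
X = np.array([            # [age, annual income] -- hypothetical values
    [28, 42000], [45, 60000], [55, 80000], [32, 48000],
    [60, 75000], [48, 70000], [25, 39000],
])

X_std = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
for name, label in zip(names, km.labels_):
    print(f"{name:7s} cluster {label}")
```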
The beauty of k-means is that it is a very simple procedure.
The weakness is that the final clusters depend on the initial choices.
This is why cluster analysis software gives you the option of
running the procedure multiple times from seeds that are randomly chosen.
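In scikit-learn, for example, this corresponds to the n_init parameter; the sketch below compares a single random start with many starts on made-up data.

```python
# Sketch: running k-means from many random seeds and keeping the run with
# the lowest within-cluster sum of squares (the inertia_ attribute).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                  # made-up data

one_start = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X)
many_starts = KMeans(n_clusters=4, n_init=25, random_state=0).fit(X)
print("1 start  :", one_start.inertia_)
print("25 starts:", many_starts.inertia_)      # typically no worse than one start
```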
The process of creating clusters from a dataset can be viewed
as an optimization problem.
Both hierarchical clustering and k-means are procedures that find approximate
solutions to the problem of maximizing the similarity of the objects within each cluster.
The maximization must take into consideration that there must be k
clusters and that every object must belong to one and only one cluster.
This is not an easy optimization problem, and
this is the reason business analytics software employs
approximation procedures such as hierarchical clustering and k-means.
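For reference, one standard way to write this problem is the within-cluster sum of squares that k-means approximately minimizes, which is equivalent to maximizing the similarity of the objects within each cluster:

$$
\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
$$

subject to there being k clusters and every observation belonging to one and only one cluster.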