We can then create an elbow plot by plotting the r-square values for
each of the k = 1 to 9 cluster solutions,
to help us determine how many clusters to retain and interpret.
To do this, though, we first have to extract the r-square value from
the output for each of the 1 to 9 cluster solutions
and merge them together using the following code.
data clus1 tells SAS to create a dataset called clus1.
set cluststat1 tells SAS to use the output statistics dataset
from the cluster analysis for k=1 to create this dataset.
We then create a variable called nclust,
which will be the variable that identifies the value of k for the r-square,
so we will set nclust=1.
Next we select the r-square statistic by specifying that the _type_
variable equals 'RSQ', using quotes because _type_ is a string variable.
Finally, we keep the nclust variable and the variable over_all,
which is the variable that contains the actual r-square value.
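Putting those statements together, the data step for k=1 might look like the following sketch; it assumes the statistics output dataset from the k=1 cluster run is named cluststat1, as described above:

```sas
data clus1;               /* create a dataset called clus1 */
  set cluststat1;         /* output statistics dataset for the k=1 run */
  where _type_='RSQ';     /* select the r-square statistic row */
  nclust = 1;             /* identifies the value of k for this r-square */
  keep nclust over_all;   /* over_all contains the actual r-square value */
run;
```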
Then we do the same for K equals two through nine.
We'll then create one dataset called clusrsquare that
contains the r-square values for all nine cluster solutions
by stacking together these nine r-square datasets.
data clusrsquare is the name of our new dataset.
Next, we type set and list the nine datasets that we want to stack together.
Then we type run to run the code.
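That stacking step might be written like this, assuming the nine per-k datasets are named clus1 through clus9:

```sas
data clusrsquare;   /* one dataset with all nine r-square values */
  set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
run;
```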
Now we can plot the elbow curve using the clusrsquare data set,
with the gplot procedure.
The first line of code provides some display parameters for the plot.
color=blue tells SAS to plot the r-square values in blue,
and interpol=join
tells SAS to connect each of the plotted r-square values with a line.
Then we type proc gplot and the name of the dataset,
which is clusrsquare, followed by a semicolon.
In the next line of code, we type the plot statement with the name of the
variable that has the r-square values, over_all, on the y axis,
and the variable that has the number of clusters, nclust,
on the x axis, followed by a semicolon.
Then we type run to generate the plot.
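The plotting code narrated above can be sketched as follows; the symbol1 statement carries the display parameters for the plot:

```sas
symbol1 color=blue interpol=join;  /* blue points, connected by a line */
proc gplot data=clusrsquare;
  plot over_all*nclust;            /* r-square (y) by number of clusters (x) */
run;
```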
So, what this plot shows is the increase in the proportion of variance in
the clustering variables explained by each of the cluster solutions.
We start with the k=1 r-square, which is zero
because there's no clustering yet.
Then we can see that the two-cluster solution accounts for
about 20% of the variance.
The R-square value increases as more clusters are specified.
What we're looking for here is a bend in the elbow that
shows where the r-square value might be leveling off,
such that adding more clusters doesn't increase the r-square much.
We can see how subjective this is though.
There appear to be bends in the line at 2 clusters,
4 clusters, and again at 8 clusters.
To help us figure out which solution is best,
we should further examine the results for the 2, 4, and 8 cluster solutions
to see whether the clusters overlap,
whether the patterns of means on the clustering variables are unique and meaningful,
and whether there are significant differences between the clusters
on our external validation variable, GPA.