Dear Allstat,
I recently applied cluster analysis to seven clustering variables. My question to Allstat is how to choose the correct number of clusters to represent the data. I initially used Ward's method and plotted Ward's coefficient for the last 5 cluster solutions. These had percentage increases of 5, 5, 5, 10 and 27, which would indicate a 3-cluster solution, so I went with a 3-cluster solution. The clusters were distinct: ANOVA and Tukey tests showed that the means of the seven clustering variables differed significantly across all 3 clusters. I also examined the clusters on a set of external variables using a chi-squared test; 10 of the 15 variables showed differences. So I left it at that.
Six months later I came back to this because I want to validate my solution, that is, use a different method to choose the number of clusters. I used the silhouette index, the Davies-Bouldin index and the Dunn index, and all of them indicated that the 2-cluster solution was the one to choose. Again, the clusters were distinct. However, with the 4-cluster solution there were similarities between the means of the clusters. Also, approximately 11 of the 15 external variables showed differences for the 2-cluster solution.
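As a rough illustration of how the average silhouette width compares candidate solutions, here is a minimal Python sketch. The function name `mean_silhouette`, the toy points and the two labelings are all hypothetical; they are not the data or the software from the analysis described above, and Euclidean distance is an assumption.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels):
    """Average silhouette width over all points (Euclidean distance assumed)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster gets width 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Hypothetical toy data: two tight groups, labelled as 2 and as 3 clusters.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_cluster_labels = [0, 0, 0, 1, 1, 1]
three_cluster_labels = [0, 0, 1, 2, 2, 2]

for name, labels in [("2 clusters", two_cluster_labels),
                     ("3 clusters", three_cluster_labels)]:
    print(name, round(mean_silhouette(toy_points, labels), 3))
```

A higher average width means the clusters are more compact and better separated under this index; the index does not use the external variables at all.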
My question to Allstat is: which solution is the correct one to go with, 2 clusters or 3, and for what reasons?