Hi everyone,
Thank you to those who responded to my question! It was a great help.
A summary of the answers follows.
Regina
Original Question:
Should I remove highly-correlated variables when I perform clustering
analysis? Does it make a difference if I do not? Is there a rule of thumb on
how high the correlation should be in order to remove the variable?
Quotes from the actual answers:
1) From a statistical view, my guess is that you need to transform to
uncorrelated variables (PCA, perhaps) or to use a Mahalanobis distance
measure. The possibility of dimension reduction from the PCA step is an
added attraction.
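For concreteness, here is a minimal sketch of that suggestion in Python
(my own illustration, assuming numpy and scikit-learn are available; the
data are simulated): PCA with whiten=True produces uncorrelated,
unit-variance scores to cluster on.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Simulated data with two strongly correlated variables.
    X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=200)

    # whiten=True rescales each component to unit variance, so the
    # transformed variables are uncorrelated and equally weighted.
    Z = PCA(whiten=True).fit_transform(X)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)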
2) No, you should not remove highly correlated variables - the correlation
might be due to the occurrence of clusters! See the recent article on
spurious correlations (in JRSS (A), if I remember correctly). If you cannot
find the article, I can look back through my journals.
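The point that clusters can themselves induce correlation is easy to
demonstrate with a small simulation (my own sketch, assuming numpy):

    import numpy as np

    rng = np.random.default_rng(1)
    # Two spherical clusters; within each, the variables are independent.
    a = rng.normal([0, 0], 1.0, size=(200, 2))
    b = rng.normal([5, 5], 1.0, size=(200, 2))
    pooled = np.vstack([a, b])

    print(np.corrcoef(a[:, 0], a[:, 1])[0, 1])            # near 0
    print(np.corrcoef(pooled[:, 0], pooled[:, 1])[0, 1])  # large, roughly 0.85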
3) Having recently done a load of clustering where some of the variables
were highly correlated, we made the decision to use principal components
instead of the raw variables (we used enough PCs to explain approximately
90% of the variance). This meant that we were not throwing variables out,
but were still able to have more confidence that one or two highly
correlated variables were not over-influencing the clusters.
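In scikit-learn, that "enough PCs for roughly 90% of the variance" rule can
be expressed directly; a sketch of my own with simulated data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    # Simulated matrix where the last five columns nearly duplicate the first five.
    X = rng.standard_normal((300, 10))
    X[:, 5:] = X[:, :5] + 0.1 * rng.standard_normal((300, 5))

    # A fractional n_components keeps just enough PCs to reach that
    # share of the total variance (about 90% here).
    pca = PCA(n_components=0.90)
    scores = pca.fit_transform(X)
    print(scores.shape[1], pca.explained_variance_ratio_.sum())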
4) A great deal depends on the substantive nature of your data and the
reason you are clustering. Usually the variables are assumed to be fairly
independent. In many contexts, it is common to use some form of factor
scores. I know of no rule of thumb for what counts as "highly correlated".
Clustering has a great deal of art to it. You would want to try several
approaches to see if the results are very different. You might also use
something like discriminant function analysis, remembering that the tests
as such lose most of their meaning when you use the same variables as those
in the clustering. Much of the output listing from a package like SPSS can
help give you insight into the different results.
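One way to make "try several approaches" concrete is to cross-check two
clusterings against each other. This sketch (my own; the adjusted Rand
index as the agreement measure is my addition, not something the answer
named) compares k-means with an agglomerative method:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    hc = AgglomerativeClustering(n_clusters=2).fit_predict(X)

    # 1.0 means the two methods recovered the same partition.
    print(adjusted_rand_score(km, hc))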
5) We do a fair amount of clustering, and we almost always have correlated
data. We generally use principal components to calculate factors and come
to some sort of decision on how many factors to use; I would tend to keep
more rather than fewer, for a reason I will come to later. Then we run a
non-hierarchical k-means cluster analysis on the factor scores (because we
generally have large sample sizes). Obviously you need to use factors that
make sense to the client or whoever, and I would keep relatively more
rather than fewer because it has been argued that the true discriminators
of your population are more likely to lie in the lower factors than in the
first few that come out. Using k-means you need to check for robustness. A
quick and easy, though not necessarily entirely accurate, way of doing this
is to reorder your data and then re-run the cluster analysis; if you get
the same solution, the clusters are relatively robust.
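A sketch of that workflow and the reorder-and-rerun check (my own
illustration: component scores stand in for the factor scores, and the
adjusted Rand index quantifies "the same solution"):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (500, 6)), rng.normal(2, 1, (500, 6))])

    # k-means on the component scores rather than the raw variables.
    scores = PCA(n_components=4).fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

    # Robustness check from the answer: permute the row order, re-run,
    # and see whether each case lands in the same cluster as before.
    perm = rng.permutation(len(scores))
    relabels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(scores[perm])
    print(adjusted_rand_score(labels[perm], relabels))  # near 1.0 if robust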
6) Cluster analysis doesn't have the prior conditions that linear models
do, so the variables don't have to be normally distributed or independent,
and they can be correlated. So it doesn't matter whether your variables are
correlated or not. But it will affect the kind of distance that you use in
a later step. If you are using a Euclidean distance, it assumes that the
variable values are uncorrelated with one another. As this assumption will
not be justified in most applications, an alternative is to use the
Mahalanobis distance. Multivariate Data Analysis by Hair et al. and Cluster
Analysis by Everitt give examples of this.
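For the distance point, scipy provides both measures; a minimal sketch of
my own, with the inverse covariance estimated from the simulated data:

    import numpy as np
    from scipy.spatial.distance import euclidean, mahalanobis

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)

    # Mahalanobis weights by the inverse covariance, so correlated
    # directions are not double-counted as they are by Euclidean distance.
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    print(euclidean(X[0], X[1]))
    print(mahalanobis(X[0], X[1], VI))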