Hello all,
I am posting a summary of answers to my query about clustering binary
variables. I want to thank all who responded. Your help is greatly
appreciated!!
Regina
Original Email:
I would really appreciate recommendations with respect to tackling the
following problem: clustering binary variables. What is more appropriate:
factor analysis, principal component analysis,...
From:
Art
Social Research Consultants
University Park, MD USA
Msg:
You might want to increase the reliability of constructs by creating scales,
e.g., by some form of factor analysis (PCA, PFA).
Since the input to either form of factor analysis is a correlation matrix,
you could start by eliminating one of each pair that correlates over some
cutoff such as .8.
Be careful about the number of factors you extract.
You really shouldn't get substantively different results from the two
approaches. It still seems unusual to have so many dichotomous items without
some planned structure.
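A minimal sketch of the correlation-cutoff step mentioned above, in plain Python. The item names and responses are hypothetical, and the rule for which member of a highly correlated pair to drop (keep the first one seen) is an assumption, since the message does not say which one to eliminate.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (the phi coefficient for 0/1 data)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant item: correlation is undefined, treated as 0 here (assumption)
    return sxy / sqrt(sxx * syy)

def drop_redundant(items, cutoff=0.8):
    """Greedy filter: keep an item only if |r| <= cutoff with every item already kept."""
    kept = []
    for name, values in items.items():
        if all(abs(pearson(values, items[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: each key is an item, each list holds 0/1 answers from 12 respondents.
items = {
    "q1": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "q2": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1],  # differs from q1 for one respondent
    "q3": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
}
print(drop_redundant(items))
```

With these made-up responses q2 correlates with q1 at about 0.85, so it is dropped and q1 and q3 remain. The result depends on the order of the items, which is one reason to treat the cutoff as a rough screening step only.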
From:
Mike Procter
Msg:
Different strokes ...
There are a number of main decisions you have to make, and no one else can
make them for you. Here are some examples.
1. How do you define closeness in some sense between two variables? It
might be one of umpteen measures of association, or it might be some kind of
correlation. As an example of the kind of decision here, Pearson's
correlation can only reach unity if the marginals are identical, whereas
Yule's Q has a different property.
2. How do you define closeness between an existing cluster and a potential
new member (nearest neighbour, weighted mean square, etc.)?
3. What is your stopping rule? (Some methods are non-agglomerative,
so they don't have one.) Often the decisions will depend on substantive
(i.e. not merely statistical) assumptions about the data.
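To illustrate point 1, here is a small sketch comparing Pearson's correlation (phi, for two 0/1 variables) with Yule's Q on the same 2x2 table. The two variables are hypothetical; the formulas are the standard ones, phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) and Q = (ad - bc) / (ad + bc).

```python
from math import sqrt

def two_by_two(x, y):
    """Cell counts (a, b, c, d) = (both 1, x only, y only, both 0)."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def phi(x, y):
    """Pearson correlation for two binary variables (no guard for constant variables)."""
    a, b, c, d = two_by_two(x, y)
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def yules_q(x, y):
    """Yule's Q (no guard for tables where ad + bc is zero)."""
    a, b, c, d = two_by_two(x, y)
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical variables: every case with x = 1 also has y = 1,
# but y is more common than x, so the marginals differ.
x = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 0, 0, 0, 0]

print(round(phi(x, y), 3), yules_q(x, y))
```

Here Q equals 1 because one off-diagonal cell is empty, while phi stays near 0.58 because the marginals are unequal. Which behaviour counts as "close" is exactly the kind of substantive decision described above.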
Go and read a book, such as EVERITT, B.S. (1993) Cluster Analysis (3rd Ed).
London: Edward Arnold. You could even pick Brian's brains. However, in
case you should visit his website, don't make any firm decisions until you
see a RECENT mugshot.
Oh, the quick answer to your question is almost certainly 'no'!
From:
Rafael Perera
Msg:
Adding a bit to what Mike Procter said. I think what you need to do is
some form of cluster analysis. As he mentions you need to define how you
are planning to measure distance between cases (particularly if you are
using hierarchical methods).
For the case of binary data two measures of distance come to mind: simple
matching and Jaccard's coefficient.
In simple matching you give the same weight to shared presences and shared
absences (i.e. two cases count as matching whether both have the attribute,
"both are alive", or both lack it, "both are dead"). For Jaccard's
coefficient you are mainly interested in the cases "having" one particular
attribute and do not care if both "do not have it". So you count a match
when they both have 1's (both smoke) but ignore the attribute when both have
0's (both do not smoke). These two coefficients are commonly used by computer
scientists so a quick search in Google will surely direct you to definitions
and articles that use these measures.
Unfortunately I have just checked a commercial stats package (SPSS) and it
does not have these distances as an option. Nevertheless I am sure that
S-plus and R have at least the simple matching option.
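For anyone who wants to compute the two coefficients by hand, here is a minimal sketch in plain Python. The two cases are hypothetical, and the value returned when neither case has any attribute is an assumed convention. Both functions return similarities; a distance for clustering can be taken as 1 minus the similarity.

```python
def simple_matching(u, v):
    """Share of attributes on which the two cases agree (1-1 and 0-0 both count)."""
    agree = sum(1 for p, q in zip(u, v) if p == q)
    return agree / len(u)

def jaccard(u, v):
    """Share of attributes present in both cases among those present in at least one."""
    both = sum(1 for p, q in zip(u, v) if p == 1 and q == 1)
    either = sum(1 for p, q in zip(u, v) if p == 1 or q == 1)
    return both / either if either else 0.0  # assumed convention when neither has any attribute

# Hypothetical cases described by six binary attributes.
case_a = [1, 0, 0, 0, 0, 1]
case_b = [1, 0, 0, 0, 1, 0]

print(round(simple_matching(case_a, case_b), 3))  # agreement on 4 of 6 attributes
print(round(jaccard(case_a, case_b), 3))          # 1 shared attribute out of 3 present
```

The shared zeros raise the simple matching value (about 0.67) but are ignored by Jaccard (about 0.33), so the choice matters most when the attribute is rare.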
From: Dave
Msg:
The quick answer is neither: principal component analysis and factor
analysis are primarily used with continuous data. Binary data can often be
summarised using a contingency table and correspondence analysis.
Alternatively you can use Multi-Dimensional Scaling (MDS). See
http://www.statsoft.com/textbook/stathome.html for more details.
From: Thierry Decae
Msg:
You can use normal clustering procedures such as FASTCLUS in SAS.
You first have to create dummy variables coded 1 and 0 for each level.
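The dummy-coding step can be sketched as follows. This is plain Python rather than SAS, and the variable and its levels are hypothetical; it only shows what "one 1/0 column per level" means before the data go into a clustering procedure.

```python
def dummy_code(values):
    """Return the sorted levels and one 0/1 indicator row per observation."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categorical variable with three levels.
colour = ["red", "blue", "red", "green"]
levels, rows = dummy_code(colour)

print(levels)
for row in rows:
    print(row)
```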
From: Adaikalavan Ramasamy
Msg:
I suspect you will get a lot of answers saying PCA because it is a proven
technique with well understood properties. 90-95 % of all statisticians
rubbish factor analysis because it makes an arbitrary choice of rotation that
is not well explained. FA seems to prevail in the psychology community. A
Google search will probably find you the pros and cons of these two methods.