Dear all,
The general consensus in reply to my email (below) was that, since my
variables are measured in the same units, a covariance-based PCA would be
preferable. However, it was suggested that both covariance-based and
correlation-based PCA could be tried and the results compared. I have copied all
replies to my email below for your perusal.
Many thanks for your replies,
Kim.
ORIGINAL EMAIL:
Hello all
I am going to do a principal components analysis. I know that we can
standardise the variables (correlation-based PCA) to make them 'equally
important'.
I have read that
1) if we omit standardisation, a variable which varies a lot would tend to
dominate the principal components, and
2) standardisation to make variables equally important is suggested when the
variables are measured in different units.
My question is: if my variables are all measured in the same units (say
each variable contained scores for n people and each score could take one of
5 values 1 to 5), would it still be OK to do a correlation-based PCA (i.e.
using the covariance matrix of standardised variables) or would covariance-based
PCA (i.e. using the covariance matrix of unstandardised variables) be more
appropriate?
Many thanks, in advance for your help,
All the Best,
Kim.
REPLIES:
Ahh, that's always a good one for debate over morning coffee... I
don't think there is a clear-cut, always-appropriate answer - in the
social sciences you often get data on '1 to 5 scales' (sometimes with
labels, sometimes not); the question is, in the minds of the people
answering, is a one-point difference on one scale the same as a
one-point difference on another scale? If the questions (and answer
options) are phrased the same way, quite possibly, and an unstandardised
analysis looks appropriate. If the questions are about quite different
things, or the answer options are quite different (say, a question
about how important something is, compared to how much they would
like/dislike something), the respondents may be using your five
categories quite differently, and it might be better to standardise.
Often what I tend to do is to try both, and hope that the same pattern
emerges regardless.
Hope this helps
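
The "try both" suggestion can be sketched as follows. This is only an
illustration, assuming NumPy and simulated data: the four 1-5 items, the
single latent trait behind them, and the helper first_pc are all
hypothetical and not taken from anyone's real questionnaire.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 4
# Hypothetical questionnaire: one latent trait drives four 1-5 items.
latent = rng.normal(size=(n, 1))
noise = rng.normal(scale=0.7, size=(n, p))
X = np.clip(np.rint(3 + latent + noise), 1, 5)

def first_pc(matrix):
    """Unit-length eigenvector for the largest eigenvalue of a symmetric matrix."""
    _, vectors = np.linalg.eigh(matrix)  # eigenvalues come back in ascending order
    return vectors[:, -1]

pc_cov = first_pc(np.cov(X, rowvar=False))        # covariance-based PCA
pc_corr = first_pc(np.corrcoef(X, rowvar=False))  # correlation-based PCA

# Absolute cosine between the two loading vectors: 1 means the same direction
# (the sign of an eigenvector is arbitrary, hence the absolute value).
similarity = abs(float(pc_cov @ pc_corr))
print("covariance loadings: ", np.round(pc_cov, 2))
print("correlation loadings:", np.round(pc_corr, 2))
print("similarity:", round(similarity, 3))
```

When the items have similar variances, as in this simulated case, the two
sets of loadings should point in nearly the same direction; a low similarity
on real data would be a sign that the choice matters.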
*****************
Hello,
You can choose either one in this situation.
But if it is possible, could you please
calculate both of them and give us the general result?
Thank you.
*****************
Kim,
If your variables are in different units then it is strongly recommended
that you standardise; otherwise the variables with the largest variances will
tend to dominate, which is not normally what you want. This is equivalent to
finding the eigenvalues
of the correlation matrix as opposed to the covariance matrix.
In your case I would not standardise as all your variables are in the same
units. The danger of standardising in this case is that the importance of
variables which show relatively little variation can be inflated. Having
said that, there is of course nothing to stop you trying both options.
Best regards,
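
The equivalence mentioned above (standardising first is the same as working
with the correlation matrix) can be checked numerically. A minimal sketch,
assuming NumPy and made-up 1-5 scores; the data and variable names are
hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 respondents, 4 items scored 1 to 5.
X = rng.integers(1, 6, size=(200, 4)).astype(float)

# Standardise each column: subtract the mean, divide by the sample SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardised = np.cov(Z, rowvar=False)
corr_of_raw = np.corrcoef(X, rowvar=False)

# The two matrices agree, so their eigenvalues (the PC variances) agree too.
eig_std = np.linalg.eigvalsh(cov_of_standardised)
eig_corr = np.linalg.eigvalsh(corr_of_raw)
same_matrix = np.allclose(cov_of_standardised, corr_of_raw)
same_eigenvalues = np.allclose(eig_std, eig_corr)
print(same_matrix, same_eigenvalues)
```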
**************
Hi Kim
My opinion is that it is better to do it on unstandardised data in this case
(i.e. covariance option, not correlation), so that the natural variance in
the variables counts directly in the PCA. If a variable doesn't vary much
on the 1-5 scale, then it won't influence the PCA so much, whereas if you
standardise it, it will count just as much as the others.
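
To illustrate the point about a variable that hardly varies, here is a small
sketch, assuming NumPy and invented data: two hypothetical items use the full
1-5 scale and a third is almost always answered 3. It compares each item's
share of the total variance under the two options.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical 1-5 items: two use the full scale, the third is nearly constant.
wide_a = rng.integers(1, 6, size=n)
wide_b = rng.integers(1, 6, size=n)
narrow = rng.choice([3, 4], size=n, p=[0.9, 0.1])
X = np.column_stack([wide_a, wide_b, narrow]).astype(float)

S = np.cov(X, rowvar=False)       # covariance option
R = np.corrcoef(X, rowvar=False)  # correlation option

# Share of the total variance (the trace) contributed by each item.
share_cov = np.diag(S) / np.trace(S)
share_corr = np.diag(R) / np.trace(R)
print("covariance shares: ", np.round(share_cov, 3))
print("correlation shares:", np.round(share_corr, 3))
```

Under the covariance option the near-constant item contributes only a small
part of the total variance being decomposed, whereas under the correlation
option every item contributes exactly 1/p of it by construction.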
******************
If the variables are in the same units,
covariance-based PCA is perfectly sensible,
and it may have the consequence you describe.
I would have thought the decision on whether
that was a feature or a nuisance should be driven
by the scientific or practical problem. If you
want to _use_ the first PC (e.g.) it's important
to be clear which you prefer. If you want
merely some insight into the structure of variation
the conclusions may be very similar.
The only clearcut rule here would seem to be
the negative one that covariances based on
different units don't make sense.
Please summarise to the list as requested.
********************
Hi Kim,
I would suggest that you try all three methods and decide from the
results. Standardisation is not an exact science, as yet, and it is
usually the results that determine the type of standardisation to use
and not the other way round. However, from your e-mail it would appear
that standardisation by subtraction of the mean and then dividing by the
standard deviation may be the most appropriate method.
*******************
Kim,
Extract from my lecture notes:
So far we have been talking of computing principal components from the
variance-covariance matrix $\BSigma$. It is, however, quite common
to find the principal components from the correlation matrix $\matr P$,
which effectively amounts to normalizing all the variates to have unit
variance before finding principal components. Choosing to do this
involves a definite but nevertheless arbitrary decision to make the
variables `equally important'. Note that the diagonal elements of a
correlation matrix are all unity, so that
\[ \operatorname{tr}(\matr P)=p \quad\text{and hence}\quad \sum_{i=1}^p\lambda_i=p. \]
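
The identity above is easy to confirm numerically. A minimal sketch, assuming
NumPy and arbitrary simulated data (the data and names are hypothetical, not
part of the lecture notes):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
# Hypothetical data: 100 observations on p correlated variables.
X = rng.normal(size=(100, p)) @ rng.normal(size=(p, p))

P = np.corrcoef(X, rowvar=False)     # p x p correlation matrix
eigenvalues = np.linalg.eigvalsh(P)  # eigenvalues of a symmetric matrix

trace_P = np.trace(P)                # diagonal is all ones, so this is p
eigen_sum = eigenvalues.sum()        # sum of eigenvalues equals the trace
print(trace_P, eigen_sum)
```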
The inventors of Genstat say that, ``Some people prefer to use
correlations most of the time \dots Our own preference is to use
correlations only when there is a very good reason for doing so, rather
than the reverse.'' (Digby \textit{et al.}, 1992, Section 5.4). On the
other hand, Everitt and Dunn (1991, Section 4.2) say that, ``Although
the derivation of principal components given above has been in terms of
eigenvalues and eigenvectors of the covariance matrix $\matr S$, it is
much more usual in practice to derive them from the corresponding
quantities of the correlation matrix $\matr R$. The reasons are not
difficult to appreciate if we imagine a set of multivariate data where
the variables $x_1$, $x_2$, \dots, $x_p$ are of completely
different types, for example lengths, temperatures, blood pressures,
anxiety ratings, etc. In such a case the structure of the principal
components derived from the covariance matrix will depend upon the
essentially arbitrary choice of units of measurement; additionally if
there are large differences between the variances of $x_1$, $x_2$,
\dots, $x_p$ those variables whose variances are largest will tend to
dominate the first few principal components. [An example illustrating
this is given in Jolliffe (1986, Section 2.3).] Extracting the
principal components as the eigenvectors of $\matr R$ which is equivalent
to calculating the principal components from the original variables
after each has been standardized to have unit variance overcomes this
difficulty. It is important to realize, however, that the eigenvalues
and eigenvectors of $\matr R$ will generally not be the same as those of
$\matr S$; indeed, there is rarely any simple correspondence between the
two and choosing to analyse $\matr R$ rather than $\matr S$ involves a
definite but possibly arbitrary decision to make the variables `equally
important' ''.
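
The dependence on units described in this quotation can be seen in a small
sketch, assuming NumPy and invented measurements: the lengths, temperatures
and the helper first_pc are hypothetical. The same lengths are expressed in
metres and then in centimetres, and the first principal component is computed
both ways.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Hypothetical measurements: lengths in metres and related temperatures.
length_m = rng.normal(1.7, 0.1, size=n)
temp_c = 20 + 2 * rng.normal(size=n) + 30 * (length_m - 1.7)

def first_pc(matrix):
    """Unit-length eigenvector for the largest eigenvalue of a symmetric matrix."""
    _, vectors = np.linalg.eigh(matrix)  # eigenvalues come back in ascending order
    v = vectors[:, -1]
    # Eigenvector signs are arbitrary: make the largest component positive.
    return v * np.sign(v[np.argmax(np.abs(v))])

X_m = np.column_stack([length_m, temp_c])
X_cm = np.column_stack([length_m * 100, temp_c])  # same data, length in cm

cov_pc_m = first_pc(np.cov(X_m, rowvar=False))
cov_pc_cm = first_pc(np.cov(X_cm, rowvar=False))
corr_pc_m = first_pc(np.corrcoef(X_m, rowvar=False))
corr_pc_cm = first_pc(np.corrcoef(X_cm, rowvar=False))

print("covariance PC1, metres:     ", np.round(cov_pc_m, 3))
print("covariance PC1, centimetres:", np.round(cov_pc_cm, 3))
print("correlation PC1, metres:     ", np.round(corr_pc_m, 3))
print("correlation PC1, centimetres:", np.round(corr_pc_cm, 3))
```

The covariance-based component changes direction when the unit changes,
because the variable with the larger numerical variance takes over, while the
correlation-based component is unaffected by the rescaling.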
Krzanowski (1988, Section 2.2) says, ``As a general guideline, however,
it would seem sensible to standardize first whenever the measured
variables show marked differences in variances, or wherever they concern
very different measured entities or units. Unstandardized data are
preferable whenever the measured variables are comparable both with
respect to units and variances, or when a simple and straightforward
data plot is the sole purpose of the analysis.''
Seber (1984, Section 5.2) says, ``Some practitioners maintain that
$\matr R$ should always be used instead of $\BSigma$ or $\matr S$,
as $\matr R$ does not depend on the scales used for the original
variables. This approach would seem reasonable in psychological and
educational studies where the scales may be arbitrary and the data
little more than ranks. However, the distribution theory associated
with $\matr R$ is much more complex and there are some problems in
interpreting the coefficients given by the eigenvectors \dots About all
that can be said is that similar variables should be measured in the
same units where possible.''
Gnanadesikan (1977, page 12), quoted with approval by Seber
(\textit{ibid.}), says that there, ``does not seem to be any
\textit{general} elementary rationale to motivate the choice of scaling
of the variables as a preliminary to principal components analysis on
the resulting covariance matrix.''
A useful discussion can also be found in Jolliffe (1986, Section 2.3).
Regards,
*************
Kim,
As you say, if the variables are measured on different units of measurement,
then that constitutes a definite requirement to work with the standardised
variables (which give the correlation matrix, rather than the covariance
matrix). (By the way, PCA using the correlation matrix is often loosely
called factor analysis, although strictly the two are different methods).
However, if the variables are all based on a Likert scale for different
questions in the same questionnaire, it seems quite reasonable to me to
let those that contribute more of the variance to the total score be
more influential in determining the factors. This line of reasoning would,
of course, suggest PCA on the covariance matrix. The problem is that, by
convention in this type of area, people tend to use factor analysis, i.e.
to work with the correlation matrix.
I would tend to use PCA and argue in its favour in the paper that you
submit.
****************