Dear all,
Please find below some additional replies to my PCA queries which I sent
to the list recently. These have been kindly sent to me by Ian Jolliffe.
-----Original Message-----
From: ian jolliffe
Sent: 10 February 2005 11:59
Subject: PCA queries
Kim
I recently came across a couple of your queries re. PCA on allstat.
I repeat them below.
------------------------------------------------------------------------
Dear all,
A quick query on PCA.
I know that we get good PCA results when the variables are highly
correlated...but can we enter variables which 'tell the same story'
e.g.
Variable 1
'Have you produced an academic paper in the last year?' (Yes/No),
Variable 2 'what was your highest contribution to any academic paper in
the last year?' (None/Minor/Moderate/Major)
Could these 2 variables both be entered into a PCA?
Many thanks,
Kim.
------------------------------------------------------------------------
I would never enter an ordinal variable as a single numerical variable,
e.g. 1, 2, 3, 4, unless I was confident that the 4 categories are
equally spaced, in which case it's not really an ordinal variable. If
order really was all you knew, then 1, 2, 10, 13 or any other set of 4
ordered numbers is as plausible as 1, 2, 3, 4, and depending on your
choice you get different results.

I believe the only way to sensibly include these variables in a PCA is
to code Variable 2 as 3 dummy variables, e.g.

y1 = 1 for none, 0 otherwise
y2 = 1 for minor, 0 otherwise
y3 = 1 for moderate, 0 otherwise.

y1 is then equivalent to your variable 1 so you don't need to include
variable 1 separately.
------------------------------------------------------------------------
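The coding point in the reply above can be illustrated with a minimal
sketch. It assumes NumPy; the responses, the second variable `other`
and the helper `pc1_share` are all made up for illustration and are not
from the original exchange. It scores the four categories two ways,
shows that the PCA result changes with the scoring, and then builds the
three dummy variables.

```python
import numpy as np

# Hypothetical responses for Variable 2, stored as category indices:
# 0 = None, 1 = Minor, 2 = Moderate, 3 = Major.
levels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
# A second, made-up numeric variable, present only so there is something
# for the ordinal variable to correlate with.
other = np.array([1.0, 2.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0])

def pc1_share(scores):
    """Fraction of total variance on PC1 (correlation matrix) for one scoring."""
    x = np.asarray(scores, dtype=float)[levels]  # map categories to scores
    R = np.corrcoef(x, other)                    # 2 x 2 correlation matrix
    eigvals = np.linalg.eigvalsh(R)              # eigenvalues, ascending
    return eigvals[-1] / eigvals.sum()

share_equal = pc1_share([1, 2, 3, 4])     # equally spaced scores
share_uneven = pc1_share([1, 2, 10, 13])  # another order-preserving scoring
print(share_equal, share_uneven)          # the two scorings disagree

# Dummy coding: y1 = none, y2 = minor, y3 = moderate; 'major' is the
# category left out.
dummies = (levels[:, None] == np.arange(3)).astype(float)
y1 = dummies[:, 0]

# Assumption: answers are consistent, so Variable 1 (Yes = 1, No = 0)
# is exactly 1 - y1 and carries no extra information.
variable1 = 1.0 - y1
print(np.corrcoef(variable1, y1)[0, 1])
```

Any other order-preserving scoring can be passed to `pc1_share` in the
same way; only the dummy coding avoids having to pick one.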
Hi all
A quick query about PCA.
If we enter variables A,B,C.....Z into a PCA... say that, for PC1,
variables X, Y and Z had coefficients with large absolute
values....and we find that PC1 accounts for 40% of the total
variation in the "original data". Am I correct in thinking that this
means PC1 accounts for 40% of the total variation in the full original
data set, A,B,C,.....,Z?
Is it also safe to say that, since X, Y and Z have large absolute values
for PC1 that *they* account for most of the total variation in the full
original data set?
Many thanks,
Kim
------------------------------------------------------------------------
First question: yes, where 'Total variation' = sum of variances of all p
variables A, B, C, .....

Second question: no, but it also depends on whether you're using a
correlation or covariance matrix. If a correlation matrix is used, all
variables have unit variance and the proportion of variance accounted
for by 3 variables is 3/p, regardless of what the PCs look like. For
covariance matrices the variation in X, Y, Z is
var(X) + var(Y) + var(Z). For all three to appear with large
coefficients in the first PC I think they have to be highly correlated;
therefore the variance of the first PC is likely to be substantially
bigger than the sum of variances because of the covariance terms.

Ian Jolliffe
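
The quantities in the reply above can be checked numerically with a
minimal sketch. It assumes NumPy and simulated data in which the first
three columns stand in for X, Y and Z; the data, seed and variable names
are illustrative assumptions, not part of the original exchange. The
last step splits var(PC1) into its variance terms (weighted by squared
coefficients) and its covariance terms, which is one way of looking at
the role of the covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6

# Made-up data: columns 0-2 (standing in for X, Y, Z) share a common
# factor, so they are highly correlated; columns 3-5 are pure noise.
common = rng.normal(size=(n, 1))
data = rng.normal(size=(n, p))
data[:, :3] += 2.0 * common

S = np.cov(data, rowvar=False)       # covariance matrix, p x p
R = np.corrcoef(data, rowvar=False)  # correlation matrix, p x p

# First question: share of PC1 in the total variation, where total
# variation = sum of the variances of all p variables = trace(S).
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
pc1_var = eigvals[-1]
total_variation = np.trace(S)
pc1_share = pc1_var / total_variation

# Second question, correlation matrix: every variable has unit variance,
# so any three variables carry 3/p of the total variation.
share_xyz_corr = np.trace(R[:3, :3]) / np.trace(R)

# Covariance matrix: var(PC1) = a' S a, split into variance terms
# (squared coefficients times variances) and covariance terms.
a = eigvecs[:, -1]                    # PC1 coefficients, unit length
variance_terms = np.sum(a**2 * np.diag(S))
covariance_terms = a @ S @ a - variance_terms

print(pc1_share, share_xyz_corr)
print(pc1_var, variance_terms, covariance_terms)
```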