Dear all,
Many thanks for your emails re. PCA. I am attaching the replies I
received (below).
On a different point, I'd like to ask for some more advice:
In a data set, I have variables (ordinal and binary) where a number of
values within each variable are 'not applicable'. Could you advise me
on which methods I could use to deal with such data?
Is it OK to simply code the 'not applicables' as a separate category
within these variables and recode the levels as dummy variables? For
example:
1) say if we had a binary variable with levels 'yes','no','not
applicable', could we recode this as:
Yes 0 1
No 1 0
N/A 0 0
2) Say if we had an ordinal variable with levels 0, 1, 2 and 'not
applicable', could we recode as:
(0) 1 0 0 0
(1) 0 1 0 0
(2) 0 0 1 0
(N/A) 0 0 0 0
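For what it's worth, this all-zero coding for 'not applicable' can be
produced mechanically; here is a minimal sketch in plain Python (the
responses are made up for illustration), matching the column order of
example 1 above:

```python
# Levels that get their own indicator column; 'not applicable' is
# deliberately omitted so it becomes the all-zero reference pattern.
levels = ["no", "yes"]
responses = ["yes", "no", "n/a", "yes"]  # made-up data

# One 0/1 indicator per retained level: yes -> (0, 1), no -> (1, 0),
# n/a -> (0, 0), as in example 1 above.
coded = [[int(r == lev) for lev in levels] for r in responses]
print(coded)  # [[0, 1], [1, 0], [0, 0], [0, 1]]
```

The same idea extends directly to the ordinal case in example 2: list
the applicable levels, omit 'not applicable', and it falls out as the
all-zero row.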
Alternatively, if it made sense, for a particular variable, could the
'not applicable' be coded as 'no' or 'none'?
E.g. for the question 'how much did you spend on holiday', a person who
did not go on holiday would enter 'not applicable' - in other words
he/she spent 'nothing' - so could we code it as:
(Nothing/0) 0
(minor) 1
(moderate) 2
(major) 3
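If that reading is defensible for a given variable, the recoding is a
one-line mapping; a sketch in Python with made-up category labels:

```python
# Hypothetical recoding: 'not applicable' folded into the lowest
# category, on the argument that it genuinely means 'spent nothing'.
levels = {"n/a": 0, "nothing": 0, "minor": 1, "moderate": 2, "major": 3}
spend = ["n/a", "major", "minor", "nothing"]  # made-up responses
coded = [levels[s] for s in spend]
print(coded)  # [0, 3, 1, 0]
```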
Many thanks for your views on this; the literature seems to ignore this
subject, so any advice is appreciated.
Kim
***********************************************
***********************************************
Original PCA message:
Dear all
I would like to ask 2 questions regarding PCA.
1) I have n objects with p variables recorded for each of the n
objects. How small can p be in order to get an adequate PCA?
2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of the
variables with relatively large coefficients. Does it always have to be
the case that all p variables have large coefficients somewhere within
the retained k PCs or is this not necessary? For example, say if we
retained 3 PCs:
PC1 PC2 PC3
X1 -0.81 -0.16 0.0009
X2 -0.03 0.66 0.009
X3 -0.58 0.0008 0.02
X4 0.001 0.74 0.0009
X5 0.058 0.0008 0.75
X6 0.02 0.008 0.66
Here, all 6 variables have large coefficients somewhere within the 3
retained PCs. Does this always have to be the case?
Many thanks,
Kim
********************
REPLIES:
Dear Kim,
1) A PCA of two variables is equivalent to a two-variable correlation:
if the correlation between those two variables is high, it may be
indicative of a relationship. I would say that a PCA's adequacy isn't
decided by the number of variables but by the data collected and the
hypotheses you are working with.
2) The PCs are all orthogonal - i.e. the variation in the observed data
that each principal component accounts for is unique to that component.
The PCs are ordered, with PC1 accounting for the highest amount of
variation in your data. The eigenvalues of the correlation matrix can be
used to calculate the proportion of variation that each PC accounts for.
Most statistical packages should output these values as well. You will
often find that the first few PCs account for 80-95% of the variation in
your data, and it follows that the remaining components may not be
usefully interpreted even if they have large coefficients.
Hope this makes sense and helps you.
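As a footnote to the reply above, the eigenvalue calculation it
describes can be sketched in a few lines of numpy (the data here are
randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # made-up data: 200 objects, 4 variables
R = np.corrcoef(X, rowvar=False)       # correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]  # eigenvalues, largest first
prop = eigvals / eigvals.sum()         # proportion of variance per PC
print(prop.cumsum())                   # cumulative proportion explained
```

The eigenvalues of a p-variable correlation matrix sum to p (its trace),
which is why dividing by their sum gives proportions of variance.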
*****************
No, it does not. If a variable has poor communality with the other
variables, it may not have a large coefficient except on one of the
lower factors (those with eigenvalues less than one). However, since PCA
accounts for all the variability in the data set, if you keep all the
components, each variable must have at least one high coefficient on a
factor, or moderate coefficients on several factors.
*********************
On 22-Nov-04 K F Pearce wrote:
> I would like to ask 2 questions regarding PCA.
>
> 1) I have n objects with p variables recorded for each of the n
> objects. How small can p be in order to get an adequate PCA?
Well, it doesn't make much sense for p=1, but doing PCA with p=2 could
make perfect sense!
> 2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of
> the variables with relatively large coefficients.
> Does it always have to be the case that all p variables have large
> coefficients somewhere within the retained k PCs or is
> this not necessary? For example, say if we retained 3 PCs:
> PC1 PC2 PC3
> X1 -0.81 -0.16 0.0009
> X2 -0.03 0.66 0.009
> X3 -0.58 0.0008 0.02
> X4 0.001 0.74 0.0009
> X5 0.058 0.0008 0.75
> X6 0.02 0.008 0.66
>
> Here, all 6 variables have large coefficients somewhere within the 3
> retained PCs. Does this always have to be the case?
No. There's no reason to expect this. The coefficients (with standard
PCA output) will be such that their squares sum to 1; you can check this
in your example above.
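That unit-length property of the coefficient vectors is easy to verify
numerically; a quick numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))   # made-up data
R = np.corrcoef(X, rowvar=False)
_, vecs = np.linalg.eigh(R)     # columns are the PC coefficient vectors
# each coefficient vector has squared entries summing to 1
print((vecs ** 2).sum(axis=0))  # all ones
```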
So it's possible for each coefficient in a PC to be equal to 1/sqrt(p);
this would correspond to the mean of the p variables being one of the
PCs.
At least one of the coefficients must be at least equal to 1/sqrt(p),
but there's no necessity for them to be different (however, if one is
larger than this, then the maximum possible size amongst the remainder
is correspondingly less). Taking an example with p=4 (for simplicity)
you could have
PC1: 1/2 1/2 1/2 1/2
PC2: 1/2 1/2 -1/2 -1/2
PC3: 1/2 -1/2 1/2 -1/2
PC4: 1/2 -1/2 -1/2 1/2
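These four vectors really are orthonormal (they are half a 4x4
Hadamard-style pattern), as a quick numpy check confirms:

```python
import numpy as np

# the four equal-coefficient PCs from the p=4 example above
P = 0.5 * np.array([[1,  1,  1,  1],
                    [1,  1, -1, -1],
                    [1, -1,  1, -1],
                    [1, -1, -1,  1]])
print(P @ P.T)  # identity matrix: unit length and mutually orthogonal
```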
All sorts of patterns are possible. A large coefficient simply indicates
that a variable is a main contributor to the PC.
In your example, large coefficients occur in the pattern:
PC1: C . C . . .
PC2: . C . C . .
PC3: . . . . C C
(where "C" means large coefficient, "." small) showing that
PC1 is mainly the sum of X1 and X3,
PC2 is mainly the sum of X2 and X4
PC3 is mainly the sum of X5 and X6
whereas in the preceding 4-variable example all variables contribute
equally to each PC but in different patterns of
reinforcement/cancellation (+/-).
It's more important to look at the proportions of the total variance
which are accounted for by the different PCs.
Hoping this helps,
**************************
> -----Original Message-----
> From: A UK-based worldwide e-mail broadcast system mailing list
> [mailto:[log in to unmask]]On Behalf Of K F Pearce
> Sent: 22 November 2004 15:28
> To: [log in to unmask]
> Subject: PCA
>
>
> Dear all
>
> I would like to ask 2 questions regarding PCA.
>
> 1) I have n objects with p variables recorded for each of the n
> objects. How small can p be in order to get an adequate PCA?
I don't know what "adequate" means to you. p = 2 could make perfect
sense in some problems. On the other hand, in many problems PCA with two
variables could be a pure waste of time.
> 2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of
> the variables with relatively large coefficients. Does it always have
> to be the case that all p variables have large coefficients somewhere
> within
> the retained k PCs or is this not necessary? For example, say if we
> retained 3 PCs:
> PC1 PC2 PC3
> X1 -0.81 -0.16 0.0009
> X2 -0.03 0.66 0.009
> X3 -0.58 0.0008 0.02
> X4 0.001 0.74 0.0009
> X5 0.058 0.0008 0.75
> X6 0.02 0.008 0.66
>
> Here, all 6 variables have large coefficients somewhere within the 3
> retained PCs. Does this always have to be the case?
There is an art and a science to this: in many problems I would
"retain", meaning take seriously, a PC with a large eigenvalue and (as
far as possible) an interpretation that makes sense scientifically.
But there are situations in which all of the PCs are important (as a
restructuring of the information in the data).
John Gower in one of his papers argued, as I recall, that PCs with near
zero eigenvalues could be interesting because they show characteristics
of the data that are nearly invariant. In practice convincing examples
seem few.
To put it another way, PCA is a technique that can be applied in lots of
different styles to different problems. I doubt that there are single
simple answers to either of your questions.
************************
Dear Kim,
1. A usual rule of thumb is to have at least 10 times as many subjects
(n) as variables (p). p has no minimum, but of course if you are
interested in clustering variables into a smaller number of principal
components (factors), you need more variables than factors.
2. The interpretation of the factor pattern (the pattern of loadings) is
arbitrary. That is why this matrix is rotated. There are different
rotation methods, like varimax, oblimin etc. Each has its own advantages
and disadvantages. I suggest that you consult a textbook on factor
analysis.
Best wishes,
*************************
Hi,
Don't, off the top of my head, know the answer to your first question,
but for your second, the answer is "no". Generate some artificial data
consisting of six-dimensional vectors whose first two components are
independent and normally distributed with mean zero and standard
deviation 1. Choose the remaining four components independently, each
normally distributed with mean zero and standard deviation 0.2.
If you do PCA on such data you will find that two of the singular values
are much larger than the other four and that the corresponding vectors
come close to spanning the same subspace as is spanned by (1, 0, 0, 0,
0, 0) and (0, 1, 0, 0, 0, 0).
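That experiment is quick to run; here is a numpy sketch of the
construction described above (the sample size is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000                                      # arbitrary sample size
X = np.empty((n, 6))
X[:, :2] = rng.normal(0.0, 1.0, size=(n, 2))  # two components, sd 1
X[:, 2:] = rng.normal(0.0, 0.2, size=(n, 4))  # four components, sd 0.2
X -= X.mean(axis=0)                           # centre before PCA
s = np.linalg.svd(X, compute_uv=False)        # singular values, descending
print(s)  # the first two singular values dwarf the other four
```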
Hope that helps,
*****************************************
Kim,
1. You would rarely perform PCA with less than 4 variables because you
can visualise such data readily using conventional plots.
2. No, not all variables need to have a high loading on any of the
retained PCs. A variable which carries little information, or
information largely independent of the other variables, may behave like
this.
Best regards,
********************************************