Here is a summary of the responses to my PCA query, listed below.
It appears that while the statements are not incorrect, there is
some debate as to how accurate they are.
The main point is that there is no absolute rule for discarding principal
components that explain small proportions of the variance.
Using the 1/n * 100% criterion may therefore be justifiable in
some circumstances.
Many thanks again to everyone who responded.
Eric Grist
-------------------------------------------------------------------
The original query was:-
Suppose there are n principal components associated with an n-variable
multivariate data set (i.e. where n > 1). It has been
quoted that:-
Statement 1
If any principal component explains greater than 1/n * 100% of the variance,
then that principal component by itself accounts for a "significant amount
of the variance" in the data.
Statement 2
The "justification" is that if the data consist entirely of white noise,
then each principal component would be expected to explain only 1/n * 100%
of the variance. If more than that is explained by any principal component,
this suggests that that principal component is "explaining some
structure" in the data.
Clearly this is related to the problem of deciding what criterion
to use in discarding "insignificant" principal components.
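The criterion in the query can be sketched in a few lines. This is a hypothetical illustration, not from any of the respondents: the data, sample size, and induced correlation are all invented, and PCA is done via an eigendecomposition of the correlation matrix so that the eigenvalues sum to n and the 1/n threshold applies directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 4 variables, 500 observations, with some
# correlation deliberately induced between the first two variables.
n = 4
X = rng.standard_normal((500, n))
X[:, 1] += 0.8 * X[:, 0]

# PCA via the eigendecomposition of the correlation matrix.
# The eigenvalues sum to n, so eigval/n is the proportion of
# variance explained by the corresponding component.
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]      # descending order
explained = eigvals / eigvals.sum()

# The 1/n * 100% criterion from the query.
threshold = 1.0 / n
for k, p in enumerate(explained, start=1):
    flag = "keep" if p > threshold else "discard?"
    print(f"PC{k}: {100 * p:5.1f}%  ({flag})")
```

With the induced correlation, the first component explains well over 1/n * 100% = 25% of the variance, so the rule would retain it; as the responses below stress, this should be treated as a rough screen, not a test.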
-------------------------------------------------------------------------
(1)
Deciding how many principal components are 'significant'
is usually based on ad hoc rules, or on tests which are
often founded on dubious assumptions. There is a very
large literature out there. Starting points are the books
by J E Jackson (1991) A user's guide to principal components,
Wiley, or by me (1986) Principal component analysis, Springer.
However, there is still a lot of stuff appearing, which I hope
to summarise when I write the second edition.
In the meantime, your statements are OK, provided that you
treat them as approximations to reality and don't rely on
them alone to determine the number of useful components.
Ian Jolliffe
--------------------------
(2)
These statements are NOT accurate. If the data consisted entirely of white
noise, then the eigenvalues of the covariance matrix would not all be equal
to one, even approximately. (Each eigenvalue, divided by the sum of all the
eigenvalues, gives the proportion of variance accounted for by the
corresponding eigenvector.)
Allan
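A quick simulation illustrates the point made here (and elaborated in response (3) below): for pure white noise, the population eigenvalues are all exactly equal, yet the estimated eigenvalues always scatter around that value, so the largest sample eigenvalue exceeds the 1/n benchmark through sampling variability alone. The sample size and dimension here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure white noise: n = 4 uncorrelated standard-normal variables.
# The population correlation matrix is the identity, so every
# population eigenvalue is exactly 1 (each PC explains exactly 1/n
# of the variance).
n, obs = 4, 100
X = rng.standard_normal((obs, n))

R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]      # descending order

# The sample eigenvalues sum to n (the trace of R), but are never
# all equal: the largest is above 1 and the smallest below 1, even
# though there is no structure in the data at all.
print(eigvals)
```

So a first component that explains somewhat more than 1/n * 100% of the variance is exactly what one expects from noise alone, which is why the criterion overstates its case.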
----------------------------
(3)
The statements are not accurate. The noise referred to also causes the
ESTIMATED principal components not to be equal to one another even when the
UNDERLYING PCs are. (The `any' PC is always interpreted as the largest of
the PCs, or of the remaining PCs).
Further:
Consider a very large set of 4-variate data. The smallest eigenvalue, equal
to, say, 0.15, is, depending on the context, both statistically significant
and substantively important.
`Significance' requires a definition specific to the context and should not
rely on the nebulous convention that might have been adopted as a default in
a very different setting.
Nick Longford
DMU, Leicester
-------------------------------
(4)
Eric,
This is not quite what you asked, but it is related: components explaining
less than 1/n * 100% of the variance can often contain useful information
- I have found this in a number of biological datasets.
Steve Langton
MAFF Central Science Lab
Sand Hutton