JiscMail - Email discussion lists for the UK Education and Research communities

ALLSTAT Archives
allstat@JISCMAIL.AC.UK



Subject: PCA summary & 'not applicables' coding query
From: K F Pearce <[log in to unmask]>
Reply-To: K F Pearce <[log in to unmask]>
Date: Wed, 24 Nov 2004 09:46:08 -0000
Content-Type: text/plain
Parts/Attachments: text/plain (273 lines)

Dear all,

Many thanks for your emails re. PCA. I am attaching the replies I
received (below).

On another (different) point I'd like to ask for some more advice:

In a data set, I have variables (ordinal and binary) where a number of
values within each variable are 'not applicable'. Could you advise me
on which methods I could use to deal with such data?

Is it OK to simply code the 'not applicables' as a separate category
within these variables and recode the levels as dummy variables? For
example:

1) Say if we had a binary variable with levels 'yes', 'no', 'not
applicable', could we recode this as:

Yes 0 1
No 1 0
N/A 0 0

2) Say if we had an ordinal variable with levels 0, 1, 2 and 'not
applicable', could we recode as:

(0) 1 0 0 0
(1) 0 1 0 0
(2) 0 0 1 0
(N/A) 0 0 0 0

Alternatively, if it made sense, for a particular variable, could the
'not applicable' be coded as 'no' or 'none'?

E.g. for the question 'how much did you spend on holiday', a person who
did not go on holiday would enter 'not applicable' - in other words
he/she spent 'nothing' - so could we code it as:

(Nothing/0) 0
(minor) 1
(moderate) 2
(major) 3
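
In code, the coding schemes above would look something like this minimal
sketch (pandas assumed; the variable names, levels and data are
hypothetical):

  # Sketch of the recodings described above (pandas assumed).
  import pandas as pd

  df = pd.DataFrame({
      "smoker":   ["yes", "no", "not applicable", "yes"],   # binary + N/A
      "severity": ["0", "1", "2", "not applicable"],        # ordinal + N/A
  })

  # Schemes 1) and 2): one indicator column per observed level; dropping
  # the 'not applicable' columns leaves N/A as the all-zero reference row.
  dummies = pd.get_dummies(df).drop(
      columns=["smoker_not applicable", "severity_not applicable"])
  print(dummies.astype(int))

  # Alternative: recode N/A as 'none'/0 where that is substantively
  # justified, e.g. holiday spending for someone who did not go away.
  spend_map = {"not applicable": 0, "minor": 1, "moderate": 2, "major": 3}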

Many thanks for your views on this; the literature seems to ignore this
subject, so any advice is appreciated.

Kim
***********************************************
***********************************************
Original PCA message:
Dear all

I would like to ask 2 questions regarding PCA.

1) I have n objects with p variables recorded for each of the n
objects. How small can p be in order to get an adequate PCA?

2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of the
variables with relatively large coefficients. Does it always have to be
the case that all p variables have large coefficients somewhere within
the retained k PCs or is this not necessary? For example, say if we
retained 3 PCs:
        PC1      PC2      PC3
X1    -0.81   -0.16     0.0009
X2    -0.03    0.66     0.009
X3    -0.58    0.0008   0.02
X4     0.001   0.74     0.0009
X5     0.058   0.0008   0.75
X6     0.02    0.008    0.66

Here, all 6 variables have large coefficients somewhere within the 3
retained PCs. Does this always have to be the case?

Many thanks,
Kim

********************

REPLIES:

Dear Kim,

1) A PCA of two variables is equivalent to the correlation between those
two variables; if that correlation is high, it may be indicative of a
relationship. I would say that a PCA's adequacy isn't decided by the
number of variables but by the data collected and the hypotheses you are
working with.

2) The PCs are all orthogonal, i.e. the variation in the observed data
that each principal component accounts for is unique to that component.
The PCs are ordered, with PC1 accounting for the largest amount of
variation in your data. The eigenvalues of the correlation matrix can be
used to calculate the proportion of variation that each PC accounts for;
most statistical packages output these values as well. You will often
find that the first few PCs account for 80-95% of the variation in your
data, and it follows that the remaining components may not be usefully
interpreted even if they have large coefficients.
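
As a minimal sketch of that calculation (numpy assumed; X below is a
hypothetical n-by-p data matrix standing in for your own data):

  # Proportion of variance per PC from the correlation-matrix eigenvalues.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 6))           # hypothetical n-by-p data matrix

  R = np.corrcoef(X, rowvar=False)        # p-by-p correlation matrix
  eigvals = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, largest (PC1) first

  prop_var = eigvals / eigvals.sum()      # proportion of variance per PC
  print(np.round(np.cumsum(prop_var), 3)) # cumulative proportion explained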

Hope this makes sense and helps you.

*****************
No, it does not. If a variable has poor communality with the other
variables, it may not have a large coefficient except on one of the
lower factors (with eigenvalues less than one). However, since PCA
accounts for all the variability in the data set, if you keep all the
components, each variable must have at least one high coefficient on a
factor or moderate coefficients on several factors.
*********************
On 22-Nov-04 K F Pearce wrote:
> I would like to ask 2 questions regarding PCA.
>
> 1) I have n objects with p variables recorded for each of the n
> objects. How small can p be in order to get an adequate PCA?

Well, it doesn't make much sense for p=1, but doing PCA with p=2 could
make perfect sense!

> 2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of
> the variables with relatively large coefficients.
> Does it always have to be the case that all p variables have large
> coefficients somewhere within the retained k PCs or is
> this not necessary? For example, say if we retained 3 PCs:
> PC1 PC2 PC3
> X1 -0.81 -0.16 0.0009
> X2 -0.03 0.66 0.009
> X3 -0.58 0.0008 0.02
> X4 0.001 0.74 0.0009
> X5 0.058 0.0008 0.75
> X6 0.02 0.008 0.66
>
> Here, all 6 variables have large coefficients somewhere within the 3
> retained PCs. Does this always have to be the case?

No. There's no reason to expect this. The coefficients (with standard
PCA output) will be such that their squares sum to 1; you can check this
in your example above.
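
A quick check of that claim against the loadings quoted above (numpy
assumed; the values are only as precise as the rounding in the table):

  # Each PC's coefficients should have squared values summing to 1
  # (only approximately here, because the quoted loadings are rounded).
  import numpy as np

  loadings = np.array([   # rows X1..X6, columns PC1..PC3, as quoted above
      [-0.81,  -0.16,   0.0009],
      [-0.03,   0.66,   0.009 ],
      [-0.58,   0.0008, 0.02  ],
      [ 0.001,  0.74,   0.0009],
      [ 0.058,  0.0008, 0.75  ],
      [ 0.02,   0.008,  0.66  ],
  ])
  print((loadings ** 2).sum(axis=0))   # approximately [1, 1, 1]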

So it's possible for each coefficient in a PC to be equal to 1/sqrt(p);
this would correspond to the mean (of the p variables) being one of the
PCs.

At least one of the coefficients must be at least 1/sqrt(p) in absolute
value, but there's no necessity for them to be different (however, if one
is larger than this, then the maximum possible size amongst the remainder
is correspondingly less). Taking an example with p=4 (for simplicity)
you could have

  PC1: 1/2 1/2 1/2 1/2
  PC2: 1/2 1/2 -1/2 -1/2
  PC3: 1/2 -1/2 1/2 -1/2
  PC4: 1/2 -1/2 -1/2 1/2
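
A small check, assuming numpy, that these four PCs are indeed orthonormal:

  # Each row has unit length and distinct rows are orthogonal,
  # so P @ P.T is the 4x4 identity matrix.
  import numpy as np

  P = 0.5 * np.array([
      [ 1,  1,  1,  1],   # PC1
      [ 1,  1, -1, -1],   # PC2
      [ 1, -1,  1, -1],   # PC3
      [ 1, -1, -1,  1],   # PC4
  ])
  print(P @ P.T)          # 4x4 identity matrix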

All sorts of patterns are possible. A large coefficient simply indicates
that a variable is a main contributor to the PC.
In your example, large coefficients occur in the pattern:

  PC1: C . C . . .
  PC2: . C . C . .
  PC3: . . . . C C

(where "C" means large coefficient, "." small) showing that
  PC1 is mainly the sum of X1 and X3,
  PC2 is mainly the sum of X2 and X4
  PC3 is mainly the sum of X5 and X6

whereas in the preceding 4-variable example all variables contribute
equally to each PC but in different patterns of
reinforcement/cancellation (+/-).

It's more important to look at the proportions of the total variance
which are accounted for by the different PCs.

Hoping this helps,

**************************
> -----Original Message-----
> From: A UK-based worldwide e-mail broadcast system mailing list
> [mailto:[log in to unmask]]On Behalf Of K F Pearce
> Sent: 22 November 2004 15:28
> To: [log in to unmask]
> Subject: PCA
>
>
> Dear all
>
> I would like to ask 2 questions regarding PCA.
>
> 1) I have n objects with p variables recorded for each of the n
> objects. How small can p be in order to get an adequate PCA?

I don't know what "adequate" means to you. p = 2 could make perfect
sense in some problems; on the other hand, in many problems PCA with two
variables could be a pure waste of time.

> 2) Say in a PCA we retain k PCs. Each PC is interpreted in terms of
> the variables with relatively large coefficients. Does it always have
> to be the case that all p variables have large coefficients somewhere
> within the retained k PCs or is this not necessary? For example, say
> if we retained 3 PCs:
> PC1 PC2 PC3
> X1 -0.81 -0.16 0.0009
> X2 -0.03 0.66 0.009
> X3 -0.58 0.0008 0.02
> X4 0.001 0.74 0.0009
> X5 0.058 0.0008 0.75
> X6 0.02 0.008 0.66
>
> Here, all 6 variables have large coefficients somewhere within the 3
> retained PCs. Does this always have to be the case?

There is an art and a science to this: in many problems I would
"retain", meaning take seriously, a PC with a large eigenvalue and (as
far as possible) an interpretation that makes sense scientifically.

But there are situations in which all of the PCs are important (as a
restructuring of the information in the data).

John Gower in one of his papers argued, as I recall, that PCs with near
zero eigenvalues could be interesting because they show characteristics
of the data that are nearly invariant. In practice convincing examples
seem few.

To put it another way, PCA is a technique that can be applied in lots of
different styles to different problems. I doubt that there are single
simple answers to either of your questions.
************************

Dear Kim,

1. A common rule of thumb is to have about 10 times as many subjects (n)
as variables (p). p has no minimum, but of course if you are interested
in clustering variables into a smaller number of principal components
(factors) you need more variables than factors.

2. The interpretation of the factor pattern (pattern of loadings) is
arbitrary. That is why this matrix is rotated. There are different
rotation methods, like varimax, oblimin, etc. Each has its own advantages
and disadvantages. I suggest that you consult a textbook on factor
analysis.

Best wishes,

*************************

Hi,

Don't, off the top of my head, know the answer to your first question,
but for your second, the answer is "no". Generate some artificial data
consisting of six-dimensional vectors whose first two components are
independent and normally distributed with mean zero and standard
deviation 1. Choose the remaining four components independently, each
normally distributed with mean zero and standard deviation 0.2.

If you do PCA on such data you will find that two of the singular values
are much larger than the other four and that the corresponding vectors
come close to spanning the same subspace as is spanned by (1, 0, 0, 0,
0, 0) and (0, 1, 0, 0, 0, 0).
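
That experiment can be reproduced with something like the following
sketch (numpy assumed; the sample size is arbitrary):

  # Simulate the example: two components with sd 1, four with sd 0.2.
  # The two leading singular values dominate, and the leading right
  # singular vectors approximately span the plane of e1 and e2.
  import numpy as np

  rng = np.random.default_rng(42)
  n = 1000
  X = np.concatenate([rng.normal(0, 1.0, size=(n, 2)),
                      rng.normal(0, 0.2, size=(n, 4))], axis=1)
  X -= X.mean(axis=0)                       # centre before PCA

  U, s, Vt = np.linalg.svd(X, full_matrices=False)
  print(np.round(s / np.sqrt(n - 1), 3))    # approx. [1, 1, 0.2, 0.2, 0.2, 0.2]
  print(np.round(Vt[:2], 2))                # nonzero mainly in the first two
                                            # coordinates: rows span ~ e1, e2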

Hope that helps,
*****************************************
Kim,

1. You would rarely perform PCA with fewer than 4 variables because you
can visualise such data readily using conventional plots.
2. No, not all variables need to have a high loading on any of the
retained PCs. A variable which carries little information, or information
largely independent of the other variables, may behave like this.

Best regards,
********************************************
  
