JISCMail - ALLSTAT Archives

Email discussion lists for the UK Education and Research communities

ALLSTAT Archives

allstat@JISCMAIL.AC.UK

Subject:

Factor Analysis/PCA : form of data : Replies

From:

Kim Pearce <[log in to unmask]>

Reply-To:

Kim Pearce <[log in to unmask]>

Date:

Fri, 27 Mar 2015 10:09:25 +0000


Dear all,



Many thanks to the following for replying to my Factor Analysis question posted on 24th March (below):



John Sorkin

Abhaya Indrayan 

Michael Greenacre

Fionn Murtagh 

Francesca Greselin 

Leacky Kamau

Peter Das

Christian Hennig

Isaac Dialsingh



I attach their replies below (in no particular order)  for those who are interested.



Best Regards,

Kim





Dr Kim Pearce PhD, CStat

Senior Statistician

Haematological Sciences

Institute of Cellular Medicine

William Leech Building

Medical School

Newcastle University

Framlington Place

Newcastle upon Tyne

NE2 4HH



Tel: (0044) (0)191 208 8142











----- Original message -----

From: Kim Pearce <[log in to unmask]>

To: 

Subject: Factor Analysis/PCA : form of data

Date: Tue, 24 Mar 2015 15:34:38 +0000



Hi everyone,



A quick question....



In a hypothetical data set, say we have N patients and p 

biomarkers.  The biomarkers each take a continuous form of 

measurement.  There is also another variable, say, "symptom score" 

which is recorded for each patient with values ranging from 0 to 50.



I want to generate a 'map' of these patients where patients 

clustering together have similar characteristics.  I intend to use 

either factor analysis or PCA.  Now I had originally thought of 

doing this analysis using the p biomarkers, generating my plot and 

superimposing the 'symptom score' on each point on the plot to see 

if clusters of individuals also have similar symptom 

scores....however I then started wondering if I could actually do a 

factor analysis/PCA using the p biomarkers *and* symptom score 

together in an analysis.



Any views are much appreciated.



Many thanks in advance.



Kind regards,

Kim.



-----Original Message-----



Hi Kim,



Following on from John Sorkin's suggestion, you may want to look 

into methods for matrix reordering, seriation, and visualisation; 

e.g.



Liiv, I. (2010). Seriation and matrix reordering methods: An 

historical overview. Statistical Analysis and Data Mining, 3, 70–91.

Liiv, I., Opik, R., Ubi, J., & Stasko, J. (2012). Visual matrix 

explorer for collaborative seriation. Wiley Interdisciplinary 

Reviews: Computational Statistics, 4, 85-97.

Wu, H. M., Tien, Y. J. & Chen, C. H. (2010). GAP: A graphical 

environment for matrix visualization and cluster analysis. 

Computational Statistics and Data Analysis, 54, 767-778. [See also, 

"GAP: Generalized Association Plots", 

http://gap.stat.sinica.edu.tw/Software/GAP/ ]



If you are interested in examining the association between the 

biomarker variables and the symptom score variable, then you may 

also want to take a look at methods for visualising correlation 

matrices; e.g.

http://weitaiyun.blogspot.com/2009/03/visulization-of-correlation-matrix.html

(With the usual caveats about correlation coefficients.)



Kind regards,

------------------------------------------------------------

Dear Kim,

I am interested in working with you on this kind of problem. With some colleagues, we have developed clustering techniques which perform clusterwise regression. Hence, say, the symptom score could be explained in a different way in the group of healthy people and in the group of non-healthy people. Please look at the attached papers. 



In the first one we introduce robust regression with local models for groups of patients having similar features. Hence it could be very effective in discriminating the two groups of people and in providing a way to interpret how the symptom score is related to the p biomarkers. 



In the second one we propose a model for estimating latent factors in groups. This means that correlations between the p biomarkers could be explained by some few underlying factors, which are the cause, or the explanation of the observed biomarkers. 



Hence both of your aims are covered. Both are robustly estimated models, hence they are able to deal with departures from the underlying normality assumptions, which in many cases are too strong for real data. 

If you are interested, we could have a chat on these advanced methods.



Cheers,



P.S. The heatmap() function displays the results of a hierarchical clustering by permuting the rows and the columns of a matrix to place similar values near each other according to the clustering. Hierarchical clustering can be very effective in some cases, but it depends on the distance you choose. Our methodologies are instead based on fitting a model to the data, so that you can make predictions as well as discover and interpret correlations between the measured variables. The advantage of having a model, if it is a good model for the data at hand, is that you are able to make interpretations.



-----------------------------------------------------------------



Kim:



You may like to consider cluster analysis in place of factor analysis to form the clusters numerically, and then plot them. 



-----------------------------------------------------------------



You could do a PCA where the first dimension is constrained to be the symptom score.  This concentrates that one variable on a single axis so it's not mixed up with the others on different principal axes.  I did something similar in a different context; you can see the write-up in my book Biplots in Practice at:



  http://multivariatestatistics.org/biplots.html



go to chapter 12.

Only problem is I don't think I've posted the R code for that chapter yet, I wanted to completely redesign and simplify the presentation of the code and haven't had time!

Good luck,

-------------------------------------------------------------------



I bet a PCA for which you 'partial out' the symptom scores is possible (at least in SAS), depending on the number of unique scores that you have.
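
One way to read 'partialling out' is: regress each biomarker on the symptom score, keep the residuals, and run the PCA on those. The sketch below is only an illustration of that reading on simulated data, written in plain Python rather than SAS; the helper names (`residualise`, `first_axis`), the simulated loadings and the use of power iteration in place of a proper eigen-solver are all assumptions made for the example.

```python
import random
import statistics


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualise(y, x):
    """Residuals from a least-squares regression of y on x with an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    slope = dot(xc, yc) / dot(xc, xc)
    return [b - slope * a for a, b in zip(xc, yc)]


def first_axis(columns, iterations=500):
    """Leading principal axis of centred columns, found by power iteration."""
    n = len(columns[0])
    cov = [[dot(ci, cj) / (n - 1) for cj in columns] for ci in columns]
    v = [1.0] * len(columns)
    for _ in range(iterations):
        w = [dot(row, v) for row in cov]
        norm = dot(w, w) ** 0.5
        v = [c / norm for c in w]
    return v


random.seed(2)
n = 120  # hypothetical number of patients
symptom = [random.uniform(0, 50) for _ in range(n)]
latent = [random.gauss(0, 4) for _ in range(n)]
# Simulated biomarkers: part symptom score, part shared latent factor, part noise.
biomarkers = [
    [b_s * s + b_f * f + random.gauss(0, 2) for s, f in zip(symptom, latent)]
    for b_s, b_f in [(0.5, 1.0), (-0.3, 0.8), (0.8, -0.6), (0.1, 0.4)]
]

# Step 1: remove the linear effect of the symptom score from every biomarker.
residuals = [residualise(b, symptom) for b in biomarkers]
# Step 2: PCA on the residuals (only the first component is extracted here).
axis = first_axis(residuals)
pc1 = [dot(axis, [col[i] for col in residuals]) for i in range(n)]

# The resulting scores are linearly uncorrelated with the symptom score.
sc = [s - statistics.fmean(symptom) for s in symptom]
corr = dot(pc1, sc) / (dot(pc1, pc1) * dot(sc, sc)) ** 0.5
print(abs(corr) < 1e-9)
```

By construction the map built from these residuals shows only the biomarker variation that the symptom score does not explain linearly, so it answers a different question from a PCA on the raw biomarkers.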



Kind regards,



--------------------------------------------------------------------



Kim

A heatmap may do exactly what you are looking for, at least as far as giving a visual representation of the clustering. I suggest you look at the R routine heatmap.



---------------------------------------------------------------------

A few remarks.



The difference between factor analysis and principal components analysis

is often unclear. I see factor analysis as a relic from an age without

computer power. You postulated one or a few factors and saw each variable

as a linear combination of these factors, plus an own variance.

Statistical programs sometimes quietly do PCA when you ask for factor

analysis.



If you map the patients, in a sense you use PCA to reduce the number of

variables, which is a good thing. Following your original idea, why not

first do a regression of the symptom score on the first few principal

components?

It would help you interpret what happens when you add the symptom score in

the PCA kettle to the biomarkers.
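
A rough sketch of that suggestion, on simulated data and in plain Python: run the PCA on the biomarkers only, then regress the symptom score on the scores of the first two components. Everything specific here is an assumption for illustration (the simulated loadings, the helper names, two components, and power iteration with deflation standing in for a library eigen-solver).

```python
import random
import statistics


def centre(col):
    m = statistics.fmean(col)
    return [x - m for x in col]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector for a symmetric PSD matrix."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = dot(w, w) ** 0.5
        v = [c / norm for c in w]
    return dot(v, [dot(row, v) for row in matrix]), v


def principal_component_scores(columns, k):
    """Scores on the first k principal components of centred columns."""
    n, p = len(columns[0]), len(columns)
    cov = [[dot(ci, cj) / (n - 1) for cj in columns] for ci in columns]
    scores = []
    for _ in range(k):
        lam, v = leading_eigenpair(cov)
        scores.append([dot(v, [col[i] for col in columns]) for i in range(n)])
        # Deflate: remove the component just found before extracting the next.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    return scores


random.seed(1)
n = 150  # hypothetical number of patients
latent1 = [random.gauss(0, 3) for _ in range(n)]
latent2 = [random.gauss(0, 1.5) for _ in range(n)]
loadings = [(1.0, 0.2), (0.8, -0.4), (0.6, 0.6), (0.4, -0.8), (0.2, 1.0)]
biomarkers = [
    centre([a * f1 + b * f2 + random.gauss(0, 0.5)
            for f1, f2 in zip(latent1, latent2)])
    for a, b in loadings
]
symptom = [25 + 2 * f1 + random.gauss(0, 2) for f1 in latent1]

pc_scores = principal_component_scores(biomarkers, k=2)
y = centre(symptom)
# PC scores are mutually uncorrelated, so each slope is a simple regression.
slopes = [dot(s, y) / dot(s, s) for s in pc_scores]
fitted = [sum(b * s[i] for b, s in zip(slopes, pc_scores)) for i in range(n)]
r_squared = dot(fitted, fitted) / dot(y, y)
print(slopes, round(r_squared, 3))
```

A large R-squared would suggest the symptom score is already well represented on the biomarker map, so adding it to the PCA would change little; a small one would suggest it carries information the biomarkers do not.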



Cheers,

--------------------------------------------------------------------



Dear Kim,



there is no purely statistical reason why one or the other is preferable. 

It depends on what you want really. A map based on biomarkers only shows 

you how the patients are mapped in terms of biomarkers alone, a map that 

combines biomarkers and symptom score will show you the "landscape" of 

patients based on combined information. You may have use for one as well 

as the other.



To me personally mapping the patients based on biomarkers alone seems 

intuitively more appealing for a number of reasons:

a) If you want to relate the symptom score to what you get from the 

biomarkers as a whole, you may want to treat the biomarkers and the 

symptom score in a different way in this analysis, reflecting their

different role in interpretation of the results.

b) Chances are that the measurement scales of the biomarkers are 

compatible whereas the symptom score is very different. This can be 

"repaired" to some extent by standardisation/running the PCA on 

correlations, but still this is a somewhat artificial device. If the 

measurements of all the biomarkers are indeed comparable and more variance 

means, in some sense, that the corresponding biomarker is more important, 

you may actually *not* want to standardise them.

c) If you have p biomarkers and one symptom score, given that you 

standardise all variables, the biomarkers as a whole have p times more 

influence on the resulting PCA than the symptom score. Particularly if p 

is large, the symptom score may not change much. Although you may want a 

map that takes into account both symptom score and biomarkers, you may not 

be happy with the fact that the symptom score, which is a very special 

variable in such a collection, has a weight of just 1/(p+1). (This can be 

amended by reweighting the variables, but this would require a bold 

subjective decision.)
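
The arithmetic behind point (c) can be checked on simulated data. This is only a sketch: the data are made up, and `standardise` and `variance_shares` are names invented for the example. After standardisation every variable has variance 1, so the symptom score contributes 1/(p+1) of the total variance; multiplying it by a weight w changes its share to w^2/(p + w^2), so w = sqrt(p) is one (subjective) choice that balances it against the biomarkers as a block.

```python
import random
import statistics


def standardise(column):
    """Centre a column and scale it to unit (population) variance."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]


def variance_shares(columns, weights=None):
    """Share of total variance carried by each weighted, standardised column."""
    weights = weights or [1.0] * len(columns)
    variances = [
        statistics.pvariance([w * x for x in standardise(col)])
        for col, w in zip(columns, weights)
    ]
    total = sum(variances)
    return [v / total for v in variances]


random.seed(0)
n, p = 200, 9  # hypothetical: 200 patients, 9 biomarkers
biomarkers = [[random.gauss(0, 1 + j) for _ in range(n)] for j in range(p)]
symptom = [random.uniform(0, 50) for _ in range(n)]
columns = biomarkers + [symptom]

# Equal weights: the symptom score carries 1/(p+1) of the total variance.
equal = variance_shares(columns)
print(round(equal[-1], 3))    # 0.1, i.e. 1/(9+1)

# Weight sqrt(p) on the symptom score: it now matches the biomarker block.
boosted = variance_shares(columns, [1.0] * p + [p ** 0.5])
print(round(boosted[-1], 3))  # 0.5
```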



Still, despite all these arguments, what should decide the issue is 

what interpretation you intend the resulting map/clustering to have, which 

is a biological issue rather than a statistical one.



Best wishes,

-----------------------------------------------------------------------------------

Kim,

I am not sure if this makes sense, but you could perform a PCA on the p biomarkers, then choose an appropriate number of principal components and regress the 'selected components' on symptom score.



--------------------------------------------------------------------------------------



_________________________________________________________

You may leave the list at any time by sending the command

SIGNOFF allstat

to [log in to unmask], leaving the subject line blank.

_________________________________________________________





