Dear all,

Many thanks to the following for replying to my Factor Analysis question posted on 24th March (below):

John Sorkin
Abhaya Indrayan 
Michael Greenacre
Fionn Murtagh 
Francesca Greselin 
Leacky Kamau
Peter Das
Christian Hennig
Isaac Dialsingh

I attach their replies below (in no particular order) for those who are interested.

Best Regards,
Kim


Dr Kim Pearce PhD, CStat
Senior Statistician
Haematological Sciences
Institute of Cellular Medicine
William Leech Building
Medical School
Newcastle University
Framlington Place
Newcastle upon Tyne
NE2 4HH

Tel: (0044) (0)191 208 8142





----- Original message -----
From: Kim Pearce <[log in to unmask]>
To: 
Subject: Factor Analysis/PCA : form of data
Date: Tue, 24 Mar 2015 15:34:38 +0000

Hi everyone,

A quick question....

In a hypothetical data set, say we have N patients and p 
biomarkers.  The biomarkers each take a continuous form of 
measurement.  There is also another variable, say, "symptom score" 
which is recorded for each patient with values ranging from 0 to 50.

I want to generate a 'map' of these patients where patients 
clustering together have similar characteristics.  I intend to use 
either factor analysis or PCA.  Now I had originally thought of 
doing this analysis using the p biomarkers, generating my plot and 
superimposing the 'symptom score' on each point on the plot to see 
if clusters of individuals also have similar symptom 
scores....however I then started wondering if I could actually do a 
factor analysis/PCA using the p biomarkers *and* symptom score 
together in an analysis.
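
A minimal sketch of the first option (PCA on the p biomarkers only, with the symptom score kept aside for overlaying on the map), assuming Python with NumPy. The function name pca_scores and the simulated data are hypothetical stand-ins, not part of the original question.

```python
import numpy as np

def pca_scores(X, n_components=2, standardise=True):
    """Return PCA scores for the rows of X and the variance share per component."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)                 # centre each biomarker
    if standardise:
        Z = Z / X.std(axis=0, ddof=1)      # PCA on correlations
    # SVD of the centred data: U * s gives the principal component scores
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Simulated stand-in data (hypothetical): N patients, p biomarkers
rng = np.random.default_rng(0)
N, p = 60, 5
biomarkers = rng.normal(size=(N, p))
symptom = rng.integers(0, 51, size=N)      # symptom score in 0..50

scores, explained = pca_scores(biomarkers)
# scores[:, 0] and scores[:, 1] are the map coordinates of each patient;
# symptom can then be superimposed, e.g. as point colour in a scatter plot.
```

The symptom score plays no part in building the axes here, so any pattern seen when it is overlaid comes from the biomarkers alone.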

Any views are much appreciated.

Many thanks in advance.

Kind regards,
Kim.

-----Original Message-----

Hi Kim,

Following on from John Sorkin's suggestion, you may want to look 
into methods for matrix reordering, seriation, and visualisation; 
e.g.

Liiv, I. (2010). Seriation and matrix reordering methods: An 
historical overview. Statistical Analysis and Data Mining, 3, 70–91.
Liiv, I., Opik, R., Ubi, J., & Stasko, J. (2012). Visual matrix 
explorer for collaborative seriation. Wiley Interdisciplinary 
Reviews: Computational Statistics, 4, 85-97.
Wu, H. M., Tien, Y. J. & Chen, C. H. (2010). GAP: A graphical 
environment for matrix visualization and cluster analysis. 
Computational Statistics and Data Analysis, 54, 767-778. [See also, 
"GAP: Generalized Association Plots", 
http://gap.stat.sinica.edu.tw/Software/GAP/ ]

If you are interested in examining the association between the 
biomarker variables and the symptom score variable, then you may 
also want to take a look at methods for visualising correlation 
matrices; e.g. 
http://weitaiyun.blogspot.com/2009/03/visulization-of-correlation-matrix.html . 
(With the usual caveats about correlation coefficients.)
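
As a rough illustration of the correlation matrix mentioned above, the following sketch (assuming Python with NumPy; the simulated data and variable names are hypothetical) computes the (p+1) x (p+1) correlation matrix and pulls out the column relating each biomarker to the symptom score. This matrix is what a correlation plot would display.

```python
import numpy as np

# Simulated stand-in data (hypothetical): the symptom score is made to
# depend on the first biomarker so that one correlation stands out.
rng = np.random.default_rng(1)
N, p = 80, 4
biomarkers = rng.normal(size=(N, p))
symptom = np.clip(25 + 8 * biomarkers[:, 0] + rng.normal(scale=4, size=N), 0, 50)

# Columns are variables, so rowvar=False
data = np.column_stack([biomarkers, symptom])
R = np.corrcoef(data, rowvar=False)

# Correlation of each biomarker with the symptom score (last column of R)
biomarker_vs_symptom = R[:p, p]
strongest = int(np.argmax(np.abs(biomarker_vs_symptom)))
```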

Kind regards,
------------------------------------------------------------
Dear Kim,
I am interested in working with you on this kind of problem. With some colleagues, we have developed clustering techniques which perform clusterwise regression. Hence, say, the symptom score could be explained in a different way in the group of healthy people and in the group of non-healthy people. Please look at the attached papers. 

In the first one we introduce robust regression with local models for groups of patients having similar features. Hence it could be very effective in discriminating the two groups of people and in providing a way to interpret how the symptom score is related to the p biomarkers. 

In the second one we propose a model for estimating latent factors in groups. This means that correlations between the p biomarkers could be explained by a few underlying factors, which are the cause, or the explanation, of the observed biomarkers. 

Hence both of your aims are covered. Both are robustly estimated models, hence they are able to deal with departures from the underlying normality assumptions, which in many cases are too strong for real data. 
If you are interested, we could have a chat on these advanced methods.

Cheers,

P.S. The heatmap() function displays the results of a hierarchical clustering by permuting the rows and the columns of a matrix to place similar values near each other according to the clustering. Hierarchical clustering can be very effective in some cases, but it depends on the distance you choose. Our methodologies are instead based on fitting a model to the data, so that you can make predictions as well as discover and interpret correlations between the measured variables. The advantage of having a model, if it is a good model for the data at hand, is that you are able to make interpretations.

-----------------------------------------------------------------

Kim:

Please see if you would like to consider cluster analysis in place of factor analysis for numerical clustering, and then plot. 

-----------------------------------------------------------------

You could do a PCA where the first dimension is constrained to be the symptom score.  This concentrates that one variable on a single axis so it's not mixed up with the others on different principal axes.  I did something similar in a different context; you can see the write-up in my book Biplots in Practice at:

  http://multivariatestatistics.org/biplots.html

Go to chapter 12.
The only problem is that I don't think I've posted the R code for that chapter yet; I wanted to completely redesign and simplify the presentation of the code and haven't had time!
Good luck,
-------------------------------------------------------------------

I bet a PCA in which you 'partial out' the symptom scores is possible (at least in SAS), depending on the number of unique scores that you have.
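
One way to read 'partialling out' is sketched below, assuming Python with NumPy rather than SAS: regress each biomarker linearly on the symptom score, then run a PCA on the residuals. The function name partial_pca and the simulated data are hypothetical, and treating the score as a single continuous covariate is an assumption; a categorical treatment of the unique scores would need a different design matrix.

```python
import numpy as np

def partial_pca(X, covariate, n_components=2):
    """PCA of X after removing the linear effect of one covariate."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(covariate, dtype=float)
    Xc = X - X.mean(axis=0)
    cc = c - c.mean()
    # Least-squares slope of each biomarker on the (centred) covariate
    slopes = Xc.T @ cc / (cc @ cc)
    residuals = Xc - np.outer(cc, slopes)
    # PCA of the residual matrix via SVD
    U, s, Vt = np.linalg.svd(residuals, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    return scores, residuals

# Simulated stand-in data (hypothetical)
rng = np.random.default_rng(2)
N, p = 70, 5
symptom = rng.integers(0, 51, size=N).astype(float)
biomarkers = rng.normal(size=(N, p)) + 0.05 * symptom[:, None]

scores, residuals = partial_pca(biomarkers, symptom)
```

By construction the resulting component scores are uncorrelated with the symptom score, so the map shows only the biomarker variation that the score does not linearly account for.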

Kind regards,

--------------------------------------------------------------------

Kim
A heatmap may do exactly what you are looking for, at least as far as giving a visual representation of the clustering. I suggest you look at the R routine heatmap.

---------------------------------------------------------------------
A few remarks.

The difference between factor analysis and principal components analysis
is often unclear. I see factor analysis as a relic from an age without
computer power. You postulated one or a few factors and saw each variable
as a linear combination of these factors, plus its own variance.
Statistical programs sometimes quietly do PCA when you ask for factor
analysis.

If you map the patients, in a sense you use PCA to reduce the number of
variables, which is a good thing. Following your original idea, why not
first do a regression of the symptom score on the first few principal
components?
It would help you interpret what happens when you add the symptom score in
the PCA kettle to the biomarkers.
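
A minimal sketch of that regression, assuming Python with NumPy; the simulated data, the choice of k = 2 components and all variable names are hypothetical.

```python
import numpy as np

# Simulated stand-in data (hypothetical)
rng = np.random.default_rng(3)
N, p, k = 100, 6, 2
biomarkers = rng.normal(size=(N, p))
symptom = np.clip(
    25 + 6 * biomarkers[:, 0] - 4 * biomarkers[:, 1] + rng.normal(scale=3, size=N),
    0, 50,
)

# PCA on the standardised biomarkers; keep the first k component scores
Z = (biomarkers - biomarkers.mean(axis=0)) / biomarkers.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pcs = U[:, :k] * s[:k]

# Ordinary least squares: symptom score ~ intercept + first k components
design = np.column_stack([np.ones(N), pcs])
coef, *_ = np.linalg.lstsq(design, symptom, rcond=None)
fitted = design @ coef
r_squared = 1 - np.sum((symptom - fitted) ** 2) / np.sum((symptom - symptom.mean()) ** 2)
```

A small r_squared would suggest that the leading biomarker components carry little information about the symptom score, which is worth knowing before adding the score to the PCA itself.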

Cheers,
--------------------------------------------------------------------

Dear Kim,

there is no purely statistical reason why one or the other is preferable. 
It depends on what you want really. A map based on biomarkers only shows 
you how the patients are mapped in terms of biomarkers alone, a map that 
combines biomarkers and symptom score will show you the "landscape" of 
patients based on combined information. You may have use for one as well 
as the other.

To me personally mapping the patients based on biomarkers alone seems 
intuitively more appealing for a number of reasons:
a) If you want to relate the symptom score to what you get from the 
biomarkers as a whole, you may want to treat the biomarkers and the 
symptom score in a different way in this analysis, reflecting their
different role in interpretation of the results.
b) Chances are that the measurement scales of the biomarkers are 
compatible whereas the symptom score is very different. This can be 
"repaired" to some extent by standardisation/running the PCA on 
correlations, but still this is a somewhat artificial device. If the 
measurements of all the biomarkers are indeed comparable and more variance 
means, in some sense, that the corresponding biomarker is more important, 
you may actually *not* want to standardise them.
c) If you have p biomarkers and one symptom score, given that you 
standardise all variables, the biomarkers as a whole have p times more 
influence on the resulting PCA than the symptom score. Particularly if p 
is large, the symptom score may not change much. Although you may want a 
map that takes into account both symptom score and biomarkers, you may not 
be happy with the fact that the symptom score, which is a very special 
variable in such a collection, has a weight of just 1/(p+1). (This can be 
amended by reweighting the variables, but this would require a bold 
subjective decision.)
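
The 1/(p+1) figure in c) can be checked numerically. The sketch below assumes Python with NumPy and reads "weight" as the symptom score's share of the total variance after standardisation, which is one possible interpretation. The function symptom_share, the weight value and the simulated data are hypothetical.

```python
import numpy as np

def symptom_share(p, weight=1.0):
    """Share of total variance held by the symptom score among p standardised
    biomarkers when its standardised column is scaled by sqrt(weight)."""
    return weight / (p + weight)

# Simulated stand-in data (hypothetical): p biomarkers plus one symptom score
rng = np.random.default_rng(4)
N, p, weight = 50, 9, 3.0
data = np.column_stack([rng.normal(size=(N, p)), rng.integers(0, 51, size=N)])

# Standardise every column, then up-weight the symptom score column
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Z[:, -1] *= np.sqrt(weight)

variances = Z.var(axis=0, ddof=1)          # 1 for each biomarker, weight for the score
empirical_share = variances[-1] / variances.sum()
```

With weight = 1 the share is 1/(p+1), as stated above; choosing any other weight is exactly the subjective decision referred to.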

Despite all these arguments, what should decide the issue is the 
interpretation you intend the resulting map/clustering to have, which 
is a biological issue rather than a statistical one.

Best wishes,
-----------------------------------------------------------------------------------
Kim,
I am not sure if this makes sense. But perform a PCA on the p biomarkers. Then choose an appropriate number of principal components and regress the 'selected components' on symptom score.

--------------------------------------------------------------------------------------

_________________________________________________________
You may leave the list at any time by sending the command
SIGNOFF allstat
to [log in to unmask], leaving the subject line blank.
_________________________________________________________

