Dear all,

Many thanks to the following for replying to my Factor Analysis question posted on 24th March (below):

John Sorkin
Abhaya Indrayan
Michael Greenacre
Fionn Murtagh
Francesca Greselin
Leacky Kamau
Peter Das
Christian Hennig
Isaac Dialsingh

I attach their replies below (in no particular order) for those who are interested.

Best Regards,
Kim

Dr Kim Pearce PhD, CStat
Senior Statistician
Haematological Sciences
Institute of Cellular Medicine
William Leech Building
Medical School
Newcastle University
Framlington Place
Newcastle upon Tyne
NE2 4HH
Tel: (0044) (0)191 208 8142

----- Original message -----
From: Kim Pearce <[log in to unmask]>
To:
Subject: Factor Analysis/PCA : form of data
Date: Tue, 24 Mar 2015 15:34:38 +0000

Hi everyone,

A quick question....

In a hypothetical data set, say we have N patients and p biomarkers. The biomarkers each take a continuous form of measurement. There is also another variable, say, "symptom score", which is recorded for each patient with values ranging from 0 to 50.

I want to generate a 'map' of these patients where patients clustering together have similar characteristics. I intend to use either factor analysis or PCA.

Now I had originally thought of doing this analysis using the p biomarkers, generating my plot and superimposing the 'symptom score' on each point on the plot to see if clusters of individuals also have similar symptom scores. However, I then started wondering if I could actually do a factor analysis/PCA using the p biomarkers *and* symptom score together in one analysis.

Any views are much appreciated.

Many thanks in advance.

Kind regards,
Kim.

-----Original Message-----

Hi Kim,

Following on from John Sorkin's suggestion, you may want to look into methods for matrix reordering, seriation, and visualisation; e.g.

Liiv, I. (2010). Seriation and matrix reordering methods: An historical overview. Statistical Analysis and Data Mining, 3, 70–91.

Liiv, I., Opik, R., Ubi, J., & Stasko, J. (2012).
Visual matrix explorer for collaborative seriation. Wiley Interdisciplinary Reviews: Computational Statistics, 4, 85-97.

Wu, H. M., Tien, Y. J. & Chen, C. H. (2010). GAP: A graphical environment for matrix visualization and cluster analysis. Computational Statistics and Data Analysis, 54, 767-778.
[See also, "GAP: Generalized Association Plots", http://gap.stat.sinica.edu.tw/Software/GAP/ ]

If you are interested in examining the association between the biomarker variables and the symptom score variable, then you may also want to take a look at methods for visualising correlation matrices; e.g. http://weitaiyun.blogspot.com/2009/03/visulization-of-correlation-matrix.html . (With the usual caveats about correlation coefficients.)

Kind regards,

------------------------------------------------------------

Dear Kim,

I am interested in working with you on this kind of problem. With some colleagues, we have developed clustering techniques which perform clusterwise regression. Hence, say, the symptom score could be explained in a different way in the group of healthy people and in the group of non-healthy people.

Please look at the attached papers. In the first one we introduce robust regression with local models for groups of patients having similar features. Hence it could be very effective in discriminating between the two groups of people and in providing a way to interpret how the symptom score is related to the p biomarkers.

In the second one we propose a model for estimating latent factors in groups. This means that correlations between the p biomarkers could be explained by a few underlying factors, which are the cause, or the explanation, of the observed biomarkers.

Hence both of your aims are covered. Both are robustly estimated models, hence they are able to deal with departures from the underlying normality assumptions, which in many cases are too strong for real data.

If you are interested, we could have a chat about these advanced methods.

Cheers,

P.S.
The heatmap() function displays the results of a hierarchical clustering by permuting the rows and the columns of a matrix to place similar values near each other according to the clustering. Hierarchical clustering can be very effective in some cases, but it depends on the distance you choose. Our methodologies are instead based on fitting a model to the data, in such a way that you can make predictions, as well as discover and interpret correlations between the measured variables involved. The advantage of having a model, if it is a good model for the data at hand, is that you are able to make interpretations.

-----------------------------------------------------------------

Kim:

Please see if you would like to consider cluster analysis in place of factor analysis for numerical clustering, and then plot.

-----------------------------------------------------------------

You could do a PCA where the first dimension is constrained to be the symptom score. This concentrates that one variable on a single axis so it's not mixed up with the others on different principal axes.

I did something similar in a different context; you can see the write-up in chapter 12 of my book Biplots in Practice at http://multivariatestatistics.org/biplots.html

The only problem is that I don't think I've posted the R code for that chapter yet. I wanted to completely redesign and simplify the presentation of the code and haven't had time!

Good luck,

-------------------------------------------------------------------

I bet a PCA for which you 'partial out' the symptom scores is possible (at least in SAS), depending on the number of unique scores that you have.

Kind regards,

--------------------------------------------------------------------

Kim

A heatmap may do exactly what you are looking for, at least as far as giving a visual representation of the clustering. I suggest you look at the R routine heatmap.
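
[Illustration added to the digest] To make the row-reordering idea behind a heatmap concrete, here is a minimal sketch in Python. It is not the R heatmap routine and does not reproduce its defaults. It assumes Euclidean distance on standardised biomarkers and average linkage. The data values and the names standardise, leaf_order, biomarkers and symptom_score are all hypothetical and chosen only for illustration. The symptom score is not used in the clustering; it is only printed next to each patient after reordering, in the spirit of the original idea of superimposing it on the map.

```python
import math

def standardise(rows):
    """Scale each column (biomarker) to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def leaf_order(rows):
    """Average-linkage agglomerative clustering on Euclidean distances.

    Returns the row indices in one valid dendrogram leaf order, so that
    rows merged early end up next to each other.
    """
    dist = [[math.dist(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean of all pairwise distances between clusters.
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

# Hypothetical data: 6 patients, 3 biomarkers, two clearly separated groups.
biomarkers = [
    [1.0, 2.0, 1.5],
    [9.0, 8.5, 9.5],
    [1.2, 1.8, 1.4],
    [8.8, 9.1, 9.0],
    [0.9, 2.2, 1.7],
    [9.3, 8.7, 9.2],
]
# Hypothetical symptom scores (0 to 50), not used in the clustering itself.
symptom_score = [5, 40, 8, 44, 3, 38]

order = leaf_order(standardise(biomarkers))
for i in order:
    print(f"patient {i}: symptom score {symptom_score[i]:2d}, biomarkers {biomarkers[i]}")
```

In the reordered listing, patients with similar biomarker profiles sit next to each other, so one can see at a glance whether neighbouring patients also have similar symptom scores. The result depends on the choice of distance and linkage.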
---------------------------------------------------------------------

A few remarks.

The difference between factor analysis and principal components analysis is often unclear. I see factor analysis as a relic from an age without computer power. You postulated one or a few factors and saw each variable as a linear combination of these factors, plus its own variance. Statistical programs sometimes quietly do PCA when you ask for factor analysis.

If you map the patients, in a sense you use PCA to reduce the number of variables, which is a good thing.

Following your original idea, why not first do a regression of the symptom score on the first few principal components? It would help you interpret what happens when you add the symptom score to the biomarkers in the PCA kettle.

Cheers,

--------------------------------------------------------------------

Dear Kim,

There is no purely statistical reason why one or the other is preferable. It depends on what you want really. A map based on biomarkers only shows you how the patients are mapped in terms of biomarkers alone; a map that combines biomarkers and symptom score will show you the "landscape" of patients based on the combined information. You may have use for one as well as the other.

To me personally, mapping the patients based on biomarkers alone seems intuitively more appealing for a number of reasons:

a) If you want to relate the symptom score to what you get from the biomarkers as a whole, you may want to treat the biomarkers and the symptom score in a different way in this analysis, reflecting their different roles in the interpretation of the results.

b) Chances are that the measurement scales of the biomarkers are compatible whereas the symptom score is very different. This can be "repaired" to some extent by standardisation/running the PCA on correlations, but still this is a somewhat artificial device.
If the measurements of all the biomarkers are indeed comparable and more variance means, in some sense, that the corresponding biomarker is more important, you may actually *not* want to standardise them.

c) If you have p biomarkers and one symptom score, given that you standardise all variables, the biomarkers as a whole have p times more influence on the resulting PCA than the symptom score. Particularly if p is large, adding the symptom score may not change much. Although you may want a map that takes into account both symptom score and biomarkers, you may not be happy with the fact that the symptom score, which is a very special variable in such a collection, has a weight of just 1/(p+1). (This can be amended by reweighting the variables, but this would require a bold subjective decision.)

Still, despite all these arguments, what should decide the issue is the interpretation you intend the resulting map/clustering to have, which is a biological issue rather than a statistical one.

Best wishes,

-----------------------------------------------------------------------------------

Kim,

I am not sure if this makes sense. But perform a PCA on the p biomarkers. Then choose an appropriate number of principal components and regress the 'selected components' on symptom score.

--------------------------------------------------------------------------------------

_________________________________________________________
You may leave the list at any time by sending the command SIGNOFF allstat to [log in to unmask], leaving the subject line blank.