Hi,
I'm working on testing the generalizability of a classifier (linear SVM) across cohorts, and I'm wondering if anyone has suggestions for improving accuracy. The challenge is that data gathered at different sites vary in scanning protocol, scanner noise, etc., making classification across sites challenging, and in my experience almost impossible even with large datasets. I haven't found any great solutions in what I've read.
One way to frame this problem: X and Y are two datasets of morphometric data, each with control and patient groups, obtained at two different sites. They have the same number of variables (features), but unfortunately the two sites use different scanners and protocols. Dataset X is split into training and validation sets (80/20) to train and test a classifier; assessed through cross-validation, the classifier achieves an accuracy of 80%. The researcher, feeling confident in the classifier's ability to generalize, tests it on dataset Y, where it achieves an accuracy below chance. Puzzled by the result, the researcher tries again on another dataset, Z, using the classifier trained on dataset X to predict controls versus patients, and achieves an accuracy of 55%. How could this researcher improve their cross-site classification results using a reliable method?
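To make the scenario concrete, here is a minimal sketch of why within-site cross-validation can overestimate cross-site performance. It uses synthetic data (all names and the injected shift/scale are my own stand-ins for scanner/protocol differences, not anyone's real pipeline):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 200, 20

# Site X: two classes separated along the first five features
y_X = rng.integers(0, 2, n)
X_X = rng.normal(size=(n, p)) + 0.8 * y_X[:, None] * (np.arange(p) < 5)

# Site Y: the same class effect, plus a systematic site offset and rescaling
# (a crude stand-in for different scanners and protocols)
y_Y = rng.integers(0, 2, n)
X_Y = rng.normal(size=(n, p)) + 0.8 * y_Y[:, None] * (np.arange(p) < 5)
X_Y = 1.5 * X_Y + 2.0

clf = LinearSVC(max_iter=10000)
cv_acc = cross_val_score(clf, X_X, y_X, cv=5).mean()  # within-site estimate
clf.fit(X_X, y_X)
cross_acc = clf.score(X_Y, y_Y)                       # cross-site estimate
print(f"within-site CV: {cv_acc:.2f}, cross-site: {cross_acc:.2f}")
```

The site shift moves every test point relative to the learned hyperplane, so the cross-site accuracy collapses toward (or below) chance even though the class effect itself is identical at both sites.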
I'm currently applying the method from "Dysregulated Brain Dynamics in a Triple-Network Saliency Model of Schizophrenia and Its Relation to Psychosis," described in its supplementary materials. I believe I've correctly replicated the analysis in R, and it is somewhat successful in increasing accuracy (58% in my test run). There are a few problems I see with this method that aren't addressed in the paper (the method is auxiliary there), and I was hoping someone would have a better solution. Here is their description of their approach: "One critical goal for developing robust biomarkers is to design a generalized classifier using data from one cohort/site to distinguish patients with schizophrenia from controls in other cohort/sites. The challenging and currently unresolved issue here is that data collected from multiple sites are subject to systematic variations (e.g. different scanning protocols). To address this issue, we developed a novel sub-space data matching algorithm to mitigate the systematic cross-site variations. Specifically, we first applied Principal Component Analysis (PCA) to data from each site to obtain their own sets of principal components (PCs). In principle, if two datasets were sampled from the same distribution, their PCs should be one-to-one matched. However, due to cross-site variations, this is not necessarily the case. To rematch the PCs, we performed a correlation analysis between the PCs of the training and test data and aligned each PC in the test data to a corresponding one in the training data based on the maximal Pearson correlation value. The training and test data were then separately transformed by projecting the original data to the matched PCs. A CART classifier was trained and tested on the transformed data."
I'm more than happy to share my code and a subsample of my data if anyone would like to replicate.
Cheers,
Mohan Gupta
www.mohanwugupta.com
To unsubscribe from the FSL list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1