Hello,

This question concerns SVM-RFE and gene selection from gene expression data. I am hoping that someone can shed some light.

The characteristic of this data is that it has few samples (50-100 patterns/examples) and a large number of dimensions/features (~22000). I have used SVMs, and one of the data preprocessing steps is to normalise the inputs in order to avoid a badly conditioned or near-singular Hessian matrix of the data, which breaks the QP optimiser. This is especially true of polynomial kernels.

Now, I came across a paper titled "Gene extraction for cancer diagnosis by support vector machines - An improvement", by TM Huang and V. Kecman. This is what the paper says:

"Interestingly, the gene with higher standard deviation tends to have higher ranking. This trend suggests that RFE-SVMs with sample normalization will likely pick up genes with expression that vary more across the samples. This fits well with the assumption that a gene is less relevant if its expression does not vary much across the complete data set. Such a general trend cannot be observed in the bottom graph (where a sample and the feature normalization are applied) and there is no connection between the standard deviation of the gene and gene ranking. This phenomenon may be due to the fact that the feature normalization step in the second preprocessing procedure will ensure that each gene has the same standard deviation. Hence, a gene with higher standard deviation originally will no longer be advantageous over a gene having a smaller standard deviation."

"A general practice for producing good results with SVMs is to normalize each input (feature) to the one with mean zero and standard deviation of one as in the feature normalization step. However, in this case, this simple rule does not perform as well as expected: the error rate of applying both sample and feature normalization is higher than when only the sample normalization is performed. This phenomenon may be due to the fact that the feature normalization step in the second preprocessing procedure filters out the information about the spread of the expression for each gene as discussed previously and this information is helpful for selecting the relevant gene and classification."

This has thrown me off a little. When I didn't normalise the inputs, the accuracy of the SVMs dropped. This is expected, since all the literature says that one must normalise SVM inputs.

Can anyone help? Personally, I don't know what one should do if one shouldn't normalise the data and therefore ends up with badly conditioned kernel matrices.

Furthermore, I don't understand why one would normalise the samples/patterns to zero mean and a standard deviation of 1. Gene expression normalisation methods such as MAS5, GC-RMA etc. are already used to make sure you begin with normalised chip data. The paper does not mention this step, if they performed it at all.

Thank you.

Sincerely,
Monika Ray

***********************************************************************
The sweetest songs are those that tell of our saddest thought...

Computational Intelligence Centre, Washington University
St. Louis, MO
**********************************************************************
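
P.S. To make the two preprocessing procedures concrete, here is a minimal sketch of how I read them. It is not the paper's code. It assumes that "sample normalization" means scaling each sample (row) to zero mean and unit standard deviation across its genes, and that "feature normalization" means doing the same for each gene (column) across the samples. The toy data, the helper names (zscore, normalise_samples, normalise_features, gene_stds) and the spreads are made up for illustration.

```python
import random
import statistics


def zscore(values):
    """Scale numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant vector carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def normalise_samples(data):
    """Assumed 'sample normalization': z-score each row (one sample, all genes)."""
    return [zscore(row) for row in data]


def normalise_features(data):
    """Assumed 'feature normalization': z-score each column (one gene, all samples)."""
    columns = [zscore(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*columns)]


def gene_stds(data):
    """Standard deviation of each gene (column) across the samples."""
    return [statistics.pstdev(col) for col in zip(*data)]


random.seed(0)
# Toy data: 20 samples x 5 genes, each gene drawn with a different spread.
spreads = [1.0, 2.0, 3.0, 4.0, 5.0]
data = [[random.gauss(0.0, s) for s in spreads] for _ in range(20)]

sample_only = normalise_samples(data)
sample_then_feature = normalise_features(sample_only)

# After sample normalisation alone the genes still differ in spread;
# after the additional feature normalisation every gene has spread 1.
print([round(s, 2) for s in gene_stds(sample_only)])
print([round(s, 2) for s in gene_stds(sample_then_feature)])
```

If this reading is right, the second printed list is all ones by construction, so any ranking criterion that benefits from per-gene spread has nothing left to work with. That seems to be what the quoted passage means by filtering out the spread information.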