Hello,

This question concerns SVM-RFE and gene selection from gene expression
data.  I am hoping that someone can shed some light on it.

The data are characterised by few samples (50-100 patterns/examples) and a
large number of dimensions/features (~22000).

I have used SVMs, and one of the data preprocessing steps is to normalise
the inputs in order to avoid a badly conditioned or near-singular Hessian
matrix, which breaks the QP optimiser.  This is especially true
of polynomial kernels.
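
To make the conditioning point concrete, here is a minimal sketch (my own
illustration, not taken from any paper).  It assumes synthetic data standing
in for expression values, a degree-2 polynomial kernel, and NumPy; the helper
name polynomial_kernel is hypothetical.  The Hessian of the SVM dual QP is
Q[i, j] = y_i * y_j * K[i, j], which has the same condition number as the
kernel matrix K, so the sketch compares cond(K) for raw and feature-normalised
inputs.

```python
import numpy as np

def polynomial_kernel(X, degree=2, coef0=1.0):
    """Gram matrix K[i, j] = (x_i . x_j + coef0) ** degree."""
    return (X @ X.T + coef0) ** degree

# Synthetic stand-in for expression data (an assumption, NOT real chip data):
# 60 samples, 200 genes, values around 8 with a gene-specific spread.
rng = np.random.default_rng(0)
gene_spread = rng.uniform(0.1, 3.0, size=200)
X_raw = 8.0 + rng.normal(size=(60, 200)) * gene_spread

# Feature normalisation: each gene (column) to mean 0, standard deviation 1.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Condition number of the kernel matrix with and without normalisation.
cond_raw = np.linalg.cond(polynomial_kernel(X_raw))
cond_scaled = np.linalg.cond(polynomial_kernel(X_scaled))
print(f"condition number, raw inputs:    {cond_raw:.3g}")
print(f"condition number, scaled inputs: {cond_scaled:.3g}")
```

On data like this the raw kernel matrix is dominated by the large common
offset of the inputs, so it should come out far worse conditioned than the
scaled one.  Real chip data may behave differently.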

Now, I came across a paper titled "Gene extraction for cancer diagnosis
by support vector machines - An improvement", by TM Huang and V. Kecman.



This is what the paper says:


"Interestingly, the gene
with higher standard deviation tends to have higher
ranking. This trend suggests that RFE-SVMs with
sample normalization will likely pick up genes with
expression that vary more across the samples. This
fits well with the assumption that a gene is less
relevant if its expression does not vary much across
the complete data set. Such a general trend cannot
be observed in the bottom graph (where a sample
and the feature normalization are applied) and
there is no connection between the standard deviation
of the gene and gene ranking. This phenomenon
may be due to the fact that the feature normalization
step in the second preprocessing procedure
will ensure that each gene has the same standard
deviation. Hence, a gene with higher standard
deviation originally will no longer be advantageous
over a gene having a smaller standard deviation."


"A general practice for producing good results with
SVMs is to normalize each input (feature) to the one
with mean zero and standard deviation of one as in
the feature normalization step. However, in this
case, this simple rule does not perform as well as
expected: the error rate of applying both sample
and feature normalization is higher than when only
the sample normalization is performed. This phenomenon
may be due to the fact that the feature
normalization step in the second preprocessing procedure
filters out the information about the spread
of the expression for each gene as discussed previously
and this information is helpful for selecting
the relevant gene and classification."
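
The two preprocessing procedures described in the quotes can be sketched as
follows.  This is only an illustration under my own assumptions: synthetic
data, NumPy, and the hypothetical helper names normalize_samples and
normalize_features.  It is not the authors' code.  It shows that after sample
normalisation alone the genes keep their different spreads, while the extra
feature normalisation step forces every gene to standard deviation 1.

```python
import numpy as np

def normalize_samples(X):
    """Scale each row (sample/chip) to mean 0 and standard deviation 1."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

def normalize_features(X):
    """Scale each column (gene) to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0, keepdims=True)
    sd = X.std(axis=0, keepdims=True)
    return (X - mu) / sd

# Synthetic stand-in for expression data (an assumption, NOT real chip data):
# rows are samples, columns are genes, each gene has its own spread.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
gene_spread = rng.uniform(0.1, 3.0, size=n_genes)
X = 8.0 + rng.normal(size=(n_samples, n_genes)) * gene_spread

X_s = normalize_samples(X)        # procedure 1: sample normalisation only
X_sf = normalize_features(X_s)    # procedure 2: sample, then feature

std_s = X_s.std(axis=0)           # per-gene spread survives procedure 1
std_sf = X_sf.std(axis=0)         # per-gene spread is flattened to 1
print("per-gene std after sample normalisation: "
      f"min {std_s.min():.3f}, max {std_s.max():.3f}")
print("per-gene std after sample + feature normalisation: "
      f"min {std_sf.min():.3f}, max {std_sf.max():.3f}")
```

Since the RFE ranking is based on the weights of a linear SVM, a gene with a
larger numeric range can contribute more to the decision function, which
would be consistent with the trend the paper reports for the first procedure.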


This has thrown me off a little.
When I didn't normalise the inputs, the accuracy of the SVMs dropped.
This is expected, since the literature generally recommends normalising SVM
inputs.

Can anyone help?  Personally, I don't know what one should do if one
shouldn't normalise the data and therefore ends up with badly conditioned
kernel matrices.

Furthermore, I don't understand why one would normalise samples/patterns
to mean 0 and standard deviation 1.  Gene expression normalisation methods
such as MAS5, GC-RMA etc. are already used to make sure you begin with
normalised chip data.  The paper does not mention this step, or whether the
authors performed it at all.


Thank You.

Sincerely,
Monika Ray

***********************************************************************
The sweetest songs are those that tell of our saddest thought...

Computational Intelligence Centre, Washington University St. Louis, MO
**********************************************************************