[The presence of a large number of
irrelevant features does degrade the performance, especially so if the
kernel is not linear. ]
Yes, this is true for both classification and regression. I should
have mentioned that I would use the linear kernel with all features
to do the feature selection, and then investigate other kernels.
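A very rough sketch of that first pass, in hypothetical scikit-learn-style
Python with made-up data and settings (so not taken from any of the work
mentioned here), would be something like:

# Sketch: train a linear-kernel SVM on all features and rank the features
# by the size of their weights; small |w_p| are candidates for removal
# before trying other kernels.  Data and settings are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

ranking = np.argsort(-np.abs(clf.coef_[0]))   # most important features first
print("top features by |weight|:", ranking[:10])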
The only systematic feature selection methods I've seen are for
classification, based on gradient descent on an error estimate.
The kernel is modified to K(x,z) = sum_p(x_p . z_p / sigma_p) and the
optimal set of sigma_p's found; features whose contribution is driven
to 0 are suggested as irrelevant.
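In modern terms a crude sketch of that idea might look like the following,
where a finite-difference gradient of cross-validated hinge loss stands in
for the analytic gradient of a bound, the toolkit is a hypothetical
scikit-learn-style one, and all settings are invented:

# Sketch of per-feature kernel scaling: K(x,z) = sum_p x_p z_p / sigma_p,
# a weighted linear kernel, with the weights w_p = 1/sigma_p tuned against
# an error estimate (here: 5-fold cross-validated hinge loss).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y01 = make_classification(n_samples=150, n_features=8, n_informative=3,
                             random_state=0)
y = 2 * y01 - 1                              # labels in {-1, +1}

def cv_hinge(log_w):
    """Cross-validated hinge loss with features scaled by sqrt(w_p)."""
    w = np.exp(log_w)                        # w_p plays the role of 1/sigma_p
    Xs = X * np.sqrt(w)                      # equivalent to the weighted kernel
    loss = 0.0
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(Xs):
        f = SVC(kernel="linear", C=1.0).fit(Xs[tr], y[tr]).decision_function(Xs[va])
        loss += np.maximum(0.0, 1.0 - y[va] * f).mean()
    return loss / 5

log_w = np.zeros(X.shape[1])
eps, lr = 1e-2, 1.0
for _ in range(25):                          # plain gradient descent
    grad = np.array([(cv_hinge(log_w + eps * e) - cv_hinge(log_w - eps * e))
                     / (2 * eps) for e in np.eye(len(log_w))])
    log_w -= lr * grad

# Weights driven towards 0 (sigma_p -> infinity) flag candidate irrelevant features.
print("learned feature weights:", np.round(np.exp(log_w), 3))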
I am not aware of similar work for the regression case.
One way to avoid overfitting with a lot of features is to set C to be
very small. This will also result in an oversmoothed solution fairly
rapidly, but each such fit is quick, and thus permits a number of SVMs
to be trained during the model/feature selection stage.
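Illustratively (again hypothetical scikit-learn-style code with invented
data), the oversmoothing shows up as a training fit that degrades as C
shrinks:

# Sketch: with C very small the SVM is heavily regularised, so a model
# trained on many (mostly irrelevant) features is oversmoothed rather
# than overfit.  Purely illustrative settings.
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

for C in (1e-3, 1e-1, 10.0):
    model = SVR(kernel="rbf", C=C).fit(X, y)
    print("C =", C, "  training R^2 =", round(model.score(X, y), 3))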
Lastly, one could follow Mangasarian's classification work and use
a linear programming SVM to directly minimize the number of nonzero
weights.
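A sketch in that spirit, with an l1-penalised linear SVM standing in for
the actual linear-programming formulation (scikit-learn-style, illustrative
only):

# Penalising the 1-norm of the weight vector drives many weights exactly
# to zero, which doubles as feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1,
                max_iter=10000).fit(X, y)

kept = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-8)
print("nonzero weights:", len(kept), "out of", X.shape[1])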
To come back to the original question: how many features are sensible?
There is no general answer, but it is likely to be more than other
techniques (other than Bayesian treatments) can sensibly cope with.
One could use PCA (or kernel PCA) to estimate the dimensionality of the
data in the (kernel-induced) feature space and take that as a guide,
but then one is still faced with the combinatorial feature selection
problem.
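For what it's worth, a sketch of that estimate, taking an arbitrary 95%
variance cut-off in a hypothetical scikit-learn-style kernel PCA as "the
dimensionality":

# Guess the effective dimensionality in the kernel-induced feature space
# from the kernel PCA eigenvalue spectrum.  Threshold and settings are
# arbitrary, for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA

X, _ = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

kpca = KernelPCA(kernel="rbf", gamma=0.1).fit(X)
lam = kpca.eigenvalues_          # eigenvalues of the centred kernel matrix
frac = np.cumsum(lam) / lam.sum()
print("components for 95% of kernel variance:",
      int(np.searchsorted(frac, 0.95)) + 1)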
Rgds
Robert