I have a method which, although a KDD method in its own right, can
also be used to create a nonlinear mapping into a feature space that
is custom-fitted to the data. An SVM can then be used in that space.
If anyone wants it I can send the pdf file. If there are too many
requests I will put it on my site.
"Burbidge, Robert" wrote:
>
> [The presence of a large number of
> irrelevant features does degrade the performance, especially so if the
> kernel is not linear. ]
>
> Yes, this is true for both classification and regression. I should
> have mentioned that I would use the linear kernel with all features
> to do the feature selection, and then investigate other kernels.
>
> The only systematic feature selection methods I've seen are for
> classification, based on gradient descent on an error estimate.
> The kernel is modified to K(x, z) = sum_p(x_p . z_p / sigma_p) and the
> optimal set of sigma_p's is found. Those going to 0 suggest irrelevant
> features.
> I am not aware of similar work for the regression case.
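
For what it's worth, here is a rough sketch of that scaled-kernel idea in
Python with scikit-learn: toy data, and a crude greedy search standing in
for gradient descent on an error bound. I have written the scales
multiplicatively, so a scale near 0 effectively drops a feature.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy data: 2 informative features, 8 pure-noise features.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def scaled_linear_kernel(scales):
    # K(x, z) = sum_p scales[p] * x_p * z_p; a scale near 0 removes feature p.
    def k(A, B):
        return (A * scales) @ B.T
    return k

# Crude greedy search over the scales (a stand-in for gradient descent on
# an error bound): zero out a feature if cross-validation does not suffer.
scales = np.ones(10)
for p in range(10):
    trial = scales.copy()
    trial[p] = 0.0
    keep = cross_val_score(SVC(kernel=scaled_linear_kernel(scales)), X, y, cv=3).mean()
    drop = cross_val_score(SVC(kernel=scaled_linear_kernel(trial)), X, y, cv=3).mean()
    if drop >= keep:
        scales = trial
print("surviving features:", np.flatnonzero(scales))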
>
> One way to avoid overfitting with a lot of features is to set C to be
> very small. The solution will be oversmoothed, but it is reached fairly
> rapidly, which makes it practical to train a large number of SVMs
> during the model/feature selection stage.
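
A minimal sketch of that screening use of a small C, with scikit-learn and
made-up regression data (the C value is arbitrary, not a recommendation):

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(1)
X = rng.randn(300, 50)                    # mostly irrelevant features
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.randn(300)

# Heavy regularisation: oversmoothed, but quick to fit, so many such
# models can be afforded while screening features.
screener = SVR(kernel="linear", C=0.01)
screener.fit(X, y)

# Rank features by |weight| from the screening model, keep the top few,
# then refit more carefully (larger C, other kernels) on that subset.
ranking = np.argsort(-np.abs(screener.coef_).ravel())
print("top candidate features:", ranking[:5])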
>
> Lastly, one could follow Mangasarian's classification work and use
> a linear programming SVM to directly minimize the number of nonzero
> weights.
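
This is not Mangasarian's LP formulation itself, but an L1-penalised linear
SVM gives the same kind of sparsity in the weight vector; a quick sketch on
toy data:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(2)
X = rng.randn(400, 30)
y = np.sign(X[:, 0] - X[:, 5] + 0.5 * rng.randn(400))

# The L1 penalty drives many weights to exactly zero, so the nonzero
# weights double as a feature selection.
sparse_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)
sparse_svm.fit(X, y)
kept = np.flatnonzero(np.abs(sparse_svm.coef_).ravel() > 1e-8)
print("features with nonzero weight:", kept)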
>
> To come back to the original question: how many features are sensible?
> There's no definitive answer, but it is likely to be more than other
> techniques (other than Bayesian treatments) can sensibly cope with.
> One could use PCA (or kernel PCA) to estimate the dimensionality of the
> data in the (kernel-induced) feature space and take that as a guide,
> but then one is still faced with the combinatorial feature selection
> problem.
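
A quick sketch of the kernel PCA dimensionality check (toy data; the kernel
and gamma are placeholders and would need tuning on real data):

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(3)
X = rng.randn(200, 20)

# Look at how quickly the kernel PCA eigenvalue spectrum decays to get a
# feel for the effective dimensionality in the induced feature space.
kpca = KernelPCA(kernel="rbf", gamma=0.1, n_components=20)
kpca.fit(X)
eigvals = kpca.eigenvalues_               # kpca.lambdas_ on older sklearn
explained = np.cumsum(eigvals) / eigvals.sum()
print("components for ~95% of the retained spectrum:",
      int(np.searchsorted(explained, 0.95)) + 1)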
>
> Rgds
>
> Robert
--
Regards,
Mark
Computer Science