Steven Barrett wrote:

 
Hello Monika,

1. ...our experience has been that the #SVs depends upon the complexity of the problem, how it is represented in the features used in training (and sometimes on how you select the training examples themselves), as well as on the quality of the SVM's training parameterisation. We have had non-sparse solutions with SV classifiers in QSAR, and approaches have been devised to counter this (see the work of Burbidge in the literature/on the web). R-SVMs (Reduced SVMs) do allow post-processing/re-training to reduce the #SVs - the user can specify how many SVs are required, or the degree of reduction - but be careful, as your solution quality will begin to diminish!

[ One thing to note is that SVM models can - and often do! - contain SVs that are themselves errors, i.e. noisy or mislabelled training points. ]
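
As a rough illustration of that parameter dependence (a minimal sketch using scikit-learn's SVC on synthetic data - the dataset, the label-noise level and the C values are just placeholder assumptions, not a recipe):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Two-class toy problem with a little label noise, so some "error"
    # points will inevitably end up as support vectors.
    X, y = make_classification(n_samples=500, n_features=20,
                               flip_y=0.05, random_state=0)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
        # n_support_ holds the SV count per class; weak regularisation
        # (small C) usually leaves far more points inside the margin.
        print("C =", C, " #SVs =", int(clf.n_support_.sum()))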

2. My understanding is that SVMs model only the decision surface, so only the points lying on or within the margin around it (including any misclassified points) become part of the solution.
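
A quick way to convince yourself of this (a hedged sketch with scikit-learn; the |decision value| <= 1 test comes from the standard soft-margin formulation, and the blob data is just an assumption):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Points with |w.x + b| <= 1 lie on or inside the margin; the solver's
    # support vectors should be exactly these (up to numerical tolerance).
    f = np.abs(clf.decision_function(X))
    is_sv = np.zeros(len(X), dtype=bool)
    is_sv[clf.support_] = True
    print("all SVs on/inside margin:   ", bool(np.all(f[is_sv] <= 1 + 1e-3)))
    print("all non-SVs outside margin: ", bool(np.all(f[~is_sv] >= 1 - 1e-3)))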

3. If you have 10000 binary features you might only need a linear (no kernel) SVM. But you are correct - appropriate kernel selection is the key to getting the best out of SVMs. We have had good results with radial basis function (RBF) kernels.
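
If you want to check this yourself, the comparison is cheap to run (sketch below with scikit-learn; the sparse random binary matrix and the way the labels are generated are purely illustrative assumptions, so real accuracies will look different on your data):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC, LinearSVC

    # 1000 examples with 10000 sparse binary (presence/absence) features.
    rng = np.random.RandomState(0)
    X = sparse_random(1000, 10000, density=0.01, format="csr", random_state=rng)
    X.data[:] = 1.0
    # Toy labels driven by the first 50 features only.
    signal = np.asarray(X[:, :50].sum(axis=1)).ravel()
    y = (signal > signal.mean()).astype(int)

    linear = LinearSVC(C=1.0, max_iter=5000)   # no kernel
    rbf = SVC(kernel="rbf", gamma="scale")     # RBF kernel
    print("linear:", cross_val_score(linear, X, y, cv=3).mean())
    print("rbf:   ", cross_val_score(rbf, X, y, cv=3).mean())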

Regards,

Steven

Steven J. Barrett, Ph.D.
 
 
 
"Monika Ray" <[log in to unmask]>
Sent by: "The Support Vector Machine discussion list" <[log in to unmask]>

16-Feb-2005 02:16
Please respond to "The Support Vector Machine discussion list" <[log in to unmask]>

To
[log in to unmask]
cc
Subject
questions

 

Hello,

1. What does the number of support vectors depend on? Or is it random -
just the data close to the hyperplane?

2. I don't understand why not all the data points near the optimal
hyperplane are support vectors.

3. If you have, say, 10000 features, one has to use a non-linear kernel. One
would need to use a polynomial kernel, because only then do you get the
combinations of the different features. However, because you have such
a large number of features, using a polynomial kernel will map the data
into a feature space in which the points will be very far apart. Then
separating the 2 classes will be a very trivial thing, as you can have many
hyperplanes with a large margin because the data is so sparse. I understand that
some may call this overfitting... but with what other kind of kernel can you
get the feature combinations as you do with a polynomial
kernel? This is a kind of a priori knowledge you have about the problem.
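
(For concreteness, the "combinations" a polynomial kernel buys you can be checked numerically - a small sketch in NumPy, assuming a homogeneous degree-2 kernel and a hand-written explicit feature map, both of which are just illustrative choices:)

    import numpy as np

    def phi(x):
        # Explicit degree-2 feature map: all ordered products x_i * x_j.
        return np.outer(x, x).ravel()

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=5), rng.normal(size=5)

    # The homogeneous polynomial kernel (x.z)^2 equals the dot product of the
    # explicit feature maps, i.e. it implicitly uses all pairwise feature products.
    print(np.dot(x, z) ** 2)
    print(np.dot(phi(x), phi(z)))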

Sincerely,
Monika Ray

***********************************************************************
The sweetest songs are those that tell of our saddest thought...

Computational Intelligence Centre, Washington University, St. Louis, MO
**********************************************************************