Hi all,
here are just a very few comments:
to answers 1,2:
the fraction of the training samples that are support vectors
tends to TWICE the minimum achievable error rate for an optimal
classifier. And this only holds if you use e.g. a Gaussian kernel
and simply sum up the slacks. Otherwise you should, in general,
expect even more support vectors.
to answer 3.c:
It is well known that there exists no best classifier, and consequently,
neither SVMs nor any other method can be a best classifier.
See the book of Devroye, Gyoerfi and Lugosi.
ingo
Balaji Krishnapuram wrote:
>Without a figure it becomes a little difficult to explain these
>questions, but for the sake of an email discussion I'll try to make the
>picture in words. It might be useful to follow this discussion with the
>aid of a picture drawn out and labeled on paper. Further, the discussion
>pertains to linear hyperplane classifiers, except where mentioned at the
>end as related to nonlinear classifiers.
>
>Imagine three *parallel* hyperplanes and let us call them a,b and c
>(these are labeled *in sequence* as you move from infinity towards
>these planes, hit them and then move to +infinity). Further imagine that
>the shortest (ie perpendicular) distance between hyperplanes a and b is
>the same as the distance between hyperplanes b and c.
>
>First consider hard margin SVMs:
>In a hard margin SVM [ie one without any slack variables] we *want* all
>the negative samples to lie on one side, between infinity and a, and
>further we want all positive samples to lie on the opposite side,
>between c and +infinity. Note that though we would like this, it may
>never be possible to separate the positives and negatives, so we may not
>be able to achieve this at all. Nevertheless, if it is possible to find
>such hyperplanes that separate the samples, which one is best? Avoiding
>the theoretical arguments, and in a handwaving sense, the basic
>intuition is this: that hyperplane b is the best classifier, which has
>the maximum margin between its parallel hyperplanes a and c, and still
>manages to separate the positives and negatives so they lie on opposite
>sides of c,a respectively. In this case, the support vectors are simply
>the samples which lie on the hyperplanes a or c respectively(these are
>the training samples that are closest to the class boundary, ie
>hyperplane b.
>
>Now, in general, one may never be able to find any hyperplane capable of
>completely separating the positives and negatives so that they lie on
>opposite sides. In this case we need a soft margin SVM. The mental basic
>picture we drew above still remains valid, except that some training
>samples are allowed to be training errors *in the following sense*. A
>negative sample which lies between b and +infinity is obviously
>incorrectly classified, and should be considered a training error.
>However, there is another class of sample, which is also considered an
>error: this is the set of samples between hyperplanes a and b. Similarly
>for positive training samples any sample which lies between infinity
>and c is an error in the sense mentioned above. The soft margin SVM
>algorithm finds the best tradeoff between reducing the training error in
>this sense and maximizing the margin (which if you pause to think about
>it are tugging in opposite directions so to speak). Besides the samples
>lieing on the margin hyperplanes a,c, the types of training error
>samples mentioned above also become support vectors in a soft margin SVM
>(ie samples that are in the margin as well as the truly misclassified
>samples).
>
>Now to answer question 1,2:
>Some theoretical and practical analysis has been advanced in the past to
>show that, in the limit as the number of training samples tends to
>infinity, the fraction of the training samples that are support vectors
>tends to the minimum achievable error rate for an optimal hyperplane
>classifier.
>
>>From the above discussion you should be able to get an idea which
>samples become support vectors and why. For nonlinear classifiers this
>whole discussion happens in the high dimensional space induced by the
>kernel, but otherwise everthing else is the same. To specifically answer
>your question, a sample is a support vector if (1) it is a training
>error in the sense of misclassification, or (2) it either lies on the
>margin hyperplanes, or inside the margin, even though it is not an
>obviously misclassified sample.
>
>Now for a random draw of a small number of training samples from an
>underlying data generating distribution, the number of support vectors
>is not a fixed deterministic number that is not dependent on the
>training set. Indeed, because of the randomness in the draw of the
>training set, the number of support vectors will be random. However,
>given a fixed training set, (and keeping other kernel parameters, and
>the tradeoff parameter C fixed also) the number of support vectors is fixed.
>
>3.
>a. Just because you know that there are 10K features, there is no reason
>to presuppose that either a linear or a nonlinear kernel is apriori
>better in a problem, *before you see training data*. However, given the
>training data consisting of (feature vector, label) pairs, there are
>many ways to decide which is the best kernel parameter, best
>regularization (ie tradeoff) parameter etc. For example, the simplest,
>albeit less theoretically well motivated approach is to simply compute
>the crossvalidation estimate of the error for each setting of these
>parameters, and to choose the best setting based on these CV error
>estimates. Another, approach in the learning theory world is to use some
>(*very loose*, but in some tenuous sense the community claims as
>reasonable) learning theoretical upper bounds on the error rate and to
>minimize these bounds as a function of the parameters that we want to
>tune. A third, more Bayesian approach is to use approximate evidence (ie
>Bayesian marginal) computations for the same purpose. Nevertheless in
>many smallnlargep problems (just the standard name employed in the
>stats community for classification problems where the number of training
>samples is much less than the number of features) occuring in the
>natural world, it has commonly been empirically observed that the simple
>linear kernel worked best. This is purely empirical evidence on some
>datasets, and *not necessarily true* of any specific data set you may be
>interested in analyzing. For any analysis you may undertake, I strongly
>suggest that you choose parameters using the methods mentioned above.
>
>b. the fact that the training data can be separated by a large margin
>does *not* mean you have evidence of overfitting. Indeed, the whole
>rationale for the SVM is based on this very observation. The whole
>learning theoretical premise based on which the SVM algorithm was
>developed, rests on their claims that (so long as all the samples lie in
>some unit hypersphere), the larger the margin separating the positive
>training samples from the negative samples, the better the guarantees
>they can provide that your test set error rate will not be very high.
>The Vapnik school of thought would say that the fact that you have found
>a large margin classifier on your training set is no reason to suspect
>overfitting, and in fact they would argue that it is reason to suspect
>excellent generalization (ie no overfitting). For more explanation of
>the learning theory see for instance Vapnik's book of 1998
>
>c. There is no evidence, either empirical or even learning theoretical,
>to claim that the SVM is the best classifier. It is also false to claim
>that the SVM is not affected by the curse of dimensionality: both
>empirical evidence and the learning theoretical arguments used to
>advance the SVM in the first place deny this. Specifically, in the
>radius/margin based upper bounds for error rate of hyperplane
>classifiers, *if the radius of the smallest hypersphere enclosing all
>the samples that could be generated were a constant* then we are
>justified in maximizing the margin in a learning algorithm. However,
>between different kernels that result from retaining different numbers
>of features, the radius value changes, so just maximizing the margin
>retaining all the original features is not guaranteed to optimize even
>the radius/margin bounds for test set error rate.
>
>The claim that the SVM is immune to the curse of dimensionality has long
>been dismissed by serious machine learning researchers. It was
>originally advanced in light of the fact that, *as compared to
>algorithms (like backpropagation for training neural networks) that only
>attempt to minimize the training error, without using any form of
>regularization*, the SVM algorithm was empirically found to be much more
>immune to the effect of irrelevant predictor variables. However, keeping
>the training sample size fixed, as the number of irrelevant predictor
>variables (ie features drawn at random in a way uncorrelated to the
>classification label) is increased to infinity, by sheer random chance,
>some of the purely random features will turn out to be well correlated
>to the output class label, thus you will eventually start to overfit as
>the number of these variables increases and starts to dominate your
>truly relevant variables. From this argument, it is clear that
>absolutely no system under the sun can ever be completelyunaffected by
>having an arbitrary number of irrelevant features. SVMs dont even come
>close to this long proclaimed (and also long disproved) holy grail of
>the early nontechnical cheerleaders.
>
>Disclaimer:
>I fundamentally disagree with much of the learning theory behind the
>argument for minimizing upper bounds on error rates (which is sometimes
>held as the theoretical core of the argument for SVMS) and I strongly
>hold that their result is a logical fallacy (see Radford Neal's comments
>at his invited talk at NIPS for a brief explanation or I can post
>another reply about this if required). Though I have written a
>handwaving explanation above in a deliberate attempt to retain clarity
>of explanation, I'm fairly sure that I understand all their logic,
>derivations and claims.
>
>Knowing that this whole area borders on a very old and counterproductive
>holywar between Bayesians and frequentists, I have tried for the most
>part to be respectful and noncontroversial in my explanation, and even
>tried to answer things from the learning theoretical community's point
>of view (since this is after all an SVM group). If despite my best
>efforts, I have given offense to anyone I apologise in advance; let us
>not start a pointless flame war.
>
>Monika Ray wrote:
>
>
>
>>Hello,
>>
>>1. What does the number of support vectors depend on? or is it random 
>>just the data close to the hyperplane?
>>
>>2. I don't understand why not all the datapoints near the optimal
>>hyperplane are not support vectors.
>>
>>3. If you have, say 10000 features, one has to use a nonliner kernel.
>>One
>>would need to use a polynomial kernel because only then you get the
>>combinations of the different datapoints. However, because you have such
>>a large number of features, usign a polynomial kernel will map the data
>>into a feature space in which the points will be very far apart. Then
>>separating the 2 classes will be a very trivial thing as you can have
>>
>>
>many
>
>
>>hyperplanes with large margin as the data is so sparse. I udnerstadn
>>
>>
>that
>
>
>>some may call this overfitting..but in what other kind of kernel can you
>>get the combination of datapoints as you do with a polynomial
>>kernel...this is a kind of a priori knowledge you have about the problem.
>>
>>Sincerely,
>>Monika Ray
>>
>>
>
>
>
>>Quoting Monika Ray <[log in to unmask]>:
>>
>>
>>
>>>Hello,
>>>
>>>SVM is known to defeat the curse of dimensionality...then why is it so
>>>that having a large number of attributes/features in the data is not
>>>desirable when using SVM?
>>>
>>>I thought it should not matter....where am I getting confused?
>>>
>>>Thank You.
>>>
>>>Sincerely,
>>>Monika Ray
>>>
>>>**********************************************************************
>>>*
>>>The sweetest songs are those that tell of our saddest thought...
>>>
>>>Computational Intelligence Centre, Washington University St. louis, MO
>>>**********************************************************************
>>>
>>>
>>>
>>
>
>
>
