Without a figure it becomes a little difficult to explain these
questions, but for the sake of an email discussion I'll try to paint the
picture in words. It might be useful to follow this discussion with the
aid of a picture drawn out and labeled on paper. Further, the discussion
pertains to linear hyperplane classifiers, except where nonlinear
classifiers are mentioned towards the end.

Imagine three *parallel* hyperplanes and let us call them a, b and c
(these are labeled *in sequence* as you move from -infinity towards
these planes, hit them and then move to +infinity). Further imagine that
the shortest (ie perpendicular) distance between hyperplanes a and b is
the same as the distance between hyperplanes b and c.

First consider hard margin SVMs:
In a hard margin SVM [ie one without any slack variables] we *want* all
the negative samples to lie on one side, between -infinity and a, and
further we want all positive samples to lie on the opposite side,
between c and +infinity. Note that though we would like this, it may
never be possible to separate the positives and negatives, so we may not
be able to achieve this at all. Nevertheless, if it is possible to find
such hyperplanes that separate the samples, which one is best? Avoiding
the theoretical arguments, and in a handwaving sense, the basic
intuition is this: the best classifier is the hyperplane b which has the
maximum margin between its parallel hyperplanes a and c and still
manages to separate the positives and negatives so that they lie on
opposite sides of c and a respectively. In this case, the support
vectors are simply the samples which lie on hyperplane a or c (these are
the training samples that are closest to the class boundary, ie
hyperplane b).
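
To make the geometry concrete, here is a small sketch of my own (not
part of the original question) using scikit-learn on a synthetic,
well-separated data set; a very large C is used as a stand-in for a
hard margin:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Two well separated blobs, so a separating hyperplane exists.
    X, y = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                      cluster_std=0.8, random_state=0)

    # A very large C approximates a hard margin SVM (no slack tolerated).
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    # The support vectors are the samples sitting on hyperplanes a and c,
    # where the decision function takes the values -1 and +1.
    print(clf.decision_function(clf.support_vectors_))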

Now, in general, one may never be able to find any hyperplane capable of
completely separating the positives and negatives so that they lie on
opposite sides. In this case we need a soft margin SVM. The basic
mental picture we drew above still remains valid, except that some
training samples are allowed to be training errors *in the following
sense*. A
negative sample which lies between b and +infinity is obviously
incorrectly classified, and should be considered a training error.
However, there is another class of sample, which is also considered an
error: this is the set of samples between hyperplanes a and b.
Similarly, for positive training samples, any sample which lies between
-infinity and c is an error in the sense mentioned above. The soft
margin SVM
algorithm finds the best tradeoff between reducing the training error in
this sense and maximizing the margin (which, if you pause to think
about it, are tugging in opposite directions so to speak). Besides the
samples lying on the margin hyperplanes a and c, the types of training
error samples mentioned above also become support vectors in a soft
margin SVM (ie samples that are inside the margin as well as the truly
misclassified samples).
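
Again purely as an illustration of mine (synthetic data standing in for
a real problem), the tradeoff can be seen by refitting a soft margin
SVM on overlapping classes with different values of C; a smaller C buys
a wider margin at the cost of more margin violations, and the support
vector count grows accordingly:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Overlapping classes: no hyperplane separates them perfectly.
    X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                      cluster_std=1.5, random_state=0)

    # Smaller C -> wider margin -> more samples on or inside the margin,
    # hence more support vectors.
    for C in (100.0, 1.0, 0.01):
        clf = SVC(kernel='linear', C=C).fit(X, y)
        print(C, clf.n_support_)   # support vector count per class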

Now to answer questions 1 and 2:
Some theoretical and practical analysis has been advanced in the past to
show that, in the limit as the number of training samples tends to
infinity, the fraction of the training samples that are support vectors
tends to the minimum achievable error rate for an optimal hyperplane
classifier.

From the above discussion you should be able to get an idea of which
samples become support vectors and why. For nonlinear classifiers this
whole discussion happens in the high dimensional space induced by the
kernel, but otherwise everything else is the same. To specifically
answer your question, a sample is a support vector if (1) it is a
training error in the sense of misclassification, or (2) it lies on one
of the margin hyperplanes, or inside the margin, even though it is not
an obviously misclassified sample.
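
A quick way to check this on a fitted linear SVM (again just a sketch
of mine on synthetic data, with the usual caveat that a sample lying
exactly on a margin hyperplane can, in degenerate cases, receive zero
weight and so not be reported by the solver):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                      cluster_std=1.5, random_state=0)
    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    # A sample is (essentially) a support vector when y_i * f(x_i) <= 1,
    # ie it lies on a margin hyperplane, inside the margin, or is misclassified.
    y_pm = 2 * y - 1                          # recode labels as -1/+1
    f = clf.decision_function(X)
    margin_or_worse = np.sort(np.where(y_pm * f <= 1 + 1e-6)[0])

    print(np.sort(clf.support_))              # indices from the solver
    print(margin_or_worse)                    # should (nearly) coincide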

Now for a random draw of a small number of training samples from an
underlying data generating distribution, the number of support vectors
is not a fixed deterministic number independent of the training set.
Indeed, because of the randomness in the draw of the training set, the
number of support vectors will be random. However,
given a fixed training set, (and keeping other kernel parameters, and
the tradeoff parameter C fixed also) the number of support vectors is fixed.

3.
a. Just because you know that there are 10K features, there is no reason
to presuppose that either a linear or a nonlinear kernel is a priori
better in a problem, *before you see training data*. However, given the
training data consisting of (feature vector, label) pairs, there are
many ways to decide which is the best kernel parameter, best
regularization (ie tradeoff) parameter etc. For example, the simplest,
albeit less theoretically well motivated approach is to simply compute
the cross-validation estimate of the error for each setting of these
parameters, and to choose the best setting based on these CV error
estimates. Another approach, from the learning theory world, is to use
some (*very loose*, but in some tenuous sense claimed by the community
to be reasonable) learning theoretical upper bounds on the error rate
and to
minimize these bounds as a function of the parameters that we want to
tune. A third, more Bayesian approach is to use approximate evidence (ie
Bayesian marginal) computations for the same purpose. Nevertheless in
many small-n-large-p problems (just the standard name employed in the
stats community for classification problems where the number of training
samples is much less than the number of features) occurring in the
natural world, it has commonly been empirically observed that the simple
linear kernel worked best. This is purely empirical evidence on some
datasets, and *not necessarily true* of any specific data set you may be
interested in analyzing. For any analysis you may undertake, I strongly
suggest that you choose parameters using the methods mentioned above.
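
As a concrete, purely illustrative sketch of the cross-validation
route (the synthetic data set here simply stands in for your own
(feature vector, label) pairs):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Synthetic stand-in: 100 samples, 200 features, few of them informative.
    X, y = make_classification(n_samples=100, n_features=200,
                               n_informative=10, random_state=0)

    # Compare a linear kernel against an RBF kernel over a grid of C
    # (and gamma), scoring each setting by 5-fold cross-validated accuracy.
    param_grid = [
        {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10, 100]},
        {'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10, 100],
         'gamma': [1e-3, 1e-2, 1e-1]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)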

b. The fact that the training data can be separated by a large margin
does *not* mean you have evidence of overfitting. Indeed, the whole
rationale for the SVM is based on this very observation. The whole
learning theoretical premise based on which the SVM algorithm was
developed, rests on their claims that (so long as all the samples lie in
some unit hypersphere), the larger the margin separating the positive
training samples from the negative samples, the better the guarantees
they can provide that your test set error rate will not be very high.
The Vapnik school of thought would say that the fact that you have found
a large margin classifier on your training set is no reason to suspect
overfitting, and in fact they would argue that it is reason to suspect
excellent generalization (ie no overfitting). For more explanation of
the learning theory see, for instance, Vapnik's 1998 book.
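
For reference, the quantity in question, the margin of a trained linear
SVM (the perpendicular distance from b to a or c in the picture above),
can be read off a fitted model as 1/||w||; a small sketch of mine on
toy data:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=[[-2, -2], [2, 2]],
                      cluster_std=0.8, random_state=0)
    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    # For a linear SVM the margin (b to a, or b to c) is 1 / ||w||.
    w = clf.coef_.ravel()
    print('margin =', 1.0 / np.linalg.norm(w))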

c. There is no evidence, either empirical or even learning theoretical,
to claim that the SVM is the best classifier. It is also false to claim
that the SVM is not affected by the curse of dimensionality: both
empirical evidence and the learning theoretical arguments used to
advance the SVM in the first place deny this. Specifically, in the
radius/margin based upper bounds for error rate of hyperplane
classifiers, *if the radius of the smallest hypersphere enclosing all
the samples that could be generated were a constant*, then we would be
justified in simply maximizing the margin in a learning algorithm.
However,
between different kernels that result from retaining different numbers
of features, the radius value changes, so just maximizing the margin
retaining all the original features is not guaranteed to optimize even
the radius/margin bounds for test set error rate.
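
To illustrate the point (purely my own sketch, and with the smallest
enclosing hypersphere crudely approximated by the largest distance from
the data centroid rather than computed exactly), one can compare the
radius/margin quantity R^2 * ||w||^2 for two different feature subsets:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=50,
                               n_informative=5, random_state=0)

    def radius_margin(Xsub, y):
        # Margin of the fitted linear SVM is 1 / ||w||.
        clf = SVC(kernel='linear', C=10.0).fit(Xsub, y)
        w = clf.coef_.ravel()
        # Crude radius estimate: furthest sample from the centroid
        # (a stand-in for the exact smallest enclosing hypersphere).
        R = np.max(np.linalg.norm(Xsub - Xsub.mean(axis=0), axis=1))
        return (R ** 2) * (np.linalg.norm(w) ** 2)

    print(radius_margin(X, y))           # all 50 features retained
    print(radius_margin(X[:, :10], y))   # only the first 10 features retained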

The claim that the SVM is immune to the curse of dimensionality has long
been dismissed by serious machine learning researchers. It was
originally advanced in light of the fact that, *as compared to
algorithms (like backpropagation for training neural networks) that only
attempt to minimize the training error, without using any form of
regularization*, the SVM algorithm was empirically found to be much more
immune to the effect of irrelevant predictor variables. However, keeping
the training sample size fixed, as the number of irrelevant predictor
variables (ie features drawn at random in a way uncorrelated to the
classification label) is increased to infinity, by sheer random chance,
some of the purely random features will turn out to be well correlated
with the output class label, and thus you will eventually start to
overfit as the number of these variables increases and starts to
dominate your truly relevant variables. From this argument, it is clear
that absolutely no system under the sun can ever be completely
unaffected by having an arbitrary number of irrelevant features. SVMs
don't even come close to this long proclaimed (and also long disproved)
holy grail of the early non-technical cheerleaders.
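
The effect is easy to reproduce empirically. The following sketch
(mine, on synthetic data) keeps the training set size fixed and pads
the feature vectors with ever more purely random features; the
cross-validated accuracy of a regularized linear SVM eventually
deteriorates:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # A small training set with 5 genuinely informative features.
    X, y = make_classification(n_samples=60, n_features=5, n_informative=5,
                               n_redundant=0, random_state=0)

    # Pad with increasing numbers of features uncorrelated with the label.
    for n_noise in (0, 100, 1000, 10000):
        Xn = np.hstack([X, rng.randn(X.shape[0], n_noise)]) if n_noise else X
        acc = cross_val_score(SVC(kernel='linear', C=1.0), Xn, y, cv=5).mean()
        print(n_noise, round(acc, 3))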

Disclaimer:
I fundamentally disagree with much of the learning theory behind the
argument for minimizing upper bounds on error rates (which is sometimes
held as the theoretical core of the argument for SVMs), and I strongly
hold that their result is a logical fallacy (see Radford Neal's comments
at his invited NIPS talk for a brief explanation, or I can post another
reply about this if required). Though I have written a
hand-waving explanation above in a deliberate attempt to retain clarity
of explanation, I'm fairly sure that I understand all their logic,
derivations and claims.

Knowing that this whole area borders on a very old and counterproductive
holy-war between Bayesians and frequentists, I have tried for the most
part to be respectful and non-controversial in my explanation, and even
tried to answer things from the learning theoretical community's point
of view (since this is after all an SVM group). If despite my best
efforts, I have given offense to anyone I apologise in advance; let us
not start a pointless flame war.

Monika Ray wrote:

> Hello,
>
> 1. What does the number of support vectors depend on? or is it random -
> just the data close to the hyperplane?
>
> 2. I don't understand why not all the datapoints near the optimal
> hyperplane are support vectors.
>
> 3. If you have, say 10000 features, one has to use a non-linear kernel.
> One would need to use a polynomial kernel because only then you get the
> combinations of the different datapoints.  However, because you have such
> a large number of features, using a polynomial kernel will map the data
> into a feature space in which the points will be very far apart.  Then
> separating the 2 classes will be a very trivial thing as you can have
> many hyperplanes with large margin as the data is so sparse.  I
> understand that some may call this overfitting..but in what other kind
> of kernel can you get the combination of datapoints as you do with a
> polynomial kernel...this is a kind of a priori knowledge you have about
> the problem.
>
> Sincerely,
> Monika Ray

> Quoting Monika Ray <[log in to unmask]>:
>
>> Hello,
>>
>> SVM is known to defeat the curse of dimensionality...then why is it so
>> that having a large number of attributes/features in the data is not
>> desirable when using SVM?
>>
>> I thought it should not matter....where am I getting confused?
>>
>> Thank You.
>>
>> Sincerely,
>> Monika Ray
>>
>> **********************************************************************
>> *
>> The sweetest songs are those that tell of our saddest thought...
>>
>> Computational Intelligence Centre, Washington University St. louis, MO
>> **********************************************************************
>>
>