Email discussion lists for the UK Education and Research communities

## SUPPORT-VECTOR-MACHINES@JISCMAIL.AC.UK

Subject: Re: svm and curse of dimensionality
From:
Date: Thu, 17 Feb 2005 14:51:31 -0700
Content-Type: text/plain
Parts/Attachments: text/plain (259 lines)

Hi all, here are just a very few comments:

To answers 1, 2: the fraction of the training samples that are support vectors tends to TWICE the minimum achievable error rate for an optimal classifier. And this only holds if you use e.g. a Gaussian kernel and simply sum up the slacks. Otherwise you should, in general, expect even more support vectors.

To answer 3.c: it is well known that there exists no best classifier, and consequently, neither SVMs nor any other method can be a best classifier. See the book by Devroye, Gyoerfi and Lugosi.

ingo

Balaji Krishnapuram wrote:

> Without a figure it becomes a little difficult to explain these questions, but for the sake of an email discussion I'll try to paint the picture in words. It might be useful to follow this discussion with the aid of a picture drawn out and labelled on paper. Further, the discussion pertains to linear hyperplane classifiers, except where nonlinear classifiers are mentioned at the end.
>
> Imagine three *parallel* hyperplanes and let us call them a, b and c (labelled *in sequence* as you move from -infinity towards these planes, hit them, and then move on towards +infinity). Further, imagine that the shortest (i.e. perpendicular) distance between hyperplanes a and b is the same as the distance between hyperplanes b and c.
>
> First consider hard-margin SVMs. In a hard-margin SVM [i.e. one without any slack variables] we *want* all the negative samples to lie on one side, between -infinity and a, and all the positive samples to lie on the opposite side, between c and +infinity. Note that although we would like this, it may never be possible to separate the positives and negatives, so we may not be able to achieve it at all. Nevertheless, if it is possible to find such separating hyperplanes, which one is best? Avoiding the theoretical arguments, and in a hand-waving sense, the basic intuition is this: the best classifier is the hyperplane b that has the maximum margin between its parallel hyperplanes a and c and still manages to separate the positives and negatives so that they lie on opposite sides of c and a respectively. In this case, the support vectors are simply the samples which lie on the hyperplanes a or c (these are the training samples that are closest to the class boundary, i.e. hyperplane b).
>
> Now, in general, one may never be able to find any hyperplane capable of completely separating the positives and negatives so that they lie on opposite sides. In this case we need a soft-margin SVM. The basic mental picture we drew above still remains valid, except that some training samples are allowed to be training errors *in the following sense*. A negative sample which lies between b and +infinity is obviously incorrectly classified, and should be considered a training error. However, there is another class of sample which is also considered an error: the set of samples between hyperplanes a and b. Similarly, for positive training samples, any sample which lies between -infinity and c is an error in the sense mentioned above. The soft-margin SVM algorithm finds the best trade-off between reducing the training error in this sense and maximizing the margin (which, if you pause to think about it, are tugging in opposite directions, so to speak).
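To make the geometric picture above concrete, here is a minimal scikit-learn sketch (not from the original post; the toy dataset, the value of C and the numerical tolerance are arbitrary choices) that fits a soft-margin linear SVM and checks which samples end up as support vectors:

```python
# Hypothetical illustration of the a/b/c picture: b is {x : w.x + b0 = 0} and
# the margin planes a, c are {x : w.x + b0 = -1} and {x : w.x + b0 = +1}.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

clf = SVC(kernel="linear", C=1.0)       # C sets the margin/slack trade-off
clf.fit(X, y)

w, b0 = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)        # perpendicular distance from a to c
f = clf.decision_function(X)            # w.x + b0 for every training sample
y_pm = np.where(y == clf.classes_[1], 1, -1)

# Support vectors are (up to numerical tolerance) the samples with
# y * f(x) <= 1: on a margin plane, strictly inside the margin, or on the
# wrong side of b altogether.
in_or_on_margin = y_pm * f <= 1.0 + 1e-8
print("geometric margin           :", margin)
print("support vectors (sklearn)  :", len(clf.support_))
print("samples with y * f(x) <= 1 :", int(in_or_on_margin.sum()))
```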
> Besides the samples lying on the margin hyperplanes a, c, the types of training-error samples mentioned above also become support vectors in a soft-margin SVM (i.e. samples that are in the margin as well as the truly misclassified samples).
>
> Now to answer questions 1, 2: some theoretical and practical analysis has been advanced in the past to show that, in the limit as the number of training samples tends to infinity, the fraction of the training samples that are support vectors tends to the minimum achievable error rate for an optimal hyperplane classifier.
>
> From the above discussion you should be able to get an idea of which samples become support vectors and why. For nonlinear classifiers this whole discussion happens in the high-dimensional space induced by the kernel, but otherwise everything else is the same. To specifically answer your question, a sample is a support vector if (1) it is a training error in the sense of misclassification, or (2) it lies on the margin hyperplanes or inside the margin, even though it is not an obviously misclassified sample.
>
> Now, for a random draw of a small number of training samples from an underlying data-generating distribution, the number of support vectors is not a fixed, deterministic number independent of the training set. Indeed, because of the randomness in the draw of the training set, the number of support vectors will be random. However, given a fixed training set (and keeping the kernel parameters and the trade-off parameter C fixed also), the number of support vectors is fixed.
>
> 3.a. Just because you know that there are 10K features, there is no reason to presuppose that either a linear or a non-linear kernel is a priori better for a problem, *before you see training data*. However, given training data consisting of (feature vector, label) pairs, there are many ways to decide which is the best kernel parameter, the best regularization (i.e. trade-off) parameter, etc. For example, the simplest, albeit less theoretically well-motivated, approach is to compute the cross-validation estimate of the error for each setting of these parameters, and to choose the best setting based on these CV error estimates. Another approach, from the learning-theory world, is to use some (*very loose*, but in some tenuous sense claimed by that community to be reasonable) learning-theoretical upper bounds on the error rate and to minimize these bounds as a function of the parameters we want to tune. A third, more Bayesian approach is to use approximate evidence (i.e. Bayesian marginal likelihood) computations for the same purpose. Nevertheless, in many small-n-large-p problems (the standard name in the statistics community for classification problems where the number of training samples is much smaller than the number of features) occurring in the natural world, it has commonly been observed empirically that the simple linear kernel works best. This is purely empirical evidence on some datasets, and *not necessarily true* of any specific data set you may be interested in analyzing. For any analysis you may undertake, I strongly suggest that you choose parameters using the methods mentioned above.
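The cross-validation route mentioned in (a) can be sketched in a few lines of scikit-learn; the dataset, parameter grids and fold count below are arbitrary choices for illustration, not anything prescribed in the thread:

```python
# Hypothetical sketch: choose kernel and C by cross-validation in a
# small-n-large-p setting (few samples, many features).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 60 samples, 1000 features, only 10 of them informative
X, y = make_classification(n_samples=60, n_features=1000, n_informative=10,
                           random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", 1e-3, 1e-4]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters :", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```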
> b. The fact that the training data can be separated by a large margin does *not* mean you have evidence of overfitting. Indeed, the whole rationale for the SVM is based on this very observation. The whole learning-theoretical premise on which the SVM algorithm was developed rests on the claim that (so long as all the samples lie in some unit hypersphere) the larger the margin separating the positive training samples from the negative samples, the better the guarantees that can be provided that your test set error rate will not be very high. The Vapnik school of thought would say that the fact that you have found a large-margin classifier on your training set is no reason to suspect overfitting, and in fact they would argue that it is reason to expect excellent generalization (i.e. no overfitting). For more explanation of the learning theory see, for instance, Vapnik's book of 1998.
>
> c. There is no evidence, either empirical or even learning-theoretical, to claim that the SVM is the best classifier. It is also false to claim that the SVM is not affected by the curse of dimensionality: both empirical evidence and the learning-theoretical arguments used to advance the SVM in the first place deny this. Specifically, in the radius/margin-based upper bounds on the error rate of hyperplane classifiers, *if the radius of the smallest hypersphere enclosing all the samples that could be generated were a constant*, then we would be justified in simply maximizing the margin in a learning algorithm. However, between different kernels that result from retaining different numbers of features, the radius value changes, so just maximizing the margin while retaining all the original features is not guaranteed to optimize even the radius/margin bounds on the test set error rate.
>
> The claim that the SVM is immune to the curse of dimensionality has long been dismissed by serious machine learning researchers. It was originally advanced in light of the fact that, *as compared to algorithms (like backpropagation for training neural networks) that only attempt to minimize the training error, without using any form of regularization*, the SVM algorithm was empirically found to be much more immune to the effect of irrelevant predictor variables. However, keeping the training sample size fixed, as the number of irrelevant predictor variables (i.e. features drawn at random in a way uncorrelated with the classification label) is increased to infinity, by sheer random chance some of the purely random features will turn out to be well correlated with the output class label; thus you will eventually start to overfit as the number of these variables increases and starts to dominate your truly relevant variables. From this argument it is clear that absolutely no system under the sun can ever be completely unaffected by having an arbitrary number of irrelevant features. SVMs don't even come close to this long-proclaimed (and also long-disproved) holy grail of the early non-technical cheerleaders.
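The fixed-sample-size argument in the last paragraph is easy to probe empirically. The sketch below (a hypothetical experiment with arbitrary sizes and a Gaussian noise model, not anything from the thread) pads a small training set with increasing numbers of irrelevant random features and reports the resulting test accuracy, which one would typically see degrade as the noise features come to dominate:

```python
# Hypothetical sketch: hold the training set size fixed and add more and
# more irrelevant (label-independent) random features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=100, random_state=0)

for n_noise in [0, 100, 1000, 10000]:
    # Append n_noise pure-noise columns to both training and test sets.
    Z_tr = np.hstack([X_tr, rng.normal(size=(X_tr.shape[0], n_noise))])
    Z_te = np.hstack([X_te, rng.normal(size=(X_te.shape[0], n_noise))])
    acc = SVC(kernel="linear", C=1.0).fit(Z_tr, y_tr).score(Z_te, y_te)
    print(f"{n_noise:>6} irrelevant features -> test accuracy {acc:.3f}")
```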
> Disclaimer: I fundamentally disagree with much of the learning theory behind the argument for minimizing upper bounds on error rates (which is sometimes held up as the theoretical core of the argument for SVMs), and I strongly hold that their result is a logical fallacy (see Radford Neal's comments at his invited talk at NIPS for a brief explanation, or I can post another reply about this if required). Though I have written a hand-waving explanation above in a deliberate attempt to retain clarity of explanation, I'm fairly sure that I understand all their logic, derivations and claims.
>
> Knowing that this whole area borders on a very old and counterproductive holy war between Bayesians and frequentists, I have tried for the most part to be respectful and non-controversial in my explanation, and have even tried to answer things from the learning-theory community's point of view (since this is, after all, an SVM group). If despite my best efforts I have given offense to anyone, I apologise in advance; let us not start a pointless flame war.
>
> Monika Ray wrote:
>
>> Hello,
>>
>> 1. What does the number of support vectors depend on? Or is it random - just the data close to the hyperplane?
>>
>> 2. I don't understand why not all the datapoints near the optimal hyperplane are support vectors.
>>
>> 3. If you have, say, 10000 features, one has to use a non-linear kernel. One would need to use a polynomial kernel because only then do you get the combinations of the different datapoints. However, because you have such a large number of features, using a polynomial kernel will map the data into a feature space in which the points will be very far apart. Then separating the 2 classes will be a very trivial thing, as you can have many hyperplanes with large margin since the data is so sparse. I understand that some may call this overfitting... but in what other kind of kernel can you get the combination of datapoints as you do with a polynomial kernel... this is a kind of a priori knowledge you have about the problem.
>>
>> Sincerely,
>> Monika Ray
>>
>> Quoting Monika Ray <[log in to unmask]>:
>>
>>> Hello,
>>>
>>> SVM is known to defeat the curse of dimensionality... then why is it so that having a large number of attributes/features in the data is not desirable when using SVM?
>>>
>>> I thought it should not matter.... Where am I getting confused?
>>>
>>> Thank You.
>>>
>>> Sincerely,
>>> Monika Ray
>>>
>>> **********************************************************************
>>> The sweetest songs are those that tell of our saddest thought...
>>>
>>> Computational Intelligence Centre, Washington University St. Louis, MO
>>> **********************************************************************
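Relating back to questions 1 and 2, and to the "fraction of support vectors" remark at the top of the thread, the rough relationship between the number of support vectors and the achievable error rate can be probed empirically. The sketch below (a hypothetical experiment with an arbitrary dataset, label-noise level and kernel settings) trains an RBF-kernel soft-margin SVM and prints the fraction of training samples that are support vectors alongside the held-out error rate:

```python
# Hypothetical sketch: compare the fraction of support vectors with the
# test error of an RBF-kernel soft-margin SVM on noisy synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# flip_y adds label noise so the minimum achievable error is nonzero
X, y = make_classification(n_samples=20000, n_features=5, n_informative=5,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=5000, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
frac_sv = len(clf.support_) / len(X_tr)
err = 1.0 - clf.score(X_te, y_te)
print(f"fraction of support vectors: {frac_sv:.3f}")
print(f"held-out test error rate   : {err:.3f}")
```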
