JiscMail: Email discussion lists for the UK Education and Research communities

SUPPORT-VECTOR-MACHINES Archives
SUPPORT-VECTOR-MACHINES@JISCMAIL.AC.UK


Subject: Re: svm and curse of dimensionality
From: Ingo Steinwart <[log in to unmask]>
Reply-To: The Support Vector Machine discussion list <[log in to unmask]>
Date: Thu, 17 Feb 2005 14:51:31 -0700
Content-Type: text/plain
Parts/Attachments: text/plain (259 lines)

Hi all,

here are just a few comments:

To answers 1 and 2:
The fraction of the training samples that are support vectors
tends to TWICE the minimum achievable error rate for an optimal
classifier. And this only holds if you use, e.g., a Gaussian kernel
and simply sum up the slacks. Otherwise you should, in general,
expect even more support vectors.

To answer 3.c:
It is well known that there exists no best classifier, and consequently,
neither SVMs nor any other method can be a best classifier.
See the book by Devroye, Gyoerfi and Lugosi.

ingo




Balaji Krishnapuram wrote:

>Without a figure it becomes a little difficult to explain these
>questions, but for the sake of an email discussion I'll try to paint the
>picture in words. It might be useful to follow this discussion with the
>aid of a picture drawn out and labeled on paper. Further, the discussion
>pertains to linear hyperplane classifiers, except where noted at the
>end in relation to nonlinear classifiers.
>
>Imagine three *parallel* hyperplanes and let us call them a, b and c
>(these are labeled *in sequence* as you move from -infinity towards
>these planes, hit them, and then move on to +infinity). Further imagine
>that the shortest (ie perpendicular) distance between hyperplanes a and b
>is the same as the distance between hyperplanes b and c.
>
>First consider hard margin SVMs:
>In a hard margin SVM [ie one without any slack variables] we *want* all
>the negative samples to lie on one side, between -infinity and a, and
>further we want all positive samples to lie on the opposite side,
>between c and +infinity. Note that though we would like this, it may
>never be possible to separate the positives and negatives, so we may not
>be able to achieve this at all. Nevertheless, if it is possible to find
>such hyperplanes that separate the samples, which one is best? Avoiding
>the theoretical arguments, and in a handwaving sense, the basic
>intuition is this: the best classifier is that hyperplane b which has
>the maximum margin between its parallel hyperplanes a and c, and still
>manages to separate the positives and negatives so they lie on opposite
>sides of c and a respectively. In this case, the support vectors are
>simply the samples which lie on the hyperplanes a or c (these are
>the training samples that are closest to the class boundary, ie
>hyperplane b).
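This hard-margin picture can be sketched numerically. Below is a minimal illustration using scikit-learn's SVC; the data points and the very large C value (used to approximate a hard margin) are invented for this example, not taken from the discussion:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],   # negatives
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])  # positives
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard-margin SVM: slack is penalized so
# heavily that no training point may lie inside the margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The support vectors are exactly the samples lying on the margin
# hyperplanes a and c, i.e. the points closest to the boundary b.
print(clf.support_)          # indices of the support vectors
print(clf.support_vectors_)  # the points themselves
```

Here only the three points nearest the gap between the clusters end up on hyperplanes a and c and become support vectors; the outlying points carry zero weight in the solution.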
>
>Now, in general, one may never be able to find any hyperplane capable of
>completely separating the positives and negatives so that they lie on
>opposite sides. In this case we need a soft margin SVM. The basic mental
>picture we drew above still remains valid, except that some training
>samples are allowed to be training errors *in the following sense*. A
>negative sample which lies between b and +infinity is obviously
>incorrectly classified, and should be considered a training error.
>However, there is another class of sample which is also considered an
>error: the set of samples between hyperplanes a and b. Similarly, for
>positive training samples, any sample which lies between -infinity
>and c is an error in the sense mentioned above. The soft margin SVM
>algorithm finds the best tradeoff between reducing the training error in
>this sense and maximizing the margin (which, if you pause to think about
>it, are tugging in opposite directions, so to speak). Besides the samples
>lying on the margin hyperplanes a and c, the types of training error
>samples mentioned above also become support vectors in a soft margin SVM
>(ie samples that are in the margin as well as the truly misclassified
>samples).
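A minimal soft-margin sketch (again with invented data and an arbitrarily chosen small C): one extra negative point is planted between the clusters, so that keeping a wide margin forces it to take on slack, and it joins the support vector set even though it is not on a margin hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Separable clusters plus one negative point sitting near the gap.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0],
              [2.0, 2.0]])               # awkward negative point
y = np.array([-1, -1, -1, 1, 1, 1, -1])

# A small C tolerates slack rather than shrinking the margin to nothing.
clf = SVC(kernel="linear", C=0.1).fit(X, y)

# Support vectors now include in-margin points, not just on-margin ones.
print(clf.support_)          # index 6 (the planted point) is included
```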
>
>Now to answer questions 1 and 2:
>Some theoretical and practical analysis has been advanced in the past to
>show that, in the limit as the number of training samples tends to
>infinity, the fraction of the training samples that are support vectors
>tends to the minimum achievable error rate for an optimal hyperplane
>classifier.
>
>From the above discussion you should be able to get an idea of which
>samples become support vectors and why. For nonlinear classifiers this
>whole discussion happens in the high dimensional space induced by the
>kernel, but otherwise everything else is the same. To specifically answer
>your question, a sample is a support vector if (1) it is a training
>error in the sense of misclassification, or (2) it lies on the
>margin hyperplanes or inside the margin, even though it is not an
>obviously misclassified sample.
>
>Now, for a random draw of a small number of training samples from an
>underlying data generating distribution, the number of support vectors
>is not a fixed, deterministic number independent of the training set.
>Indeed, because of the randomness in the draw of the training set, the
>number of support vectors will be random. However, given a fixed
>training set (and keeping the kernel parameters and the tradeoff
>parameter C fixed also), the number of support vectors is fixed.
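That determinism is easy to check: refitting on the same data with the same kernel and C reproduces the same support vector set, while a fresh random draw generally changes it. A small sketch with synthetic data (all parameter values picked arbitrarily):

```python
import numpy as np
from sklearn.svm import SVC

def draw(seed):
    """Draw a noisy 2-D binary problem (synthetic, for illustration)."""
    r = np.random.default_rng(seed)
    X = r.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = r.random(100) < 0.1            # 10% label noise -> class overlap
    y[flip] = 1 - y[flip]
    return X, y

X, y = draw(0)
a = SVC(kernel="rbf", C=1.0).fit(X, y)
b = SVC(kernel="rbf", C=1.0).fit(X, y)
# Fixed training set + fixed (kernel, C) -> fixed support vector set.
print(np.array_equal(a.support_, b.support_))   # True

# A different random draw of the training set generally changes the count.
X2, y2 = draw(1)
c = SVC(kernel="rbf", C=1.0).fit(X2, y2)
print(len(a.support_), len(c.support_))
```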
>
>3.
>a. Just because you know that there are 10K features, there is no reason
>to pre-suppose that either a linear or a non-linear kernel is a priori
>better in a problem, *before you see training data*. However, given the
>training data consisting of (feature vector, label) pairs, there are
>many ways to decide which is the best kernel parameter, best
>regularization (ie tradeoff) parameter, etc. For example, the simplest,
>albeit least theoretically well-motivated, approach is to simply compute
>the cross-validation estimate of the error for each setting of these
>parameters, and to choose the best setting based on these CV error
>estimates. Another approach, in the learning theory world, is to use some
>(*very loose*, but in some tenuous sense claimed by the community to be
>reasonable) learning theoretical upper bounds on the error rate and to
>minimize these bounds as a function of the parameters that we want to
>tune. A third, more Bayesian approach is to use approximate evidence (ie
>Bayesian marginal likelihood) computations for the same purpose.
>Nevertheless, in many small-n-large-p problems (just the standard name
>employed in the stats community for classification problems where the
>number of training samples is much less than the number of features)
>occurring in the natural world, it has commonly been empirically observed
>that the simple linear kernel works best. This is purely empirical
>evidence on some datasets, and *not necessarily true* of any specific
>data set you may be interested in analyzing. For any analysis you may
>undertake, I strongly suggest that you choose parameters using the
>methods mentioned above.
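The cross-validation route can be sketched in a few lines with scikit-learn; the synthetic dataset and the candidate parameter grid below are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A synthetic problem with many features but few informative ones.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Pick kernel and C purely by cross-validated error, as described above.
grid = GridSearchCV(SVC(),
                    param_grid={"kernel": ["linear", "rbf"],
                                "C": [0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)       # the setting with the best CV estimate
```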
>
>b. The fact that the training data can be separated by a large margin
>does *not* mean you have evidence of overfitting. Indeed, the whole
>rationale for the SVM is based on this very observation. The whole
>learning theoretical premise on which the SVM algorithm was developed
>rests on the claim that (so long as all the samples lie in
>some unit hypersphere), the larger the margin separating the positive
>training samples from the negative samples, the better the guarantees
>that can be provided that your test set error rate will not be very high.
>The Vapnik school of thought would say that the fact that you have found
>a large margin classifier on your training set is no reason to suspect
>overfitting, and in fact they would argue that it is reason to expect
>excellent generalization (ie no overfitting). For more explanation of
>the learning theory see, for instance, Vapnik's 1998 book.
>
>c. There is no evidence, either empirical or even learning theoretical,
>to claim that the SVM is the best classifier. It is also false to claim
>that the SVM is not affected by the curse of dimensionality: both
>empirical evidence and the learning theoretical arguments used to
>advance the SVM in the first place deny this. Specifically, in the
>radius/margin based upper bounds for error rate of hyperplane
>classifiers, *if the radius of the smallest hypersphere enclosing all
>the samples that could be generated were a constant* then we are
>justified in maximizing the margin in a learning algorithm. However,
>between different kernels that result from retaining different numbers
>of features, the radius value changes, so just maximizing the margin
>retaining all the original features is not guaranteed to optimize even
>the radius/margin bounds for test set error rate.
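For reference, the radius/margin bound alluded to here is, in one common rough form (notation mine; a sketch rather than a precise statement of any particular theorem):

```latex
% R: radius of the smallest sphere enclosing the (feature-space) samples
% \gamma: margin achieved on n training samples
% Roughly, the leave-one-out style guarantee scales as
\mathbb{E}[\mathrm{err}] \;\lesssim\; \frac{1}{n}\,
    \mathbb{E}\!\left[\frac{R^{2}}{\gamma^{2}}\right]
% so maximizing \gamma only helps if R stays (roughly) constant,
% which is exactly the caveat about changing kernels and feature sets.
```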
>
>The claim that the SVM is immune to the curse of dimensionality has long
>been dismissed by serious machine learning researchers. It was
>originally advanced in light of the fact that, *as compared to
>algorithms (like backpropagation for training neural networks) that only
>attempt to minimize the training error, without using any form of
>regularization*, the SVM algorithm was empirically found to be much more
>immune to the effect of irrelevant predictor variables. However, keeping
>the training sample size fixed, as the number of irrelevant predictor
>variables (ie features drawn at random in a way uncorrelated with the
>classification label) is increased to infinity, by sheer random chance
>some of the purely random features will turn out to be well correlated
>with the output class label; thus you will eventually start to overfit as
>the number of these variables increases and starts to dominate your
>truly relevant variables. From this argument, it is clear that
>absolutely no system under the sun can ever be completely unaffected by
>having an arbitrary number of irrelevant features. SVMs don't even come
>close to this long proclaimed (and also long disproved) holy grail of
>the early non-technical cheerleaders.
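This degradation is easy to reproduce. The sketch below (entirely synthetic data; dimensions, seed and C chosen arbitrarily) appends thousands of label-independent noise features to a well-separated 2-D problem and watches cross-validated accuracy fall:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60

# Two informative dimensions: class means separated by 3 units each.
y = np.repeat([0, 1], n // 2)
X_info = rng.normal(size=(n, 2))
X_info[y == 1] += 3.0

# Thousands of features drawn independently of the label.
noise = rng.normal(size=(n, 4000))
X_noisy = np.hstack([X_info, noise])

svm = SVC(kernel="linear", C=1.0)
acc_clean = cross_val_score(svm, X_info, y, cv=5).mean()
acc_noisy = cross_val_score(svm, X_noisy, y, cv=5).mean()
print(acc_clean, acc_noisy)    # accuracy drops once noise dominates
```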
>
>Disclaimer:
>I fundamentally disagree with much of the learning theory behind the
>argument for minimizing upper bounds on error rates (which is sometimes
>held as the theoretical core of the argument for SVMs), and I strongly
>hold that their result is a logical fallacy (see Radford Neal's comments
>in his invited talk at NIPS for a brief explanation, or I can post
>another reply about this if required). Though I have written a
>hand-waving explanation above in a deliberate attempt to retain clarity
>of explanation, I'm fairly sure that I understand all their logic,
>derivations and claims.
>
>Knowing that this whole area borders on a very old and counterproductive
>holy war between Bayesians and frequentists, I have tried for the most
>part to be respectful and non-controversial in my explanation, and have
>even tried to answer things from the learning theory community's point
>of view (since this is, after all, an SVM group). If, despite my best
>efforts, I have given offence to anyone, I apologise in advance; let us
>not start a pointless flame war.
>
>Monika Ray wrote:
>
>
>
>>Hello,
>>
>>1. What does the number of support vectors depend on? Or is it random -
>>just the data close to the hyperplane?
>>
>>2. I don't understand why not all the datapoints near the optimal
>>hyperplane are support vectors.
>>
>>3. If you have, say, 10000 features, one has to use a non-linear kernel.
>>One would need to use a polynomial kernel because only then do you get
>>the combinations of the different datapoints.  However, because you have
>>such a large number of features, using a polynomial kernel will map the
>>data into a feature space in which the points will be very far apart.
>>Then separating the 2 classes will be a very trivial thing, as you can
>>have many hyperplanes with large margin as the data is so sparse.  I
>>understand that some may call this overfitting..but in what other kind
>>of kernel can you get the combination of datapoints as you do with a
>>polynomial kernel...this is a kind of a priori knowledge you have about
>>the problem.
>>
>>Sincerely,
>>Monika Ray
>>
>>
>
>
>
>>Quoting Monika Ray <[log in to unmask]>:
>>
>>
>>
>>>Hello,
>>>
>>>SVM is known to defeat the curse of dimensionality...then why is it so
>>>that having a large number of attributes/features in the data is not
>>>desirable when using SVM?
>>>
>>>I thought it should not matter....where am I getting confused?
>>>
>>>Thank You.
>>>
>>>Sincerely,
>>>Monika Ray
>>>
>>>**********************************************************************
>>>The sweetest songs are those that tell of our saddest thought...
>>>
>>>Computational Intelligence Centre, Washington University, St. Louis, MO
>>>**********************************************************************
>>>
>>>
>>>
>>
>
>
>
