JiscMail: Email discussion lists for the UK Education and Research communities

SUPPORT-VECTOR-MACHINES Archives

SUPPORT-VECTOR-MACHINES@JISCMAIL.AC.UK

Subject: Re: training/test
From: "Hsiung, Chang (DPAN)" <[log in to unmask]>
Reply-To: The Support Vector Machine discussion list <[log in to unmask]>
Date: Tue, 8 Mar 2005 10:09:47 -0700
Content-Type: text/plain
Parts/Attachments: text/plain (137 lines)

How did you determine the SVM parameters (C, kernel width, etc.)?
Did you select them by cross-validation on the training set and then apply the same parameters to the test set?
Chang Hsiung
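[Editor's note] The protocol Chang describes can be sketched as follows. This is an illustrative toy, not code from the thread: a 1-D Parzen-window (RBF) classifier whose kernel width gamma is chosen by leave-one-out cross-validation on the training set only, with the test set touched exactly once at the end. The data points and the gamma grid are invented for the example.

```python
import math

def parzen_predict(x, train, gamma):
    """Predict the class whose training points give the largest summed RBF score."""
    scores = {}
    for xi, yi in train:
        scores[yi] = scores.get(yi, 0.0) + math.exp(-gamma * (x - xi) ** 2)
    return max(scores, key=scores.get)

def loo_accuracy(train, gamma):
    """Leave-one-out accuracy on the training set for a given kernel width."""
    hits = 0
    for i, (x, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += parzen_predict(x, rest, gamma) == y
    return hits / len(train)

# Toy 1-D two-class data (invented for illustration).
train = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
test = [(-1.2, 0), (1.2, 1)]

# Step 1: choose gamma by cross-validation on the training set only.
grid = [0.1, 1.0, 10.0]
best_gamma = max(grid, key=lambda g: loo_accuracy(train, g))

# Step 2: apply the selected gamma exactly once to the held-out test set.
test_acc = sum(parzen_predict(x, train, best_gamma) == y
               for x, y in test) / len(test)
```

If the cross-validation score itself is to be reported as a performance estimate, the selection would additionally need to be nested inside an outer loop, so that the reported score is not biased by the search.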

-----Original Message-----
From: The Support Vector Machine discussion list
[mailto:[log in to unmask]]On Behalf Of Monika Ray
Sent: Tuesday, March 08, 2005 8:52 AM
To: [log in to unmask]
Subject: Re: training/test


Thank you, Balaji.  I knew the theory behind it, but it's good to see
someone else's explanation, as it always helps to understand things better.

As a matter of fact, I don't trust my model: even though it did well
on the test set, I feel that the training sample set is too small.

However, the reason this scenario came up in the first place is this:

One finds numerous SVM/NN application papers that report excellent
results on the test set.  However, they never mention what happened on
the training set.  If they got 100% or even 90% accuracy on the training
set, that is fine.  However, if they got 68% accuracy, some readers may
not be satisfied with such a model.

This scenario comes into play especially in non-CS areas.
One finds many biologists, physicians, and some EEs publishing papers
that do exactly this.

With the explosion of gene expression data, many clinicians have begun
applying machine learning to it.  However, the majority of these papers
have a total of only 40-60 samples, yet the number of dimensions is
extremely high.  They state that all is well with the SVM/NN, or whatever
other method they use, because it gives excellent results on the test
set.  However, considering their small sample set, I would like to see
how the model fared on the training set.

Thank You.

Sincerely,
Monika Ray

***********************************************************************
The sweetest songs are those that tell of our saddest thought...

Computational Intelligence Centre, Washington University, St. Louis, MO
**********************************************************************


On Fri, 4 Mar 2005, Balaji Krishnapuram wrote:

> Monika,
>
> An interesting question.
>
> Let's first summarize the message from learning theory that underlies much of the SVM (in this case I really mean the theory behind the two-class problem, mostly because I am not very familiar with the theoretical arguments for the one-class model, but others may clarify the issue).
>
> Basically, learning theory gives us the following sorts of guarantees. I'll outline the broad approach, again using hand-waving intuitive arguments for the sake of clarity (eschewing equations and a more rigorous presentation, since these can be found elsewhere, but often in a form which makes us lose sight of the forest for the trees):
>
> 1. We start with an assumption/intuition (e.g. classifiers that have large margins on the training data will have good generalization, or classifiers which use very few basis functions or kernels achieve good generalization). In PAC-Bayes theory (see, for example, Matthias Seeger's work on PAC-Bayes theory, which extends and improves upon the original proofs by McAllester), this is made explicit, but every other method I have seen also starts with some assumptions, which are usually left implicit.
>
> 2. Using the intuition in step 1, and the further assumption that every sample is drawn i.i.d. from an underlying probability distribution P_{XY}(x,y), we can then develop probabilistic upper bounds on the error rate that the classifier will incur on unseen future test samples.
>
> In general these bounds state:
>
> "With probability (1-delta) over the random draw of a training set of i.i.d. samples drawn from the same probability distribution P_{XY}, the following upper bound will hold (i.e. the following bound will not be violated):
>
> The true (i.e. test set) error rate on samples drawn i.i.d. from P_{XY} will be <= a function bound(training set error rate [e], number of training samples [m], how often the bound can be violated [delta], a quantitative measure of how much the prior intuition encoded in step 1 is supported or disproved by the training data [e.g. the margin on the training data, used as a running example below, or other intuitions, such as the belief that sparse classifiers which use few basis functions or kernels will have good generalization])."
>
> The function bound(...) depends on several parameters, including delta. If delta is very small (e.g. tending to 0), the inequality mentioned above will be satisfied almost always over a random draw of the training set. Unfortunately, the function bound(...) increases monotonically as delta becomes smaller, so that when delta is close to zero the value of bound(...) becomes very large, and in the limit the inequality simply says the true error rate is < 100%, which is always guaranteed to be true.
>
> On the other hand, when delta becomes larger, the value of bound(...) decreases monotonically, to the point where it makes more emphatic (i.e. non-obvious) statements. Unfortunately, in this case, these statements (i.e. the inequality written above within the quotes) are violated more often over the random draw of training samples, so the inequality doesn't mean much in the limit.
>
> Secondly, if the number of training samples becomes small, the function bound(...) again becomes large. The intuition is that, based on a very small training set, I can't make any emphatic claims about unseen test samples, and this is only to be expected. In short, without data, you just can't make any statements about what to expect in future, and there is no magic answer that will universally satisfy you. The same situation arises when we provide an algorithm with too little training data; but don't ask what "too little" is, or it will start a whole discussion on sample complexity bounds :)
>
> OK, now that we understand what these bounds are like, what are the take-home messages?
>
> The function bound(...) also becomes small when the training set error e is small *and at the same time* the original intuition we started out with is strongly validated by the training data. E.g. if we see that the training error is small, and also that the radius/margin ratio on the training data is very small [i.e. the margin is large], then we can make strong statements that the generalization error on unseen samples will also be small.
>
> If either of the conditions in the previous paragraph is not met, then I can't give you strong guarantees of good generalization. Please be very careful in reading/interpreting this statement, because this is important! Basically, if I find either that my training set error is large, or that my margin on the training sample is small, then I cannot *guarantee* that you will achieve a good (i.e. low) generalization error rate on unseen test samples.
>
>
> What does this mean?
> Basically, this means that in order to guarantee good generalization, a *sufficient* condition is that I have: (a) low training error, and (b) the original intuition used to derive these bounds is satisfied. If either of these is not true, then I can't guarantee good performance on unseen data *using bounds derived from this specific intuition*. Thus, these sorts of learning-theoretical guarantees are sufficient conditions for ensuring good generalization, *but not necessary conditions*. In particular, if either my training error rate is large, or my starting intuition for deriving the bound is violated, you may still achieve excellent generalization on unseen test samples; I just can't guarantee this confidently *using the bound derived from this intuition*, since the bound will only state that my error rate is less than a large number (such as 90%, or even 100%, which is always the case). But this does not mean that the generalization error can't be small in practice!!
>
> This points to one of the major failings of learning theory, which people tend to gloss over while eulogizing these so-called breakthroughs. This is not just an academic criticism, but one which occurs often in practice. For example, Herbrich and Graepel showed in an elegant NIPS paper that the error bounds they obtained for a simple perceptron classifier were much better than those of the SVM, even though the perceptron achieved much smaller margins than the SVM. Another, more poignant, proof of this concept for me was the observation, on many practical problems, that algorithms which used all the basis functions often achieved better generalization than those that used only a small, sparse set of basis functions (e.g. few support vectors), even though I had derived the same sort of generalization bounds for the latter as were available for the SVM (and besides yielding tighter bounds than the margin bounds for the SVM, this sparse classifier algorithm performed at least as well as the SVM on many of the test problems from the UCI datasets, USPS, etc. that are used as general-purpose benchmarks of these methods and theoretical bounds).
>
> Basic take-home message from this whole, long discussion: don't expect practical answers to your question from the learning theory that underlies much of the SVM literature. Now, another complementary strain of learning theory is somewhat more trustworthy: the literature on *test set bounds* (see, e.g., Langford's ICML-03 tutorial paper, also written up later as a JMLR paper) gives both good upper bounds and good lower bounds on the true test error, based on what you found on a separate set of i.i.d. test samples. These bounds are also much tighter, so even with small sample sizes the predictions are much more trustworthy. In this sense, if on a large enough set of test samples not used during training you found that your method worked well, then you can guarantee good generalization on future i.i.d. test samples which you have not seen yet, *without making any further assumptions except i.i.d. data*.
>
> Okay.... now that the rant is over, what are some practical things you can do?
> First, do you truly believe the data is i.i.d., and further, do you really have a large enough test set to conclude that the test set performance is 100%? If so, rely on the test set error bounds to guide you, and you can conclude that you should trust your model, regardless of the training set error (on an admittedly small training set).
>
> One excellent piece of advice already given in the previous reply to your post is to look at the cross-validation stability. Alternatively, the intuition I take away from that message is to closely investigate what is so strangely different between the mistakes in your training sample and the rest of the test sample, which seemed to be correctly handled: are they coming from the same distribution at all, or are you ignoring some important non-homogeneity in the data? In other words, fundamentally investigate whether your data is i.i.d. at all!
>
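[Editor's note] The cross-validation stability check can be sketched as follows; this is an illustrative toy with invented 1-D data, not code from the thread. Train the same simple model, here a nearest-centroid rule, on each fold split and look at the spread of the per-fold accuracies; a large standard deviation across folds is the warning sign.

```python
import math

def nearest_centroid_predict(x, train):
    """Assign x to the class with the nearest mean of its training points."""
    centroids = {}
    for label in set(y for _, y in train):
        pts = [xi for xi, yi in train if yi == label]
        centroids[label] = sum(pts) / len(pts)
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def fold_accuracies(data, k):
    """k-fold CV with interleaved folds: accuracy of the rule trained on the rest."""
    accs = []
    for f in range(k):
        test = [p for i, p in enumerate(data) if i % k == f]
        train = [p for i, p in enumerate(data) if i % k != f]
        hits = sum(nearest_centroid_predict(x, train) == y for x, y in test)
        accs.append(hits / len(test))
    return accs

# Toy 1-D two-class data (invented for illustration), interleaved so that
# every fold leaves both classes represented in the training split.
data = [(-3.0, 0), (1.0, 1), (-2.0, 0), (1.5, 1),
        (-1.5, 0), (2.0, 1), (-1.0, 0), (3.0, 1)]
accs = fold_accuracies(data, k=4)
mean = sum(accs) / len(accs)
std = math.sqrt(sum((a - mean) ** 2 for a in accs) / len(accs))
```

With real data, the interesting case is when the fold accuracies vary wildly; that instability, rather than any single score, is what should undermine trust in the model.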
> If you find physical reasons to suspect non-homogeneity (e.g. several samples collected from the same patient, with the total pool collected from several such patients, so that samples from the same patient are probably not independent; viewed differently, the data has clusters and the clusters are not uniformly sampled), then resort to more sophisticated statistical models, such as generalized linear mixed or random-effects models, if they are appropriate. These models relax the independence assumption using additional information, and may be able to model your data better... I don't know without further information, and this is only a reasonable suggestion.
>
> A (Lindley-style) subjective Bayesian answer to your question (can I trust my model?) is that you should not even break the data into training and test sets. In order to compare different models (e.g. a first-order polynomial and a second-order polynomial), I throw all the data (i.e. training and testing) into one large pool and evaluate the marginal likelihood, i.e. the evidence, for each model. Whichever model among those you consider best predicts your data is the one you should trust most, but you should still use all of them to predict things about the future; only, you should weight the predictions of each model by its posterior probability after seeing the data.
>
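[Editor's note] The posterior-weighted averaging described above can be sketched numerically; the evidences and per-model predictions here are made-up numbers, purely to show the mechanics. With a uniform prior over models, the posterior weight of each model is its evidence divided by the sum of evidences, and the overall prediction is the weighted average of the per-model predictions.

```python
# Hypothetical marginal likelihoods (evidences) p(D | M_k) for two models,
# e.g. a first-order and a second-order polynomial fit to the pooled data.
evidences = [0.002, 0.008]

# Uniform prior over models => posterior weights proportional to evidence.
total = sum(evidences)
weights = [e / total for e in evidences]

# Hypothetical per-model predictions for some future quantity of interest.
predictions = [0.3, 0.6]

# Bayesian model averaging: posterior-weighted prediction.
averaged = sum(w * p for w, p in zip(weights, predictions))
```

The better-supported model dominates the average, but the weaker model is never discarded outright, which is exactly the point of the Lindley-style answer.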
>
> Finally, there are also stability-based learning-theoretical bounds (if I split my data into subsets, how much does the performance vary across subsets, and what does this say about performance on future unseen samples?), but I'm not going to go into that here. I'll fight that battle some other day!
>
> Hope that long-winded answer helped at least somewhat :)
>
> Balaji
>
>
>
> Monika Ray wrote:
> > Hello,
> >
> > I know that it's OK not to get a 100% correct result on the training set,
> > as long as one can get good accuracy on the test set.
> >
> > I was using a one-class SVM for data that had too few samples for one class.
> > I got 100% accuracy on the test set, but the accuracy on the training set
> > was less than 50%... so should I be trusting the model?  Methinks not.
> >
> > What is your opinion? What should be the minimum accuracy on the training
> > set?
> >
> > Sincerely,
> > Monika Ray
> >
> > ***********************************************************************
> > The sweetest songs are those that tell of our saddest thought...
> >
> > Computational Intelligence Centre, Washington University, St. Louis, MO
> > **********************************************************************
> >
>

