Yes.
Sincerely,
Monika Ray
***********************************************************************
The sweetest songs are those that tell of our saddest thought...
Computational Intelligence Centre, Washington University St. Louis, MO
**********************************************************************
On Tue, 8 Mar 2005, Hsiung, Chang (DPAN) wrote:
> How did you determine the SVM parameters (C, kernel width, etc.)?
> Did you find them by cross-validation on the training set and then apply the same parameters to the test set?
> Chang Hsiung
>
> -----Original Message-----
> From: The Support Vector Machine discussion list
> [mailto:[log in to unmask]]On Behalf Of Monika Ray
> Sent: Tuesday, March 08, 2005 8:52 AM
> To: [log in to unmask]
> Subject: Re: training/test
>
>
> Thank you Balaji. I knew the theory behind it, but it's good to see
> someone else's explanation, as it always helps one understand things better.
>
> As a matter of fact, I don't trust my model, and even though it did well
> on the test set, I feel that the training sample set is too small.
>
> However, the reason this scenario came up in the first place is this:
>
> One finds numerous SVM/NN application papers which report
> excellent results on the test set. However, they never mention what happened
> on the training set. If they got 100% or even 90% accuracy on the training
> set, that is fine. However, if they got 68% accuracy, there may be
> some readers who are not satisfied with such a model.
>
> This scenario comes into play especially in the non-CS areas.
> One finds many biologists, physicians, and some EEs publishing papers doing
> exactly such a thing.
>
> With the explosion of gene expression data, most clinicians have begun doing AI
> on that. However, the majority of these papers have a total number of
> samples between 40 and 60, and yet the number of dimensions is extremely
> high. They state that all is well with SVM/NN
> or whatever other method they use, as it gives excellent results on the test
> set. However, considering their small
> sample set, I would like to see how the model fared on the training
> set.
>
> Thank You.
>
> Sincerely,
> Monika Ray
>
>
>
> On Fri, 4 Mar 2005, Balaji Krishnapuram wrote:
>
> > Monika,
> >
> > An interesting question.
> >
> > Let's first summarize the message from learning theory that underlies much of the SVM (in this case I really mean the theory behind the two-class problem, mostly because I am not very familiar with the theoretical arguments for the one-class model, but others may clarify the issue).
> >
> > Basically, learning theory gives us the following sorts of guarantees. I'll outline the broad approach, again using hand-waving intuitive arguments for the sake of clarity (eschewing equations and a more rigorous presentation, since these can be found elsewhere, but are often presented in a form which makes us lose sight of the forest for the trees):
> >
> > 1. We start with an assumption/intuition (e.g. classifiers that have large margins on the training data will have good generalization, or classifiers which use very few basis functions or kernels achieve good generalization). In PAC-Bayes theory (see e.g. Matthias Seeger's work on PAC-Bayes theory, which extends and improves upon the original proofs by McAllester), this is made explicit, but every other method I have seen also starts with some assumptions, which are usually left implicit.
> >
> > 2. Using the intuition in step 1, and a further assumption that every sample is drawn i.i.d. from an underlying probability distribution P_{XY}(x,y), we can then develop probabilistic upper bounds on the error rate that my classifier will incur on unseen future test samples.
> >
> > In general these bounds state
> >
> > "with probability (1 - delta) over the random draw of a training set of i.i.d. samples drawn from the same probability distribution P_{XY}, the following upper bound will be true (i.e. the following bound will not be violated):
> >
> > The true (i.e. test set) error rate on samples drawn i.i.d. from P_{XY} will be <= a function, bound(training set error rate [e], number of training samples [m], how often the bound can be violated [delta], a quantitative measure of how much my prior intuition encoded in step 1 is supported by or disproved by my training data [e.g. the margin on the training data, used as a running example below, or other intuitions like the belief that sparse classifiers that use few basis functions or kernels will have good generalization])"
> >
> > The function bound(...) depends on several parameters, including the parameter delta. If delta is very small (e.g. tending to 0), the inequality mentioned above will be satisfied almost always over a random draw of the training set. Unfortunately, the function bound(...) increases monotonically as delta becomes smaller, so that when delta is close to zero the value of the function bound(...) becomes very large, and in the limit the inequality simply says that the true error rate is <= 100%, which is always guaranteed to be true.
> >
> > On the other hand, when delta becomes larger, the value of the function bound(...) decreases monotonically to the point where it makes more emphatic (i.e. non-obvious) statements. Unfortunately, in this case, these statements (i.e. the inequality written above within the quotes) are violated more often over the random draw of training samples, so the inequality doesn't mean much in the limit.
> >
> > Secondly, if the number of training samples becomes small, again the function bound(...) becomes large: the intuition is that, based on a very small training set, I can't make any emphatic claims about unseen test samples, and this is only to be expected. In short, without data, you just can't make any statements about what to expect in future, and there is no magic answer that will universally satisfy you. The same situation is also found when we provide an algorithm with too little training data, but don't ask what is too little or it will start a whole discussion on sample complexity bounds :)
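
A minimal sketch of how such a bound(...) behaves, assuming a simple Occam's-razor-style bound (training error plus a Hoeffding slack term) rather than the SVM margin bound discussed above. The function name occam_bound and the complexity argument (standing in for how well the prior intuition fits the training data; smaller means better supported) are illustrative assumptions, not anything from the original message:

```python
import math


def occam_bound(train_err, m, delta, complexity):
    """Illustrative upper bound on the true error rate.

    train_err  : training set error rate e, in [0, 1]
    m          : number of i.i.d. training samples
    delta      : allowed probability that the bound is violated, in (0, 1)
    complexity : assumed stand-in for ln(1 / prior weight of the classifier);
                 a small value means the prior intuition is well supported
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # Hoeffding-style slack: grows as delta shrinks or as m shrinks.
    slack = math.sqrt((complexity + math.log(1.0 / delta)) / (2.0 * m))
    # An error rate can never exceed 1, so cap the bound there.
    return min(1.0, train_err + slack)


if __name__ == "__main__":
    for m in (20, 200, 2000):
        for delta in (0.5, 0.05, 1e-6):
            value = occam_bound(train_err=0.1, m=m, delta=delta, complexity=2.0)
            print(f"m={m:5d}  delta={delta:<8g}  bound={value:.3f}")
```

Shrinking delta or m pushes the returned value toward the trivial bound of 1, while a larger m pulls it toward the training error.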
> >
> > OK, now that we understand what these bounds are like, what are the take-home messages?
> >
> > The function bound(...) also becomes small when the training set error e is small *and at the same time* the original intuition we started out with is strongly validated by the training data. E.g. if we see that the training error is small, and also that the radius/margin ratio on the training data is very small [i.e. the margin is large], then we can make strong statements that my generalization error on unseen samples will also be small.
> >
> > If either of the conditions in the previous paragraph is not met, then I can't give you strong guarantees that you will have good generalization. Please be very careful in reading/interpreting this statement, because this is important! Basically, if I find that either my training set error is large, or that my margin on the training sample is small, then I cannot *guarantee* that you will achieve a good (i.e. low) generalization error rate on unseen test samples.
> >
> >
> > What does this mean?
> > Basically this means that in order to guarantee that I can get good generalization, a *sufficient* condition is that I have: (a) low training error, and (b) training data that satisfies the original intuition used to derive these bounds. If either of these is not true, then I can't guarantee good performance on unseen data *using bounds derived from this specific intuition*. Thus, these sorts of learning-theoretic guarantees take the form of sufficient conditions for ensuring good generalization, *but not necessary conditions*. In particular, if either my training error rate is large, or my starting intuition for deriving the bound is violated, you may still achieve excellent generalization on unseen test samples, but I just can't guarantee this confidently *using the bound derived from this intuition*, since the bound will state that my error rate is less than a large number (such as 90%, or even 100%, which will always be the case). This does not mean that the generalization error can't be small in practice!!
> >
> > This points to one of the major failings of learning theory, which people tend to gloss over while eulogizing these so-called breakthroughs from learning theory. This is also not just an academic criticism, but one which arises often in practice. For example, Herbrich and Graepel showed in an elegant NIPS paper that the error bounds they obtained for a simple perceptron classifier were much better than those of the SVM even though the perceptron algorithm achieved much smaller bounds than the SVM. Another, more poignant proof of this concept for me was the observation on many practical problems that algorithms which used all the basis functions often achieved better generalization than those that used only a small, sparse set of basis functions (e.g. few support vectors), even though I had derived the same sort of generalization bounds for the latter as were available for the SVM (and besides having tighter bounds than the margin bounds for the SVM, this sparse classifier algorithm performed at least as well as the SVM on many of the test problems from UCI datasets, USPS etc. that are used as general-purpose benchmark tests of these methods/theoretical bounds etc.).
> >
> > Basic take-home message from this whole long discussion: don't expect practical answers to your question from the learning theory that underlies much of the SVM literature. Now, another complementary strain of learning theory is somewhat more trustworthy: the literature on *test set bounds* (see e.g. Langford's ICML03 tutorial paper, also written up as a JMLR paper later) gives both good upper bounds and good lower bounds on the true test error, based on what you found on a separate set of i.i.d. test samples. These bounds are also much tighter, so even with small sample sizes the predictions are much more trustworthy. So in this sense, if you found that your method worked well on a large enough set of test samples not used during training, then you can guarantee good generalization on future i.i.d. test samples which you have not seen yet *without making any further assumptions except i.i.d. data*.
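
A minimal sketch of a test set bound of this kind, assuming the standard binomial tail inversion: given k errors on m held-out i.i.d. test samples, it finds the largest true error rate that is still consistent with seeing so few errors, at confidence 1 - delta. The function names and the bisection tolerance are illustrative choices, not taken from the tutorial cited above:

```python
import math


def binom_cdf(k, m, p):
    """P[X <= k] for X ~ Binomial(m, p)."""
    return sum(math.comb(m, i) * p**i * (1.0 - p) ** (m - i) for i in range(k + 1))


def upper_bound_from_test_errors(k, m, delta, tol=1e-10):
    """Upper bound on the true error rate, holding with probability >= 1 - delta.

    k     : number of mistakes observed on the held-out test set
    m     : number of i.i.d. test samples (never used for training)
    delta : allowed probability that the bound is violated
    """
    if not 0 <= k <= m or m <= 0:
        raise ValueError("need 0 <= k <= m and m > 0")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if k == m:
        return 1.0  # every test sample was wrong: nothing better than 1
    # binom_cdf(k, m, p) decreases in p, from 1 at p=0 to 0 at p=1,
    # so bisect for the point where it crosses delta.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for m in (20, 100, 1000):
        print(m, round(upper_bound_from_test_errors(0, m, 0.05), 4))
```

With 0 errors on 20 test samples and delta = 0.05 this gives roughly 0.14, and with 0 errors on 100 test samples roughly 0.03, so a perfect score on a small test set still leaves a wide range of plausible true error rates.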
> >
> > Okay.... now that the rant is over, what are some practical things you can do?
> > First, do you truly believe the data is i.i.d., and further, do you really have a large enough test set to conclude that the test set performance is 100%? If so, rely on the test set error bounds to guide you, and you should conclude that you can trust your model, regardless of the training set error (on an admittedly small training set).
> >
> > One excellent piece of advice already given in the previous reply to your post is to look at the cross-validation stability. Alternatively, the intuition I take away from that message is to closely investigate what is so strangely different between the mistakes in your training sample and the rest of the test sample, which seemed to be correctly handled: are they coming from the same distribution at all, or are you ignoring some important non-homogeneity in the data? In other words, fundamentally investigate whether your data is i.i.d. at all!?
> >
> > If you find physical reasons to suspect non-homogeneity (e.g. several samples collected from the same patient, and the total pool of samples collected from several such patients, so that samples from the same patient are probably not independent; viewed differently, the data has clusters and the clusters are also not uniformly sampled), then resort to more sophisticated statistical models such as generalized linear mixed models or random effects models, if they are appropriate. These models relax the independence assumption using additional information, and may be able to model your data better... I don't know without further information, and this is only a reasonable suggestion.
> >
> > A (Lindley-style) subjective Bayesian answer to your question (can I trust my model?) is that you should not even break the data into training and test sets. In order to compare different models (e.g. a first-order polynomial and a second-order polynomial), I throw all (i.e. training and testing) data into one large pool and evaluate the marginal likelihood, i.e. the evidence, for each model. Whichever model among those you consider best predicts your data (or results therein) is the one you should trust most, but you should still use all of them to predict things about the future, only you should weight the predictions of each model by the posterior probability of the model after seeing the training data.
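
A minimal sketch of the weighting step only, assuming the log evidence of each candidate model has already been computed by some other means. The function names, the equal default prior over models, and the numbers in the demo are all hypothetical and only illustrate the arithmetic:

```python
import math


def posterior_model_weights(log_evidences, log_priors=None):
    """Posterior probability of each model from its log evidence.

    log_evidences : log marginal likelihood of the pooled data under each model
    log_priors    : log prior probability of each model (assumed equal if omitted)
    """
    if not log_evidences:
        raise ValueError("need at least one model")
    if log_priors is None:
        log_priors = [0.0] * len(log_evidences)  # equal prior over models
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(s - top) for s in scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def averaged_prediction(predictions, weights):
    """Weight each model's prediction by that model's posterior probability."""
    return sum(w * p for w, p in zip(weights, predictions))


if __name__ == "__main__":
    # Hypothetical log evidences for a first-order and a second-order polynomial.
    weights = posterior_model_weights([-42.0, -40.5])
    # Hypothetical predictions of the two models at one new input.
    print(weights, averaged_prediction([1.8, 2.4], weights))
```

No model is discarded here: a model with lower evidence simply contributes less to the combined prediction.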
> >
> >
> > Finally, there are also stability-based learning-theoretic bounds (if I split my data into subsets, how much does performance vary across the subsets, and what does this say about performance on future unseen samples?), and I'm not going to go into that here. I'll fight that battle some other day!
> >
> > Hope that long-winded answer helped at least somewhat :)
> >
> > Balaji
> >
> >
> >
> > Monika Ray wrote:
> > > Hello,
> > >
> > > I know that it's OK not to get a 100% correct result on the training set as
> > > long as one can get good accuracy on the test set.
> > >
> > > I was using a one-class SVM for data that had too few samples for one class.
> > > I got 100% accuracy on the test set, but the accuracy on the training set
> > > was less than 50%, so should I be trusting the model? Methinks not.
> > >
> > > What is your opinion? What should be the minimum accuracy on the training
> > > set?
> > >
> > > Sincerely,
> > > Monika Ray
> > >
> > >
> >
>
>
