Hello SVM-list,
This is my first post to this mailing list. I'm currently writing my
diploma thesis on statistical approaches to email classification.
I implemented a prototype multi-class SVM (one-vs-one) and get good results
for a test email corpus with some very short emails.
I'm using a bag-of-words approach with stopword removal, stemming and an
RBF kernel.
My application needs to train very quickly, as it will be built into an
email client.
Expensive cross-validation is not really an option. Joachims writes: "[...]
since all document vectors are normalized to unit length, it is easy to show
that the radius R of the ball containing all training examples is tightly
bound by [...] R^2 <= 2(1-exp(-gamma))"
I don't know how to interpret this. How do I get the radius of the ball
containing all samples in feature space? Do I need to normalize all document
vectors to unit length?
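
To make the quantities in the question concrete, here is a minimal sketch in plain Python. It is not taken from the prototype and does not try to reproduce the bound from the quote. It only shows unit-length normalization, the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2), and the kernel-trick identity ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2 K(x,y). The function names, the toy term weights and the choice of the centroid as ball center are assumptions made for illustration. The centroid-based value is only an upper bound on the squared radius of the smallest enclosing ball.

```python
import math

def normalize(vec):
    """Scale a sparse term-weight vector (dict: term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an empty/all-zero document unchanged
    return {t: w / norm for t, w in vec.items()}

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for sparse dict vectors."""
    terms = set(x) | set(y)
    sq_dist = sum((x.get(t, 0.0) - y.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-gamma * sq_dist)

def feature_space_sq_dist(x, y, gamma):
    """Squared distance between phi(x) and phi(y), using kernel values only."""
    return (rbf_kernel(x, x, gamma) + rbf_kernel(y, y, gamma)
            - 2.0 * rbf_kernel(x, y, gamma))

def centroid_radius_sq(docs, gamma):
    """Largest squared feature-space distance from a document to the centroid
    of all documents. This upper-bounds the squared radius of the smallest
    ball enclosing the documents in feature space."""
    n = len(docs)
    K = [[rbf_kernel(a, b, gamma) for b in docs] for a in docs]
    mean_all = sum(sum(row) for row in K) / (n * n)
    return max(K[i][i] - 2.0 * sum(K[i]) / n + mean_all for i in range(n))

# Hypothetical toy corpus: raw term counts for three very short emails.
docs = [normalize(d) for d in (
    {"meeting": 2, "tomorrow": 1},
    {"meeting": 1, "lunch": 1},
    {"invoice": 3},
)]
gamma = 0.5  # arbitrary value, for illustration only
print(feature_space_sq_dist(docs[0], docs[1], gamma))
print(centroid_radius_sq(docs, gamma))
```

Note that for the RBF kernel K(x, x) = 1 for every x, so every document already has unit length in feature space, whether or not the input vectors were normalized.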
Choosing a good value for gamma up front is important, because my kernel
cache is filled for one fixed gamma. Maybe I can still use cross-validation
for C, as changing C doesn't require updating the caches.
Thank you for any help or pointers regarding my problem.
Best regards
Thorsten Jacoby