Let me go into a bit more detail on the problem, to explain why I have
so many independent variables. I'm working on a project that requires
classifying web pages as "interesting" or "not interesting". The
independent variables are 0/1 variables indicating whether a particular
feature is present on the web page -- generally, whether or not a
particular word is found on the page. This is why I have such a large
number of independent variables, even after some preprocessing to select
only those features that appear most promising.
I had originally considered using the technique used by the new
"Bayesian" spam filters (http://www.paulgraham.com/antispam.html) to
classify email as spam or non-spam. That technique is based on
computing, for each feature i, the frequency of positive classifications
given that feature i is found, then combining these frequencies to come
up with a formula for P(class | set of features). However, on closer
inspection the assumptions behind this technique are problematic; those
who get it to work rely on a variety of heuristic hacks to achieve
reasonable performance; and I found it did not work well for my problem.
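For reference, the kind of per-feature frequency combination I mean is
roughly the following (a minimal Python sketch of the naive-Bayes-style
combination those filters use; the independence-given-class assumption
built into it is exactly what turned out to be problematic here):

    def combine(frequencies):
        """Combine per-feature frequencies f_i = P(y = 1 | feature i present)
        into a single P(y = 1 | features present), treating features as
        independent given the class."""
        p_pos = 1.0
        p_neg = 1.0
        for f in frequencies:
            p_pos *= f
            p_neg *= (1.0 - f)
        return p_pos / (p_pos + p_neg)

    # e.g. three features present on a page, with illustrative frequencies:
    print(combine([0.9, 0.6, 0.2]))   # -> roughly 0.77
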
I then decided to work out from first principles how to properly combine
the information in those frequencies, applying Jaynes's principle of
maximum entropy. I supposed a given distribution P(x) = m(x) for the
vector of independent variables x, then constructed the maximum-entropy
joint distribution I over x and y (the 0/1 dependent variable)
satisfying the following constraints:
- P(x = x' | I) = m(x')
- P(y = 1 | x[i] = 1, I) = f_i, for all i
Taking this distribution and computing P(y = 1 | x = x', I), I found
that the distribution m(x) cancelled out, and the result had the form of
a logistic regression, i.e.,
logit(P(y = 1 | x = x', I)) = beta0 + (SUM i:: beta[i] * x[i]).
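In other words, once the beta's are known, the predictive probability is
just the inverse logit of a linear combination of the features. A tiny
illustration (Python; the coefficients below are arbitrary numbers for
demonstration, not fitted values):

    import math

    def p_interesting(x, beta0, beta):
        """P(y = 1 | x) under the logistic form derived above.
        x is the 0/1 feature vector, beta the per-feature coefficients."""
        eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
        return 1.0 / (1.0 + math.exp(-eta))   # inverse logit

    # arbitrary illustrative numbers
    print(p_interesting([1, 0, 1], beta0=-1.0, beta=[0.8, -0.3, 1.2]))
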
Sujit Ghosh wrote:
> Did you try this problem in other software like SAS, R etc? [...]
I have now. Doing 22-fold cross-validation for the entire process
(feature selection + regression), and classifying a page as interesting
if the computed probability of "interesting" is >= 0.5, I find that I
have a 9% classification error rate on "interesting" examples and a 3.6%
error rate on "uninteresting" examples. This is good enough to be
useful for my project, although I would like to improve it.
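For concreteness, the cross-validation loop I mean is along these lines
(a rough Python sketch; `select` and `fit` are stand-ins for my
feature-selection and regression steps, which are not shown, and the
fitted model is assumed to expose a scikit-learn-style predict_proba):

    import numpy as np

    def cross_validate(X, y, select, fit, n_folds=22):
        """Per-class hold-out error rates for the full pipeline.
        X: 0/1 feature matrix, y: 0/1 labels.  `select` returns column
        indices chosen from the training fold; `fit` returns a model."""
        idx = np.arange(len(y))
        np.random.shuffle(idx)
        folds = np.array_split(idx, n_folds)
        errors = {0: [0, 0], 1: [0, 0]}          # class -> [mistakes, count]
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            cols = select(X[train], y[train])    # selection done inside the fold
            model = fit(X[np.ix_(train, cols)], y[train])
            p = model.predict_proba(X[np.ix_(test, cols)])[:, 1]
            yhat = (p >= 0.5).astype(int)        # "interesting" iff P >= 0.5
            for true, pred in zip(y[test], yhat):
                errors[int(true)][1] += 1
                errors[int(true)][0] += int(pred != true)
        return {c: mistakes / count for c, (mistakes, count) in errors.items()}
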
> BTW for the logistic model even the MLE could be infinite for
> some configuration of the design matrix X.
This does in fact happen -- some words appear only in web pages of one
class. I handled that by effectively introducing a zero-centered prior
on the beta's -- adding, for each feature i, two training examples
having only feature i and no other, one labeled as "interesting" and the
other labeled as "uninteresting". (Plotting this prior, I found it to
be a reasonably close representation of my expectations.)
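Concretely, the augmentation amounts to something like this (a rough
Python sketch of what I described; X is the 0/1 design matrix, y the
labels, and the intercept is handled separately by the model):

    import numpy as np

    def add_pseudo_examples(X, y):
        """Append, for each feature i, two rows that have feature i set
        and all other features zero: one labeled 1 ("interesting") and
        one labeled 0 ("uninteresting").  This acts like a zero-centered
        prior on the beta's and keeps the MLE finite even when a feature
        occurs only in pages of one class."""
        n_features = X.shape[1]
        extra_X = np.vstack([np.eye(n_features, dtype=X.dtype),
                             np.eye(n_features, dtype=X.dtype)])
        extra_y = np.concatenate([np.ones(n_features, dtype=y.dtype),
                                  np.zeros(n_features, dtype=y.dtype)])
        return np.vstack([X, extra_X]), np.concatenate([y, extra_y])
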
David Spiegelhalter wrote:
> perhaps WinBUGS is suggesting that fitting 480 independent variables to
> 770 datapoints may not be reasonable!
That's precisely why I was wary of using MLE... which ended up working
better than I expected. My motive in using WinBUGS was to mitigate the
problem by collecting a sample over the posterior, then either
- choosing the posterior median for each parameter,
- choosing the posterior mean for each parameter, or
- using the sample to do minimum predictive discrepancy parameter
estimation (http://leuther-analytics.com/papers.html).
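Given a matrix of posterior draws exported from WinBUGS (rows = MCMC
samples, columns = parameters), the first two options are simple column
summaries; a sketch (the third option, minimum predictive discrepancy,
is described in the paper linked above and not implemented here):

    import numpy as np

    def posterior_point_estimates(draws):
        """draws: (n_samples, n_parameters) array of posterior samples.
        Returns the posterior median and mean for each parameter."""
        return {"median": np.median(draws, axis=0),
                "mean": np.mean(draws, axis=0)}
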