Hello

First, I apologise profusely for not getting back individually to the many people who responded to my request for help - as usual for us all no doubt, pressure of work intervened!! However, many thanks to all who responded - it was greatly appreciated. Let me summarize the problem - and each response in turn. I've maintained some degree of anonymity for the respondents.

-The Problem-

Developing a classifier model, using logistic regression, where the outcome (dependent variable) is binary, with massively unbalanced 0/1 counts (25:1 and even more extreme) and a reasonably large sample of 2500+ observations in total. Sampling is quasi-random, in that segments of the seabed in various locations around New Zealand are being observed for presence/absence of a fish species. This is the first time such sampling has taken place using this species and these particular seabed attributes. The aim is to determine which variables are predictive of fish presence/absence.

-My Suggested Solution-

Use all the low-count (presence) data (n=250), subsample the remaining 2500 (absence) data to create 1000 samples of size n=250, then run a logistic regression on each of the 1000 samples (with the same n=250 "presence" sample merged with each random sample in turn). In this way, bootstrap all parameters and create the classifier and cost-matrix using average parameter values.

-Responses-

1. This might be of interest to you: http://www.biostat.wustl.edu/archives/html/s-news/2004-04/msg00104.html

Harry, S.

[P.B. This was actually a link to two papers dealing with "bias reduction" in MLE estimates - which seem to be concerned with calculating unbiased parameter estimates for predictor variables which perfectly discriminate between the two classes of the dependent variable. This was not specifically the problem with my data. However, it was interesting to learn about penalised likelihood.]

============================================================

2.
Just a very quick off-the-cuff response: the 'predicted' value given by the logistic is a continuous variable. With approximately equal numbers of 0s and 1s you would take <0.5 as 'predicting' 0 and >0.5 as 'predicting' 1. In your case, all the predicted values are presumably <0.5, but if you used a different cut-off for 'prediction', and took say >0.1 as predicting 1, you might get a better result.

You are studying a low-probability event, and you've looked at lots of sites, and for every site where you've got a hit there are probably lots of very similar sites where you've got a miss. One possibility might be to categorise your predictor variables so that you could group similar sites (those with the same predictor values) and then model the probability of fish presence within each type of site: i.e. you essentially move from a binary-data problem to a binomial one. Good luck, anyway!

Dennis, C.

[P.B. The first paragraph is an excellent suggestion - in fact I used the base rate of occurrence as the boundary classifier probability, and classified the cases using that value. Worked a treat and seems quite logical - I should have thought of this myself!!]

============================================================

3. I'm not quite sure I understand how you are tackling this. It sounds as if you are taking samples of 200 x 1 and 200 x 0 values, giving 400 values equally split between 1 and 0. Then you do it over, 1000 times. Basically, you can't just rig the sample to oversample the rare class. The reason is that you then have a non-ignorable design, i.e. the data don't represent the underlying population. You have to correct for that. My advice would be to use a Bayesian logistic regression on the rigged sample. The prior distribution then allows you to specify what you know, namely that the real population is different to the sample.
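[P.B. For the record: a standard non-Bayesian fix for the "rigged sample" problem described here is the prior correction given in the King and Zeng paper cited under response #8 below - fit the ordinary logistic regression on the balanced sample, then shift the fitted intercept back towards the known population base rate. A rough Python/numpy sketch on invented data (the model, numbers, and variable names are mine, purely for illustration):]

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Plain Newton-Raphson (IRLS) fit of a logistic regression.
    X must include a leading column of 1s for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        # Newton step: beta += (X'WX)^-1 X'(y - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)

# An invented "population": presence is rare, driven by one seabed attribute.
n = 20000
x = rng.normal(size=n)
true_p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.5 * x)))
y = (rng.random(n) < true_p).astype(float)
tau = y.mean()                     # population proportion of presences

# A deliberately balanced ("rigged") sample: every presence, plus an
# equal number of randomly chosen absences.
ones = np.flatnonzero(y == 1)
zeros = rng.choice(np.flatnonzero(y == 0), size=ones.size, replace=False)
idx = np.concatenate([ones, zeros])
Xs = np.column_stack([np.ones(idx.size), x[idx]])
beta = fit_logistic(Xs, y[idx])
ybar = y[idx].mean()               # proportion of 1s in the rigged sample (0.5)

# Prior correction: subtract the log of the sampling odds ratio from the
# intercept, pulling predicted probabilities back to the population scale.
offset = np.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))
beta_corrected = beta.copy()
beta_corrected[0] -= offset
```

[P.B. continued: the slope estimates can be left alone - under this kind of outcome-dependent sampling only the intercept is biased, and by a known amount.]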
As for software, the WinBUGS software is free from the Medical Research Council at http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml Hope this helps. Blaise, E.

[P.B. I'm not so sure that my suggested procedure is flawed. In fact, a paper suggested below by another respondent demonstrates that there is some evidence that what I was suggesting might indeed be considered optimal. I took a look at the WinBUGS material - unusable without a massive re-education effort - and it is unclear whether it is required at all.]

============================================================

4. I think you need to look at *sparse* logistic regression. Just a couple of months ago there was an article about it in Bioinformatics (in gene expression analysis you also have a lot of n<p problems): http://bioinformatics.oupjournals.org/cgi/content/abstract/19/17/2246 You'll find an R package SparseLogReg implementing the method on CRAN: http://cran.r-project.org/src/contrib/Descriptions/SparseLogReg.html Cheers, Korbinian, S.

[P.B. This methodology seems to be concerned with the problem of fitting models to data where the number of predictors is huge in comparison to the number of cases. Not my problem - but interesting papers nevertheless.]

============================================================

5. Hi, see the paper by Firth: Firth, D. (1993). Biometrika 80(1): 27-38. He has written a nice package for R, called brlr (bias-reduced logistic regression), which implements his methods very nicely. It is available from the R web site. You might also try a Web of Science search for papers which cite the above key paper. Cheers, Simon, B.

[P.B. This is essentially the same advice as in #1 above.]

============================================================

6. Dear Paul, do I understand you correctly: you get 2700 predicted probabilities for presence of fish but the highest prediction is below 0.5?
If you now use a cutpoint of >=0.5 as the prediction for fish, then you will obviously get a perfect specificity of 100% (always predicting no fish given there is none). But your sensitivity has the poor value of 0% (never actually finding a fish given there is one). Now just lower the cutpoint! If you try every possible cutpoint you will end up with a so-called ROC curve (receiver operating characteristic curve). With SPSS you should be able to do this easily. Kind regards, Harald, H.

[P.B. This is essentially the same advice as in #2 above - but it sows the seeds of an optimal-classifier investigation using ROC and cost/benefit error optimization - will do!]

============================================================

7. Dear Paul, my first thought is that logistic regression is not the way to handle the sort of data you have. However, you don't say *why* you are doing this analysis, so I don't know your research question, and it may be the best way to approach the questions that actually motivate the study. In any case I will be most interested in the responses to your enquiry, because a very low rate of positives is a feature of many datasets that have come my way. In case I miss subsequent contributions on Allstat, I'd appreciate a summary of what you learn.

My way of characterizing your problem is to think of 200 events distributed over a 10-dimensional space. A natural question would be to consider whether they cluster in particular regions of that space, taking account of the shape of the space itself and of how it has been sampled. The shape of the space could come from PCA, and if variables are correlated, redundancy can be reduced and the problem simplified by substituting a smaller number of uncorrelated dimensions. This could be a starting point for various analyses. However, the approach that seems most direct is Discriminant Analysis, since it directly asks how best to distinguish the two populations of 'quadrats' in the 10 dimensions.
Since it's a linear model (but so, really, is the one you are considering) it will only work well if the difference between the populations can be expressed as a difference in their centroids. It would not work if the positives are 'clumped together' in various nooks and crannies of the space. In that case you might need some form of clustering approach. Sandy, M.

[P.B. This seems to hinge on the optimality or otherwise of linear (least squares) DFA vs logistic regression as DFA ... As to the redundancy issue - I only have about 8-10 predictors!]

============================================================

8. Hiya, you want to look at oversampling. Check out the following references:

King and Zeng: http://gking.harvard.edu/files/0s.pdf
Scott, A. and Wild, C. (1986). Fitting logistic regression models under case-control sampling. JRSS B, 48, 170-182.
Prentice, R. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66(3), 403-411.
Weiss and Provost: http://www.research.rutgers.edu/~gweiss/papers/ml-tr-44.pdf

I am sure there are more ... Paul, T.

[P.B. Bingo - these two sets of papers are exactly what the doctor ordered. The King and Zeng papers are excellent - containing a correction for oversampled data (better thought of as a high ratio of non-occurrences to occurrences). But the Weiss and Provost paper is "it" in a nutshell. Its title is "The effect of class distribution on classifier learning: an empirical study". From the abstract: "This study shows that the naturally occurring class distribution often is not best for learning, and often substantially better performance can be obtained by using a different class distribution". Their simulations are done using CART methods - but the principle seems to be the same for logistic regression classifier construction.]

============================================================

9.
Dear Paul, I have not come across the bootstrap scenario you are using, and my feeling is that you should perform the bootstrap on the original dataset rather than subsampling the "0" case data. Obviously, if you carry out the classical bootstrap you run the risk of not picking up any "1"s, and the whole estimation process will fail. On the other hand, doing it your own way, you run the risk of over-estimating the probability of presence in your equation.

I think this is the situation of a finite mixture problem, where you consider that your sampled locations can be split into two populations: the first being the locations where the fish exist, and the second being those where the fish do not exist. Having formed the likelihood for the two populations, you can proceed by optimizing the mixture density using either the EM algorithm or simulation. Alternatively, you could go Bayesian to overcome the estimation problem. You could have a look at "Regression Analysis of Count Data" by Cameron and Trivedi. Hope this is of some help to you ... Best wishes, Dimitrious, L.

[P.B. OK - the finite mixture stuff may be a goer - and the Bayesian stuff - who knows?!! I just want a classifier that will fit incoming new data with some robustness. It would appear that answers #2, #6, and #8 provide the most direct solutions.]

So - I'm now setting up a small empirical study examining the utility of simply using the base rate as the classifier threshold on the unequal-count data, using ROC to locate a possible optimum threshold, vs 50/50 class-distribution selective-sample bootstrapping a la Weiss and Provost, using CART and logistic methods, on some simulated and real datasets.

Many, many thanks to everybody who replied - and I hope the above is helpful to those who were curious because they have the same kind of problem as I do!

Regards ...
Paul
_____________________________________________________________________
Paul Barrett
Adjunct Professor of Psychometrics and Performance Measurement
University of Auckland
email: [log in to unmask]       DDI: +64-(0)9-238-6336
[log in to unmask]              Fax: +64-(0)9-353-1681
Web: www.pbarrett.net           Mobile: +64-021-415625
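PS: For anyone who wants to play with the cutpoint ideas in responses #2 and #6, here is a rough Python/numpy sketch on simulated predicted probabilities - the distributions and counts are invented for illustration and are not my fish data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted probabilities: the class is rare, every prediction is
# (almost surely) below 0.5, but presences still score higher on average.
n0, n1 = 2500, 100
p_absent = rng.beta(1.0, 30.0, size=n0)    # absences: clustered near 0
p_present = rng.beta(3.0, 20.0, size=n1)   # presences: higher, rarely > 0.5
p = np.concatenate([p_absent, p_present])
y = np.concatenate([np.zeros(n0), np.ones(n1)])
base_rate = y.mean()

def sens_spec(cut):
    """Sensitivity and specificity of the rule 'predict fish if p >= cut'."""
    pred = (p >= cut).astype(float)
    sens = pred[y == 1].mean()        # P(predict fish | fish present)
    spec = 1.0 - pred[y == 0].mean()  # P(predict no fish | fish absent)
    return sens, spec

sens_half, spec_half = sens_spec(0.5)        # conventional cutpoint
sens_base, spec_base = sens_spec(base_rate)  # base-rate cutpoint (response #2)

# Crude ROC sweep (response #6): try every cutpoint and keep the one
# maximising sensitivity + specificity (the Youden criterion).
cuts = np.linspace(0.0, 0.5, 501)
youden = [sum(sens_spec(c)) for c in cuts]
best_cut = cuts[int(np.argmax(youden))]
```

With these invented numbers, the conventional 0.5 cutpoint finds essentially no presences, while the base-rate cutpoint trades some specificity for a large gain in sensitivity; sweeping all cutpoints traces out the ROC and locates an optimum.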