Hello

First, I apologise profusely for not getting back individually to the many people who responded to my request for help - as usual for us all no doubt, pressure of work intervened!! However, many thanks to all who responded - it was greatly appreciated. Let me summarize the problem - and each response in turn. I've maintained some degree of anonymity for the respondents.

-The Problem-

Developing a classifier model, using logistic regression, where the outcome (dependent variable) is binary, with massively unbalanced 0/1 counts (25:1 and even more extreme) and a reasonably large sample of 2500+ observations in total. Sampling is quasi-random, in that segments of the seabed in various locations around New Zealand are being observed for presence/absence of a fish species. This is the first time such sampling has taken place using this species and these particular seabed attributes. The aim is to determine which variables are predictive of fish presence/absence.

-My Suggested Solution-

Use all the low-count (presence) data (n=250), subsample the remaining 2500 (absence) data to create 1000 samples of size n=250, then run a logistic regression on each of the 1000 samples (with the same n=250 "presence" sample merged with each random sample in turn). In this way, bootstrap all parameters and create the classifier and cost-matrix using average parameter values.

-Responses-

1. This might be of interest to you: http://www.biostat.wustl.edu/archives/html/s-news/2004-04/msg00104.html

Harry, S.

[P.B. This was actually a link to two papers dealing with "bias reduction" in MLE estimates - which seem to be concerned with calculating unbiased parameter estimates for predictor variables which perfectly discriminate between the two classes of the dependent variable. This was not specifically the problem with my data. However, it was interesting to learn about penalised likelihood.]

============================================================

2.
Just a very quick off-the-cuff response: the 'predicted' value given by the logistic is a continuous variable. With approximately equal numbers of 0s and 1s you would take <0.5 as 'predicting' 0 and >0.5 as 'predicting' 1. In your case, all the predicted values are presumably <0.5, but if you used a different cut-off for 'prediction', and took say >0.1 as predicting 1, you might get a better result.

You are studying a low-probability event, and you've looked at lots of sites, and for every site where you've got a hit there are probably lots of very similar sites where you've got a miss. One possibility might be to categorise your predictor variables so that you could group similar sites (those with the same predictor values) and then model the probability of fish presence within each type of site: i.e. you essentially move from a binary-data problem to a binomial one. Good luck, anyway!

Dennis, C.

[P.B. The first paragraph is an excellent suggestion - in fact I used the base rate of occurrence as the boundary classifier probability, and classified the cases using that value. Worked a treat and seems quite logical - I should have thought of this myself!!]

============================================================

3. I'm not quite sure I understand how you are tackling this. It sounds as if you are taking samples of 200 x 1 and 200 x 0 values, giving 400 values equally split between 1 and 0. Then you do it over, 1000 times. Basically, you can't just rig the sample to oversample the rare class. The reason is that you then have a non-ignorable design, i.e. the data don't represent the underlying population. You have to correct for that. My advice would be to use a Bayesian logistic regression on the rigged sample. The prior distribution then allows you to specify what you know, namely that the real population is different to the sample.
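[P.B. For the record: a standard non-Bayesian fix for the "rigged sample" problem described here is the prior correction given in the King and Zeng paper cited under response #8 below - fit the ordinary logistic regression on the balanced sample, then shift the fitted intercept back towards the known population base rate. A rough Python/numpy sketch on invented data (the model, numbers, and variable names are mine, purely for illustration):]

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Plain Newton-Raphson (IRLS) fit of a logistic regression.
    X must include a leading column of 1s for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        # Newton step: beta += (X'WX)^-1 X'(y - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)

# An invented "population": presence is rare, driven by one seabed attribute.
n = 20000
x = rng.normal(size=n)
true_p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.5 * x)))
y = (rng.random(n) < true_p).astype(float)
tau = y.mean()                     # population proportion of presences

# A deliberately balanced ("rigged") sample: every presence, plus an
# equal number of randomly chosen absences.
ones = np.flatnonzero(y == 1)
zeros = rng.choice(np.flatnonzero(y == 0), size=ones.size, replace=False)
idx = np.concatenate([ones, zeros])
Xs = np.column_stack([np.ones(idx.size), x[idx]])
beta = fit_logistic(Xs, y[idx])
ybar = y[idx].mean()               # proportion of 1s in the rigged sample (0.5)

# Prior correction: subtract the log of the sampling odds ratio from the
# intercept, pulling predicted probabilities back to the population scale.
offset = np.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))
beta_corrected = beta.copy()
beta_corrected[0] -= offset
```

[P.B. continued: the slope estimates can be left alone - under this kind of outcome-dependent sampling only the intercept is biased, and by a known amount.]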
As for software, the WinBUGS software is free from the Medical Research Council at http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml Hope this helps. Blaise, E.

[P.B. I'm not so sure that my suggested procedure is flawed. In fact, a paper suggested below by another respondent demonstrates that there is some evidence that what I was suggesting might indeed be considered optimal. I took a look at the WinBUGS material - unusable without a massive re-education effort - and it is unclear whether it is required at all.]

============================================================

4. I think you need to look at *sparse* logistic regression. Just a couple of months ago there was an article about it in Bioinformatics (in gene expression analysis you also have a lot of n<p problems): http://bioinformatics.oupjournals.org/cgi/content/abstract/19/17/2246 You'll find an R package SparseLogReg implementing the method on CRAN: http://cran.r-project.org/src/contrib/Descriptions/SparseLogReg.html Cheers, Korbinian, S.

[P.B. This methodology seems to be concerned with the problem of fitting models to data where the number of predictors is huge in comparison to the number of cases. Not my problem - but interesting papers nevertheless.]

============================================================

5. Hi, see the paper by Firth: Firth, D. (1993). Biometrika 80(1): 27-38. He has written a nice package for R, called brlr (bias-reduced logistic regression), which implements his methods very nicely. It is available from the R web site. You might also try a Web of Science search for papers which cite the above key paper. Cheers, Simon, B.

[P.B. This is essentially the same advice as in #1 above.]

============================================================

6. Dear Paul, do I understand you correctly: you get 2700 predicted probabilities for presence of fish but the highest prediction is below 0.5?
If you now use a cutpoint of >=0.5 as the prediction for fish, then you will obviously get a perfect specificity of 100% (always predicting no fish given there is none). But your sensitivity has the poor value of 0% (never actually finding a fish given there is one). Now just lower the cutpoint! If you try every possible cutpoint you will end up with a so-called ROC curve (receiver operating characteristic curve). With SPSS you should be able to do this easily. Kind regards, Harald, H.

[P.B. This is essentially the same advice as in #2 above - but it sows the seeds of an optimal-classifier investigation using ROC and cost/benefit error optimization - will do!]

============================================================

7. Dear Paul, my first thought is that logistic regression is not the way to handle the sort of data you have. However, you don't say *why* you are doing this analysis, so I don't know your research question, and it may be the best way to approach the questions that actually motivate the study. In any case I will be most interested in the responses to your enquiry, because a very low rate of positives is a feature of many datasets that have come my way. In case I miss subsequent contributions on Allstat, I'd appreciate a summary of what you learn.

My way of characterizing your problem is to think of 200 events distributed over a 10-dimensional space. A natural question would be to consider whether they cluster in particular regions of that space, taking account of the shape of the space itself and of how it has been sampled. The shape of the space could come from PCA, and if variables are correlated, redundancy can be reduced and the problem simplified by substituting a smaller number of uncorrelated dimensions. This could be a starting point for various analyses. However, the approach that seems most direct is Discriminant Analysis, since it directly asks how best to distinguish the two populations of 'quadrats' in the 10 dimensions.
Since it's a linear model (but so, really, is the one you are considering) it will only work well if the difference between the populations can be expressed as a difference in their centroids. It would not work if the positives are 'clumped together' in various nooks and crannies of the space. In that case you might need some form of clustering approach. Sandy, M.

[P.B. This seems to hinge on the optimality or otherwise of linear (least squares) DFA vs logistic regression as DFA ... As to the redundancy issue - I only have about 8-10 predictors!]

============================================================

8. Hiya, you want to look at oversampling. Check out the following references:

King and Zeng: http://gking.harvard.edu/files/0s.pdf
Scott, A. and Wild, C. (1986). Fitting logistic regression models under case-control sampling. JRSS B, 48, 170-182.
Prentice, R. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66(3), 403-411.
Weiss and Provost: http://www.research.rutgers.edu/~gweiss/papers/ml-tr-44.pdf

I am sure there are more ... Paul, T.

[P.B. Bingo - these two sets of papers are exactly what the doctor ordered. The King and Zeng papers are excellent - containing a correction for oversampled data (better thought of as a high ratio of non-occurrences to occurrences). But the Weiss and Provost paper is "it" in a nutshell. Its title is "The effect of class distribution on classifier learning: an empirical study". From the abstract: "This study shows that the naturally occurring class distribution often is not best for learning, and often substantially better performance can be obtained by using a different class distribution". Their simulations are done using CART methods - but the principle seems to be the same for logistic regression classifier construction.]

============================================================

9.
Dear Paul, I have not come across the bootstrap scenario you are using, and my feeling is that you should perform the bootstrap on the original dataset rather than subsampling the "0" case data. Obviously, if you carry out the classical bootstrap you run the risk of not picking up any "1"s, and the whole estimation process will fail. On the other hand, doing it your own way, you run the risk of over-estimating the probability of presence in your equation.

I think this is the situation of a finite mixture problem, where you consider that your sampled locations can be split into two populations: the first being the locations where the fish exist, and the second being those where the fish do not exist. Having formed the likelihood for the two populations, you can proceed by optimizing the mixture density using either the EM algorithm or simulation. Alternatively, you could go Bayesian to overcome the estimation problem. You could have a look at "Regression Analysis of Count Data" by Cameron and Trivedi. Hope this is of some help to you ... Best wishes, Dimitrious, L.

[P.B. OK - the finite mixture stuff may be a goer - and the Bayesian stuff - who knows?!! I just want a classifier that will fit incoming new data with some robustness. It would appear that answers #2, #6, and #8 provide the most direct solutions.]

So - I'm now setting up a small empirical study examining the utility of simply using the base rate as the classifier threshold on the unequal-count data, using ROC to locate a possible optimum threshold, vs 50/50 class-distribution selective-sample bootstrapping a la Weiss and Provost, using CART and logistic methods, on some simulated and real datasets.

Many, many thanks to everybody who replied - and I hope the above is helpful to those who were curious because they have the same kind of problem as I do!

Regards ...
Paul
_____________________________________________________________________
Paul Barrett
Adjunct Professor of Psychometrics and Performance Measurement
University of Auckland
email: [log in to unmask]       DDI: +64-(0)9-238-6336
[log in to unmask]              Fax: +64-(0)9-353-1681
Web: www.pbarrett.net           Mobile: +64-021-415625
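PS: For anyone who wants to play with the cutpoint ideas in responses #2 and #6, here is a rough Python/numpy sketch on simulated predicted probabilities - the distributions and counts are invented for illustration and are not my fish data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted probabilities: the class is rare, every prediction is
# (almost surely) below 0.5, but presences still score higher on average.
n0, n1 = 2500, 100
p_absent = rng.beta(1.0, 30.0, size=n0)    # absences: clustered near 0
p_present = rng.beta(3.0, 20.0, size=n1)   # presences: higher, rarely > 0.5
p = np.concatenate([p_absent, p_present])
y = np.concatenate([np.zeros(n0), np.ones(n1)])
base_rate = y.mean()

def sens_spec(cut):
    """Sensitivity and specificity of the rule 'predict fish if p >= cut'."""
    pred = (p >= cut).astype(float)
    sens = pred[y == 1].mean()        # P(predict fish | fish present)
    spec = 1.0 - pred[y == 0].mean()  # P(predict no fish | fish absent)
    return sens, spec

sens_half, spec_half = sens_spec(0.5)        # conventional cutpoint
sens_base, spec_base = sens_spec(base_rate)  # base-rate cutpoint (response #2)

# Crude ROC sweep (response #6): try every cutpoint and keep the one
# maximising sensitivity + specificity (the Youden criterion).
cuts = np.linspace(0.0, 0.5, 501)
youden = [sum(sens_spec(c)) for c in cuts]
best_cut = cuts[int(np.argmax(youden))]
```

With these invented numbers, the conventional 0.5 cutpoint finds essentially no presences, while the base-rate cutpoint trades some specificity for a large gain in sensitivity; sweeping all cutpoints traces out the ROC and locates an optimum.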