JiscMail - Email discussion lists for the UK Education and Research communities

ALLSTAT Archives

allstat@JISCMAIL.AC.UK



Subject: Re: Logistic Regression Responses
From: Paul Barrett <[log in to unmask]>
Reply-To: Paul Barrett <[log in to unmask]>
Date: Mon, 28 Jun 2004 18:41:58 +1200
Content-Type: text/plain (282 lines)

Hello

First, I apologise profusely for not getting back individually to the many
people who responded to my request for help - as usual for us all, no doubt,
pressure of work intervened!

However, many thanks to all who responded - it was greatly appreciated.

Let me summarize the problem - and each response in turn ... I've
maintained some degree of anonymity for the respondents ...

-The Problem-
Developing a classifier model, using logistic regression, where the outcome
(dependent variable) is binary, with massively unbalanced 0/1 counts (25:1
and even more extreme) and a reasonably large total sample of 2500+
observations. Sampling is quasi-random, in that segments of the seabed in
various locations around New Zealand are being observed for presence/absence
of a fish species. This is the first time such sampling has taken place
using this species and these particular seabed attributes. The aim is to
determine which variables are predictive of fish presence/absence.

-My Suggested Solution-
Use all the low-count (presence) data (n=250), subsample the remaining 2500
(absence) observations to create 1000 samples of size n=250, then fit a
logistic regression to each of the 1000 samples (with the same n=250
"presence" sample merged with each random absence sample in turn). In this
way, bootstrap all parameters and build the classifier and cost-matrix
using average parameter values.
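As an illustration only (this is my sketch, not part of the original
analysis), the balanced-subsampling scheme above can be mocked up in Python
with synthetic stand-in data and a deliberately minimal one-predictor
logistic fit; all the numbers and the predictor itself are hypothetical:

```python
import math
import random

def fit_logistic_1d(xs, ys, lr=0.1, iters=100):
    """Minimal one-predictor logistic regression via gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)          # gradient of log-likelihood w.r.t. b0
            g1 += (y - p) * x      # gradient w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(0)
# synthetic stand-in data: 250 "presence" sites, 2500 "absence" sites,
# with presence sites tending to have a higher predictor value
presence = [(random.gauss(1.0, 1.0), 1) for _ in range(250)]
absence  = [(random.gauss(0.0, 1.0), 0) for _ in range(2500)]

coefs = []
for _ in range(50):            # 1000 resamples in the original scheme; fewer here
    sub = random.sample(absence, 250)     # subsample the majority class
    data = presence + sub                 # merge with all presence cases
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    coefs.append(fit_logistic_1d(xs, ys))

# bootstrap-averaged parameters
avg_b0 = sum(b0 for b0, _ in coefs) / len(coefs)
avg_b1 = sum(b1 for _, b1 in coefs) / len(coefs)
print(avg_b0, avg_b1)
```

Because each resample is artificially balanced, the averaged intercept
reflects the 50/50 sample, not the true 25:1 base rate - which is exactly
the issue the oversampling-correction papers in response #8 address.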


-Responses-

1.
This might be of interest to you:
http://www.biostat.wustl.edu/archives/html/s-news/2004-04/msg00104.html

Harry, S.

[P.B. This was actually a link to two papers dealing with "bias reduction"
in MLE estimates, which seem to be concerned with calculating unbiased
parameter estimates for predictor variables that perfectly discriminate
between the two classes of the dependent variable. This was not
specifically the problem with my data. However, it was interesting to learn
about penalised likelihood.]
============================================================

2.
Just a very quick off-the-cuff response:

The 'predicted' value given by the logistic is a continuous variable. With
approximately equal 0s and 1s you would take <0.5 as 'predicting' 0 and
>0.5 as 'predicting' 1.  In your case, all the predicted values are
presumably <0.5, but if you used a different cut-off for 'prediction', and
took say >0.1 as predicting 1, you might get a better result.

You are studying a low-probability event, and you've looked at lots of
sites, and for every site where you've got a hit there are probably lots of
very similar sites where you've got a miss.  One possibility might be to
categorise your predictor variables so that you could group similar sites
(those with the same predictor values) and then model the probability of
fish presence within each type of site: i.e. essentially you move it from a
binary data problem to a binomial.

Good luck, anyway!

Dennis, C.

[P.B. The first paragraph is an excellent suggestion - in fact I used the
base rate of occurrence as the boundary classifier probability, and
classified the cases using that value. Worked a treat and seems quite
logical - I should have thought of this myself!]
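A toy numerical sketch of this base-rate-cutoff idea (my illustration, with
made-up probabilities, not the respondent's code): with a roughly 25:1
class ratio, a 0.5 cutoff classifies everything as absent, while cutting at
the base rate recovers the presences.

```python
import random

random.seed(1)
# toy imbalanced sample: the fitted model rarely predicts above 0.5,
# so a 0.5 cutoff would classify every site as "absent"
labels = [1] * 25 + [0] * 625              # ~25:1 absence:presence
probs  = [random.uniform(0.05, 0.40) if y else random.uniform(0.0, 0.10)
          for y in labels]

base_rate = sum(labels) / len(labels)      # ~0.038

def sensitivity(cut):
    """Fraction of true presences predicted present at this cutoff."""
    hits = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cut)
    return hits / sum(labels)

print(sensitivity(0.5))        # 0.0 - every presence is missed
print(sensitivity(base_rate))  # all presences recovered in this toy setup
```

The price, of course, is more false positives among the absences; the ROC
sweep in response #6 is the systematic way to examine that trade-off.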
============================================================

3.
I'm not quite sure I understand how you are tackling this. It sounds like
you are taking samples of 200 x 1 and 200 x 0 values, giving 400 values
equally split between 1 and 0, and then repeating this 1000 times.

Basically, you can't just rig the sample to oversample the rare class. The
reason is that you then have a non-ignorable design, i.e. the data don't
represent the underlying population. You have to correct for that. My
advice would be to use a Bayesian logistic regression on the rigged sample.
The prior distribution then allows you to specify what you know, namely
that the real population is different to the sample.

As for software, the WinBUGS software is free from the Medical Research
Council at

http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml

Hope this helps

Blaise, E.

[P.B. I'm not so sure that my suggested procedure is flawed. In fact, a
paper suggested below by another respondent provides some evidence that
what I was proposing might indeed be considered optimal. I took a look at
the WinBUGS material - unusable without a massive re-education effort - and
it is unclear whether it is required at all.]

============================================================

4.
I think you need to look at *sparse* logistic regression.  Just a couple of
months ago there was an article about it in Bioinformatics (in gene
expression analysis you also have a lot of n<p problems):

  http://bioinformatics.oupjournals.org/cgi/content/abstract/19/17/2246

You'll find an R package SparseLogReg implementing the method on CRAN:

  http://cran.r-project.org/src/contrib/Descriptions/SparseLogReg.html

Cheers,
   Korbinian. S

[P.B. This methodology seems to be concerned with the problem of fitting
models to data where the number of predictors is huge in comparison to the
number of cases. Not my problem - but interesting papers nevertheless.]
============================================================

5.
Hi,

See the paper by Firth: Firth, D. (1993). Biometrika 80(1): 27-38.  He has
written a nice package for R, called brlr (bias-reduced logistic
regression), which implements his methods very nicely. It is available from
the R web site. You might also try a Web Of Science search for papers which
cite the above key paper.

Cheers,

Simon, B.
[P.B. This is essentially the same advice as in #1 above.]
============================================================

6.
Dear Paul,

Do I understand you correctly: you get 2700 predicted probabilities for the
presence of fish, but the highest prediction is below 0.5?

If you now use a cutpoint of >=0.5 as the prediction for fish, you will
obviously get a perfect specificity of 100% (always predicting no fish
given there is none). But your sensitivity has the poor value of 0%
(never finding a fish given there is one).

Now just lower the cutpoint! If you try every possible cutpoint you will
end up with a so-called ROC curve (receiver operating characteristic
curve).

With SPSS you should be able to do this easily.

Kind regards,
Harald, H.
[P.B. This is essentially the same advice as in #2 above - but it sows the
seeds of optimal-classifier investigation using ROC and cost/benefit error
optimization - will do!]
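Sweeping cutpoints to trace a ROC curve can be sketched as follows. This is
my own illustration with synthetic scores; Youden's J (sensitivity +
specificity - 1) is one common, but not the only, criterion for picking an
"optimal" cutpoint:

```python
import random

random.seed(2)
# synthetic predicted probabilities: presences score somewhat higher on
# average than absences, but the two distributions overlap
labels = [1] * 50 + [0] * 500
probs  = [random.betavariate(3, 6) if y else random.betavariate(1.5, 10)
          for y in labels]

def rates(cut):
    """Return (sensitivity, false-positive rate) at a given cutpoint."""
    tp = sum(1 for p, y in zip(probs, labels) if y and p >= cut)
    fp = sum(1 for p, y in zip(probs, labels) if not y and p >= cut)
    return tp / 50, fp / 500

# every observed score is a candidate cutpoint; the (fpr, tpr) pairs
# traced out over all of them form the ROC curve
cuts = sorted(set(probs))
roc = [rates(c) for c in cuts]

# Youden's J: cutpoint maximising sensitivity - false-positive rate
best = max(cuts, key=lambda c: rates(c)[0] - rates(c)[1])
print(best, rates(best))
```

A cost/benefit-weighted version would simply replace Youden's J with a
criterion weighting the two error types by their misclassification costs.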


============================================================

7.
Dear Paul,

My first thought is that logistic regression is not the way to handle the
sort of data you have.  However, you don't say *why* you are doing this
analysis so I don't know your research question and it may be the best way
to approach the questions that actually motivate the study.  In any case I
will be most interested in the responses to your enquiry because a very low
rate of positives is a feature of many datasets that have come my way.  In
case I miss subsequent contributions in Allstat I'd appreciate a summary of
what you learn.

My way of characterizing your problem is to think of 200 events distributed
over a 10-dimensional space.  A natural question would be to consider if
they cluster in particular regions of that space, taking account of the
shape of the space itself and of how it has been sampled.

The shape of the space could come from PCA and if variables are correlated
redundancy can be reduced and the problem simplified by substituting a
smaller number of uncorrelated dimensions. This could be a starting point
for various analyses.

However, the approach that seems most direct is Discriminant Analysis,
since it directly asks how best to distinguish the two populations of
'quadrats' in the 10 dimensions. Since it's a linear model (but so, really,
is the one you are considering) it will only work well if the difference
between the populations can be expressed as a difference in their
centroids. It would not work if the positives are 'clumped together' in
various nooks and crannies of the space. In that case you might need some
form of clustering approach.

Sandy, M.
[P.B. This seems to hinge on the optimality or otherwise of linear (least
squares) DFA vs logistic regression as DFA ... As to the redundancy issue -
I only have about 8-10 predictors!]

============================================================

8.
Hiya,

You want to look at oversampling.  Check out the following references:

King and Zeng: http://gking.harvard.edu/files/0s.pdf
Scott & Wild (1986). Fitting logistic regression models under case-control
sampling. JRSS B, 48, 170-182.
Prentice & Pyke (1979). Logistic disease incidence models and case-control
studies. Biometrika, 68(3), 403-11.
Weiss and Provost: http://www.research.rutgers.edu/~gweiss/papers/ml-tr-44.pdf

I am sure there are more....

Paul T.
[P.B. Bingo - these two sets of papers are exactly what the doctor ordered.
The King and Zeng papers are excellent - containing a correction for
oversampled data (better thought of as a high ratio of non-occurrences to
occurrences). But the Weiss and Provost paper is "it" in a nutshell. Its
title is "The effect of class distribution on classifier learning: an
empirical study". From the abstract: "This study shows that the naturally
occurring class distribution often is not best for learning, and often
substantially better performance can be obtained by using a different
class distribution". Their simulations are done using CART methods - but
the principle seems to be the same for logistic regression classifier
construction.]
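For concreteness, the "prior correction" of the intercept described by King
and Zeng can be sketched as below. The numeric values are hypothetical, not
taken from the papers or from my data: tau is the true population rate of
presences, and ybar is the rate of presences in the artificially balanced
estimation sample.

```python
import math

def correct_intercept(b0, tau, ybar):
    """Prior-correct an intercept fitted on an oversampled (balanced)
    sample back to the true population base rate, in the spirit of
    King & Zeng's rare-events correction. Slope coefficients are
    unaffected by this kind of sampling on the outcome."""
    return b0 - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# hypothetical numbers: a 50/50 estimation sample drawn from a population
# where presences occur at roughly a 1-in-26 rate (25:1 absence:presence)
tau, ybar = 1 / 26, 0.5
b0_balanced = 0.2                 # intercept fitted on the balanced sample
b0_pop = correct_intercept(b0_balanced, tau, ybar)
print(b0_pop)                     # shifted down by ln(25), i.e. about -3.02
```

The corrected intercept is pulled sharply downward, which is why predicted
probabilities from an uncorrected balanced-sample model look far too high
when applied to new data drawn at the natural base rate.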

============================================================

9.
Dear Paul

I have not come across the bootstrap scenario you are using, and my feeling
is that you should perform the bootstrap on the original dataset rather
than subsampling the "0" case data. Obviously, if you carry out the
classical bootstrap you run the risk of not picking up any "1"s, and the
whole estimation process will fail. On the other hand, doing it your own
way, you run the risk of over-estimating the probability of presence in
your equation.

I think this is the situation of a finite mixture problem, where you
consider that your sampled locations can be split into two populations: the
first being the locations where the fish exist, and the second being the
locations where they do not. Having formed the likelihood for the two
populations, you can proceed by optimizing the mixture density using either
the EM algorithm or simulation.

Alternatively you could be a Bayesian to overcome the estimation problem.
You can have a look at "Regression analysis of count data" by Cameron &
Trivedi

Hope this is of some help to you ...

Best wishes
Dimitrious, L.

[P.B. OK - the finite mixture stuff may be a goer - and the Bayesian stuff
- who knows?! I just want a classifier that will fit incoming new data with
some robustness. It would appear that answers #2, #6, and #8 provide the
most direct solutions.]
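To make the finite-mixture suggestion concrete, here is a minimal EM sketch
for a two-component, unit-variance Gaussian mixture on synthetic 1-D data.
This is my illustration only; a real analysis would build the likelihood
from the actual site data rather than this toy setup:

```python
import math
import random

random.seed(3)
# toy 1-D measurements from two latent site populations
# ("fish absent" centred at 0, "fish present" centred at 3)
data = ([random.gauss(0.0, 1.0) for _ in range(300)] +
        [random.gauss(3.0, 1.0) for _ in range(100)])

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# EM for a two-component mixture with known unit variances
pi, mu0, mu1 = 0.5, -1.0, 1.0          # crude starting values
for _ in range(50):
    # E-step: responsibility of component 1 ("present") for each point
    r = [pi * norm_pdf(x, mu1, 1.0) /
         (pi * norm_pdf(x, mu1, 1.0) + (1 - pi) * norm_pdf(x, mu0, 1.0))
         for x in data]
    # M-step: update the mixing weight and the two component means
    pi  = sum(r) / len(data)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu0 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - sum(r))

print(round(pi, 2), round(mu0, 2), round(mu1, 2))
```

The recovered mixing weight approximates the rare-class proportion, which
is the attraction of the mixture view: the base rate is estimated rather
than distorted by the sampling scheme.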


So - I'm now setting up a small empirical study examining the utility of
simply using the base rate as the classifier threshold on the unequal-count
data, using ROC to locate a possible optimum threshold, vs 50/50
class-distribution selective-sample bootstrapping a la Weiss and Provost,
using CART and logistic methods, on some simulated and real datasets.

Many, many thanks to everybody who replied - and I hope the above is
helpful to those who were curious because they have the same kind of
problem as I do!

Regards ... Paul
_____________________________________________________________________
Paul Barrett
Adjunct Professor of Psychometrics and Performance Measurement
University of Auckland

email: [log in to unmask]             DDI: +64-(0)9-238-6336
       [log in to unmask]                    Fax: +64-(0)9-353-1681
Web:   www.pbarrett.net                        Mobile: +64-021-415625
