ALLSTAT Archives (allstat@JISCMAIL.AC.UK)



Subject: SUMMARY & THANKS - Query: Variable Selection in Logistic Regression
From: Dietrich Alte <[log in to unmask]>
Reply-To: Dietrich Alte <[log in to unmask]>
Date: Wed, 2 Jan 2002 11:39:14 +0100
Content-Type: text/plain
Parts/Attachments: text/plain (393 lines)

Thanks to all respondents from SAS-L and ALLSTAT who replied to my query.

As a summary was requested, here it is. I have also copied important
parts of the original mails below.


-------------------------------------------------------------------
Summary (mostly modified snips from original answers)
-------------------------------------------------------------------
1. This is a moderately controversial question.

2. Positions:
- automatic variable selection procedures are evil
- neither model selection method does particularly well

3. What to do?
- careful examination of each variable, in terms of predictive power,
relationship with other covariates, and meaning.

- run a series of logistic regressions on the dependent variable with
subsets of explanatory variables, and then choose a more limited list
of explanatory variables for the final logistic regression based on
the results of those subset regressions (a minimal sketch follows this
list).

- explore potential multicollinearity using PROCs VARCLUS, FACTOR and
PRINCOMP, as well as applying the COLLIN, TOLERANCE and VIF options in
PROC REG. (Since multicollinearity is an issue only among the
variables on the "right hand side" of the equation, many researchers
apply PROC REG with these options to explore multicollinearity before
moving on to a logistic regression analysis.)

- make sure that the models represent the state of knowledge of the
processes under study, and that the resulting formulas have
coefficients which make sense in terms of those processes.

- latent (logistic) regression: derive a group of latent variables
from a set of selected principal components having 70-80 per cent
communality, and use them as independent variables in the logistic
regression.
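
The single-variable screening version of the subset idea, as a minimal
SAS sketch. The data set SURVEY, the outcome CMD and the predictors
X1-X50 are placeholder names, not taken from the original query:

/* Fit each candidate variable in a logistic model by itself and     */
/* collect the Wald p-values; delete ALL_PE first if re-running.     */
%macro screen(k=50);
  %do i = 1 %to &k;
    ods output ParameterEstimates=pe;
    proc logistic data=survey;
      model cmd(event='1') = x&i;
    run;
    proc append base=all_pe data=pe force;
    run;
  %end;
%mend screen;
%screen()

/* candidates that pass a liberal p < 0.1 screen */
proc print data=all_pe;
  where Variable ne 'Intercept' and ProbChiSq < 0.1;
run;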


-------------------------------------------------------------------
Parts of original mails follow
-------------------------------------------------------------------


---------------------------------------------------
-- Anthony Staines --
---------------------------------------------------
This is a moderately controversial question. My own views follow.

There is fair evidence, to which I think you have just added!, that
automatic variable selection procedures are evil. For certain specific
purposes, notably prediction, they have some value, although
overfitting is an ever-present menace. In aetiological work they are
of little value.

I would suggest a careful examination of each of your variables, in
terms of predictive power, relationship with other covariates and
meaning. Some people swear by data reduction methods, but I have no
real experience of these techniques.



---------------------------------------------------
-- Nick Longford, De Montfort U., Leicester --
---------------------------------------------------
I have worked on a similar problem with ordinary regression, and came
to the conclusion that neither model selection method does
particularly well. I will mail the manuscript to you (to appear,
subject to a few changes). Which model to select? Any one you select,
by whichever method, will be unsatisfactory, because:

1. There is no certainty that you've got the right model, and the more
complex the selection process is, the less likely you are to identify
the right (or an appropriate) model.

2. Some bad models are very suitable for certain inferences.

3. You could do much better by selecting a few candidate models. Using
one model for all subsequent inferences is like placing all your eggs
in one basket. And that basket has not been inspected very well.



---------------------------------------------------
-- Joe McCrary --
---------------------------------------------------
I would probably make those selections based on whatever knowledge you
had. As you mention that little is known, I might be more inclined to
run a series of logistic regressions on your dependent variable with
subsets of explanatory variables, and then choose a more limited list
of explanatory variables for the final logistic regression based on
the results of the logistic regressions of those subsets.



---------------------------------------------------
-- N. S. Gandhi Prasad --
---------------------------------------------------
I suggest you try latent (logistic) regression: derive a group of
latent variables based on a set of selected principal components
having 70-80 per cent communality, and use them as independent
variables for the logistic regression.

I hope this may give somewhat better results, as it may take care of
multicollinearity and other issues.
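
A minimal SAS sketch of this principal-components idea, added for
illustration (SURVEY, CMD, X1-X50 and the choice of 8 components are
placeholders, not from the mail above):

/* extract component scores from the candidate predictors */
proc princomp data=survey out=scores n=8;
  var x1-x50;
run;

/* use the (orthogonal) scores as predictors, side-stepping
   collinearity among the original variables */
proc logistic data=scores;
  model cmd(event='1') = prin1-prin8;
run;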



---------------------------------------------------
-- Andrew H. Karp --
---------------------------------------------------
>The explained variation is comparatively small (Nagelkerke R^2 about
>0.1).

This measure is not a "variance explained" statistic similar to the
R-square or "coefficient of determination" measure we use in a
multiple linear regression model.

PROC LOGISTIC generates two "r-square like" statistics when the
RSQUARE option is included in the MODEL statement. Their computation
is described in the SAS/STAT Version 8 documentation on page 1948. The
first measure (Cox-Snell) is used to assess models where all the
independent variables are continuous, and the second (Nagelkerke) is
used where there are one or more binary independent variables in the
model.
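
Requesting these statistics looks like this (an illustrative sketch
with placeholder names; in the output SAS labels the Cox-Snell value
"R-Square" and the Nagelkerke value "Max-rescaled R-Square"):

proc logistic data=survey;
  /* RSQUARE adds the two generalized R-square measures */
  model cmd(event='1') = x1-x50 / rsquare;
run;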

These statistics have at their core the ratio of the likelihood
function of the fitted model to the likelihood function of an
intercept-only model. What they are actually measuring is the
proportional change in the likelihood function of the specified model
vs. no model at all.

I don't like these statistics very much, and like them even less
because their names suggest they are analogous to the "variance
explained" measures used in linear models, when they are actually
measuring something else.

There was a very good article in the Feb 2000 issue of The American
Statistician by Scott Menard called "Coefficients of Determination for
Multiple Logistic Regression Models," which may be of use. You might
also want to look at Paul Allison's "Logistic Regression Analysis
Using the SAS System," published via SAS Institute's Books by Users
Program. The publication number (worldwide) is 55770. Another good
text, updated for Version 8, is Stokes et al., "Categorical Data
Analysis Using the SAS System," also available from SAS Institute (but
I don't have the pub number handy). Finally, J. Scott Long's book,
"Regression Models for Categorical and Limited Dependent Variables,"
published by Sage, is often useful.

In general, finding the "optimal" subset of predictor or independent
variables in a logistic regression analysis presents the same problems
as in a linear modelling scenario. We need to be concerned about
multicollinearity as well as the substantive relevance of the
variables we choose.

Assuming that at least some of your independent variables are measured
on a continuous scale, you might want to explore potential
multicollinearity using PROCs VARCLUS, FACTOR and PRINCOMP, as well as
applying the COLLIN, TOLERANCE and VIF options in PROC REG. (Since
multicollinearity is an issue only among the variables on the "right
hand side" of the equation, many researchers apply PROC REG with these
options to explore multicollinearity before moving on to a logistic
regression analysis.)
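
A sketch of those diagnostics, added for illustration with placeholder
names; the binary outcome serves only as a dummy response here, since
the diagnostics concern the right-hand side:

/* VIF, tolerance and condition-index diagnostics */
proc reg data=survey;
  model cmd = x1-x50 / vif tol collin;
run; quit;

/* optionally, cluster correlated predictors into groups */
proc varclus data=survey;
  var x1-x50;
run;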

I hope this brief reply is of assistance.



---------------------------------------------------
-- David Edgerton --
---------------------------------------------------
I've worked with a similar problem - see for example the following
working paper:
http://www.nek.lu.se/nekded/Research/Publications/cutbacks.pdf
I haven't got anything more recent written down, but in my present
research I have been using stepwise procedures - LM tests on the
forward parts and Wald tests on the backward parts. This makes
computation much simpler. I was using Limdep, not SAS, which I found
very clumsy. One reason I was not using SAS was my choice of stopping
rule: I did not use significance tests but the Bayes-Schwarz
information criterion (the reason for this is that it is a consistent
method for deciding the lag length of AR models, so we might hope it
has reasonable properties in other situations).

Notice (i) that my use of dummies means that I also had to test for
equality between parameters, not just that parameters were zero, and
(ii) that I was also considering 2nd, 3rd and 4th order interactions,
not only main effects. There is a reference to Nordberg (1981) in the
working paper, which takes up the properties of variable selection in
logit models using LR tests.
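
[An illustrative workaround, not from the mail above: a Schwarz-type
stopping rule can be approximated in SAS by fitting candidate models
and comparing the SC ("Schwarz Criterion") line that PROC LOGISTIC
reports; all names are placeholders, and smaller SC is better.]

ods output FitStatistics=fit_a;
proc logistic data=survey;
  model cmd(event='1') = x1 x2 x3;        /* candidate model A */
run;

ods output FitStatistics=fit_b;
proc logistic data=survey;
  model cmd(event='1') = x1 x2 x3 x4;     /* candidate model B */
run;

/* compare the rows with Criterion='SC',
   column InterceptAndCovariates, across FIT_A and FIT_B */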


---------------------------------------------------
-- John Hughes --
---------------------------------------------------
You pose a common problem. I'm not surprised that forward and backward
selection produced different models.

A more usual method, and the one I have used myself with PROC LOGISTIC
when presented with a large number of explanatory variables, is to
follow these steps.

1. Firstly, reduce the number of potential explanatory variables to
about ten. To do this, put each of your explanatory variables in a
model by itself with your dependent variable. If this variable is
significant at, say, p=0.1, then you may be able to use it at a later
stage with any other explanatory variables that are significant at
this level.

2. If you still have a relatively large number of explanatory
variables, perhaps more than ten, there is almost certainly going to
be some correlation between some of them. If two explanatory variables
are highly correlated with each other, then only one of them should go
into your model for final selection. Choosing the one that goes in
should be a clinical rather than a statistical decision. However, see
Collett D., Modelling Binary Data, Chapman and Hall, 1991, for
potential 'confounding variables'.

3. Now the reduced number of explanatory variables can be fitted in a
backward or forward stepwise fitting procedure. The usual level to
stay or enter for an explanatory variable is p=0.05. This would be
suitable in your case, where there appears to be no shortage of
explanatory variables! If there had been fewer explanatory variables,
p=0.1 may have been an appropriate level, just to see 'what was
there'. This does not seem to be necessary with your data.

Almost inevitably, forward and backward selection will produce
different models. This is not necessarily a problem. The different
explanatory variables in each model may be correlated, and the
selection process is choosing one in preference to the other. Further
examination on clinical criteria may help you choose between them. In
fact, if you think one or more are clinically important, SAS will
allow you to force them into your model (a sketch follows below).

I hope this has been of some help to you. I can recommend the book by
Collett mentioned above.
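
A sketch of step 3 with two clinically important variables forced in,
added for illustration (the data set, outcome and variable list are
placeholders; INCLUDE=2 keeps the first two effects listed in the
MODEL statement in every step of the selection):

proc logistic data=survey;
  model cmd(event='1') = age sex x5 x7 x9 x12
        / selection=stepwise slentry=0.05 slstay=0.05 include=2;
run;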


---------------------------------------------------
-- Nicole Augustin --
---------------------------------------------------
One reason for getting extremely different models in your variable
selection could have to do with the so-called Hauck-Donner phenomenon
(Hauck Jr., W.W. and Donner, A. Wald's test as applied to hypotheses
in logit analysis. Journal of the American Statistical Association,
1977, 72, pp. 851-853). In some software packages the Wald
approximation to the log-likelihood is used in the selection
algorithms. For large estimated coefficients the Wald approximation
may underestimate the change in the log-likelihood ratio, and this may
yield non-significant t-tests. So if SAS forward and backward
selection do not use the same test (Wald approximation or exact), this
could explain why you are getting different models.

This problem is quite well explained on p. 237 of Venables, W.N. and
Ripley, B.D., Modern Applied Statistics with S-PLUS, Springer, 1997.

Ripley has also written a function called stepAIC() for S-Plus and R
(R is public domain software) which evaluates an exact likelihood
ratio test in the selection procedure.



---------------------------------------------------
-- David Cassell, CSC --
---------------------------------------------------
One can always find a dozen or so models with similarly high R-squared
values [or, in this case, ratios of likelihood functions]. But that
does not mean that any of these models is the "optimal" model. Let me
say as a statistician that the science is more important than the
stats when building these models. If you cannot use the science to
build and weed out models, then you probably should not be building
models like this, period.

Try something like PROC PLS, which does not do exactly the same thing.

Last week I attended a talk by statistician John van Sickle on a
model-building problem he worked on. He focused primarily on making
sure that the models represented the state of knowledge of the
ecological processes, and that the resulting formulas had coefficients
which made sense in terms of the processes. In other words, he focused
on the scientific side rather than the stats side. And it took him
weeks [maybe months] of serious investigation. That is the way models
should be built. The throw-everything-into-the-cauldron-and-stir
approach just asks for bad things to happen.

BTW, if you look up the proceedings of SUGI 25 in the stats section,
there is a paper by E.S. Shtatland et al. on the subject of the
R-squared-like measures in PROC LOGISTIC and PROC GENMOD which might
be of interest. It is probably on the SAS website.


---------------------------------------------------
-- Peter L. Flom, PhD --
---------------------------------------------------
... if no one has suggested it yet, you might try looking at Frank
Harrell's "Regression Modeling Strategies," published by Springer.



---------------------------------------------------
-- ORIGINAL QUERY: D. Alte --
---------------------------------------------------

We have analysed data from a cross-sectional survey (N=4310) that
includes an examination of dental health.

We explored the association of occlusal factors with craniomandibular
dysfunction (the 'dependent': CMD prevalent yes/no). As not much is
known about these associations, we put a large number (~50) of
explanatory variables (sociodemographic + occlusal factors) into a
logistic regression model (using SAS PROC LOGISTIC) and used different
variable selection methods:
- stepwise forward, or
- backward.
With backward selection we usually got a higher number (7-8) of
statistically significant variables in the different models than with
forward selection (2-4). The explained variation is comparatively
small (Nagelkerke R^2 about 0.1).

Q: Which selection methods would you propose, and what effects do they
have? How should significance levels for stay or entry (slstay,
slentry) be set?
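
For reference, the two runs described above might look like this in
SAS (SURVEY, CMD and X1-X50 are placeholder names; 0.05 is also the
PROC LOGISTIC default for both SLENTRY and SLSTAY):

proc logistic data=survey;   /* forward selection */
  model cmd(event='1') = x1-x50 / selection=forward slentry=0.05;
run;

proc logistic data=survey;   /* backward elimination */
  model cmd(event='1') = x1-x50 / selection=backward slstay=0.05;
run;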



-----------------------------------------------

 Dietrich Alte (Statistician, Dipl.-Stat.)
 University of Greifswald - Medical Faculty
 Institute of Epidemiology and Social Medicine
 Walther-Rathenau-Str. 48, D-17487 Greifswald, Germany
 Phone +49 (0) 3834 - 86 77 13, fax +49 (0) 3834 - 86 66 84
 Email [log in to unmask]
 Institute http://www.medizin.uni-greifswald.de/epidem/
 Study http://www.medizin.uni-greifswald.de/epidem/ship.htm
-----------------------------------------------------------------
