Thanks to all respondents from SAS-L and ALLSTAT to my query.
As a summary was requested, here it comes. I have also copied
important parts of original mails below.
-------------------------------------------------------------------
Summary (mostly modified snips from original answers)
-------------------------------------------------------------------
1. This is a moderately controversial question.
2. Positions:
- automatic variable selection procedures are evil
- no model-selection method does particularly well
3. What to do?
- careful examination of each variable, in terms of predictive power,
relationship with other covariates, and meaning.
- run a series of logistic regressions on the dependent variable with
subsets of explanatory variables, and then choose a more limited list
of explanatory variables for the final logistic regression based on
the results of those subset regressions.
- explore potential multicollinearity using PROCs VARCLUS, FACTOR and
PRINCOMP, as well as applying the COLLIN, TOLERANCE and VIF options in
PROC REG. (Since multicollinearity is an issue only among the
variables on the "right hand side" of the equation, many researchers
apply PROC REG with these options to explore multicollinearity before
moving on to a logistic regression analysis.)
- make sure that the models represent the state of knowledge of the
underlying processes, and that the resulting formulas have
coefficients which make sense in terms of those processes.
- latent ("latent regression") logistic regression: derive a group of
latent variables from a set of selected principal components having
70-80 per cent communality, and use them as independent variables in
the logistic regression.
-------------------------------------------------------------------
Parts of original mails follow
-------------------------------------------------------------------
---------------------------------------------------
-- Anthony Staines --
---------------------------------------------------
This is a moderately controversial question. My own views follow.
There is fair evidence, to which I think you have just added, that
automatic variable selection procedures are evil. For certain specific
purposes, notably prediction, they have some value, although
overfitting is an ever-present menace. In aetiological work they are
of little value.
I would suggest a careful examination of each of your variables, in
terms of predictive power, relationship with other covariates and
meaning. Some people swear by data reduction methods, but I have no
real experience of these techniques.
---------------------------------------------------
-- Nick Longford, De Montfort U., Leicester --
---------------------------------------------------
I have worked on a similar problem with ordinary regression, and came
to the conclusion that no model-selection method does particularly
well. I will mail the manuscript to you (to appear, subject to a few
changes). Which model to select? Any one you select, by whichever
method, will be unsatisfactory, because:
1. There is no certainty that you've got the right model, and the more
complex the selection process is, the less likely you are to identify
the right (or an appropriate) model.
2. Some bad models are very suitable for certain inferences.
3. You could do much better by selecting a few candidate models. Using
one model for all subsequent inferences is like placing all your eggs
in one basket. And that basket has not been inspected very well.
---------------------------------------------------
-- Joe McCrary --
---------------------------------------------------
I would probably make those selections based on whatever knowledge you
had. As you mention that little is known, I might be more inclined to
run a series of logistic regressions on your dependent variable with
subsets of explanatory variables, and then choose a more limited list
of explanatory variables for the final logistic regression based on
the results of those subset regressions.
---------------------------------------------------
-- N. S. Gandhi Prasad --
---------------------------------------------------
I suggest you try latent regression (logistic) by deriving a group of
latent variables from a set of selected principal components having
70-80 per cent communality, and using them as independent variables in
the logistic regression.
I hope this may give some better results, as it may take care of
multicollinearity and other issues.
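The component-extraction step of this approach can be sketched in plain Python. This is an illustrative sketch, not the respondent's actual method: the data, the `standardize` and `first_pc` helpers, and the power-iteration shortcut are my own; in SAS this step would normally be done with PROC PRINCOMP.

```python
import math

def standardize(col):
    """Center and scale a column to mean 0, variance 1."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def first_pc(columns, iters=200):
    """Leading principal component of the given columns, found by power
    iteration on their correlation matrix.  Returns (loadings, share of
    total variance carried by the component)."""
    cols = [standardize(c) for c in columns]
    p, n = len(cols), len(cols[0])
    corr = [[sum(a[i] * b[i] for i in range(n)) / n for b in cols] for a in cols]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(corr[r][c] * v[c] for c in range(p)) for r in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[r] * sum(corr[r][c] * v[c] for c in range(p)) for r in range(p))
    return v, eigval / p

# Three strongly correlated (made-up) covariates:
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]
x3 = [0.9, 2.1, 2.8, 4.2, 4.9, 6.1]
loadings, share = first_pc([x1, x2, x3])
print(share > 0.8)  # one component carries well over 80% of the variance
```

The component scores (standardized data projected onto the loadings) would then replace the raw covariates as inputs to the logistic regression.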
---------------------------------------------------
-- Andrew H. Karp --
---------------------------------------------------
>The explained variation is comparatively small (Nagelkerke R^2 about
>0.1).
This measure is not a "variance explained" statistic similar to the
R-square or "coefficient of determination" measure we use in a
multiple linear regression model.
PROC LOGISTIC generates two "r-square like" statistics when the
RSQUARE option is included in the MODEL statement. Their computation
is described in the SAS/STAT Version 8 documentation on page 1948. The
first measure (Cox-Snell) is used to assess models where all the
independent variables are continuous, and the second (Nagelkerke) is
used where there are one or more binary independent variables in the
model.
These statistics have at their core the ratio of the likelihood
function of the fitted model to the likelihood function of an
intercept-only model. What they are actually measuring is the
proportional change in the likelihood function of the specified model
vs. no model at all.
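These likelihood-ratio-based measures can be computed directly from the two log-likelihoods. A minimal sketch; the formulas are the standard Cox-Snell and Nagelkerke definitions, but the log-likelihood values below are made up for illustration:

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox-Snell pseudo-R^2: 1 - (L_null / L_model)^(2/n),
    computed from the two log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's version rescales Cox-Snell by its maximum
    attainable value, so it can reach 1."""
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# Hypothetical log-likelihoods for a survey of n = 4310:
print(round(nagelkerke_r2(-2500.0, -2350.0, 4310), 3))  # → 0.098
```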
I don't like these statistics very much, and like them even less
because their names suggest they are analogous to the "variance
explained" measures used in linear models, when they are actually
measuring something else.
There was a very good article in the Feb 2000 issue of The American
Statistician by Scott Menard called "Coefficients of Determination for
Multiple Logistic Regression Models," which may be of use. You might
also want to look at Paul Allison's "Logistic Regression Analysis
Using the SAS System," published via SAS Institute's Books by Users
program; the publication number (worldwide) is 55770. Another good
text, updated for Version 8, is Stokes et al., "Categorical Data
Analysis Using the SAS System," also available from SAS Institute (but
I don't have the pub number handy). Finally, J. Scott Long's book,
"Regression Models for Categorical and Limited Dependent Variables,"
published by Sage, is often useful.
In general, finding the "optimal" subset of predictor or independent
variables in a logistic regression analysis presents the same problems
as in a linear modeling scenario. We need to be concerned about
multicollinearity as well as the substantive relevance of the
variables we choose.
Assuming that at least some of your independent variables are measured
on a continuous scale, you might want to explore potential
multicollinearity using PROCs VARCLUS, FACTOR and PRINCOMP, as well as
applying the COLLIN, TOLERANCE and VIF options in PROC REG. (Since
multicollinearity is an issue only among the variables on the "right
hand side" of the equation, many researchers apply PROC REG with these
options to explore multicollinearity before moving on to a logistic
regression analysis.)
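For readers outside SAS, the VIF diagnostic mentioned above is easy to reproduce by hand: regress each predictor on all the others and take 1/(1 - R^2). A self-contained sketch on made-up data (the helper functions are my own, not SAS output):

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, xcols):
    """R^2 from an OLS regression of y on xcols plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in xcols] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    ybar = sum(y) / n
    ss_res = sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns, j):
    """Variance inflation factor of predictor j: 1 / (1 - R_j^2), where
    R_j^2 comes from regressing predictor j on all the others."""
    others = [c for k, c in enumerate(columns) if k != j]
    return 1.0 / (1.0 - r_squared(columns[j], others))

# Made-up predictors: x2 is nearly 2*x1, so x1 and x2 are collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.1, 5.8, 8.1, 9.9]
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]
print(vif([x1, x2, x3], 0) > 10.0)  # collinear predictor: large VIF
print(vif([x1, x2, x3], 2) < 5.0)   # unrelated predictor: modest VIF
```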
I hope this brief reply is of assistance.
---------------------------------------------------
-- David Edgerton --
---------------------------------------------------
I've worked with a similar problem - see for example the following
working paper:
http://www.nek.lu.se/nekded/Research/Publications/cutbacks.pdf
I haven't got anything more recent written down, but in my present
research I have been using stepwise procedures, using LM tests on the
forward parts and Wald tests on the backward parts. This makes
computation much simpler - I was using Limdep rather than SAS, which I
found very clumsy. One reason I was not using SAS was my choice of
stopping rule: I did not use significance tests but the Schwarz
(Bayesian) information criterion (the reason for this is that it is a
consistent method for deciding the lag length of AR models, so we
might hope it has reasonable properties in other situations).
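The BIC stopping rule mentioned here penalizes each extra parameter by log(n), in contrast to AIC's fixed penalty of 2. A minimal sketch; the log-likelihood values are hypothetical:

```python
import math

def bic(loglik, n_params, n_obs):
    """Schwarz's Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def aic(loglik, n_params):
    """Akaike's information criterion: -2 log L + 2 k."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical fits on n = 4310 observations: the larger model improves
# the log-likelihood by 10, but adds 10 parameters at log(4310) ~ 8.4
# apiece, so BIC still prefers the smaller model.
small = bic(-2400.0, 5, 4310)
large = bic(-2390.0, 15, 4310)
print(small < large)  # → True
```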
Notice (i) that my use of dummies means that I also had to test for
equality between parameters, not just that parameters were zero, and
(ii) that I was also considering 2nd-, 3rd- and 4th-order
interactions, not only main effects. There is a reference to Nordberg
(1981) in the working paper, which takes up the properties of variable
selection in logit models using LR tests.
---------------------------------------------------
-- John Hughes --
---------------------------------------------------
You pose a common problem. I'm not surprised that forward and backward
selection produced different models.
A more usual method, and the one I have used myself with PROC LOGISTIC
when presented with a large number of explanatory variables, is to
follow these steps.
1. First, reduce the number of potential explanatory variables to
about ten. To do this, put each of your explanatory variables in a
model by itself with your dependent variable. If a variable is
significant at, say, p=0.1, then you may be able to use it at a later
stage together with any other explanatory variables that are
significant at this level.
2. If you still have a relatively large number of explanatory
variables, perhaps more than ten, there is almost certainly going to
be some correlation between some of them. If two explanatory variables
are highly correlated with each other, then only one of them should go
into your model for final selection. Choosing the one that goes in
should be a clinical rather than a statistical decision. However, see
Collett, D., Modelling Binary Data (Chapman and Hall, 1991) on
potential 'confounding variables'.
3. Now the reduced number of explanatory variables can be fitted in a
backward or forward stepwise fitting procedure. The usual level to
stay or enter for an explanatory variable is p=0.05. This would be
suitable in your case, where there appears to be no shortage of
explanatory variables! If there had been fewer explanatory variables,
p=0.1 may have been an appropriate level just to see 'what was there'.
This does not seem to be necessary with your data.
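Step 1 of the procedure above, univariate screening at p=0.1, can be sketched in Python. This is an illustrative sketch with made-up data, not the respondent's actual code (which would use PROC LOGISTIC); it exploits the fact that with a single binary predictor the logistic fit saturates the 2x2 table, so the likelihood-ratio statistic has a closed form:

```python
import math

CHI2_1DF_P10 = 2.706  # chi-square critical value, 1 df, p = 0.10

def log_lik(successes, total):
    """Bernoulli log-likelihood at the MLE p = successes / total."""
    if successes == 0 or successes == total:
        return 0.0
    p = successes / total
    return successes * math.log(p) + (total - successes) * math.log(1.0 - p)

def lr_statistic(y, x):
    """Likelihood-ratio statistic for a logistic regression of binary y
    on a single binary x, via the saturated 2x2-table log-likelihood."""
    n = len(y)
    n1 = sum(x)                                      # rows with x = 1
    s1 = sum(yi for yi, xi in zip(y, x) if xi == 1)  # successes when x = 1
    ll_model = log_lik(s1, n1) + log_lik(sum(y) - s1, n - n1)
    ll_null = log_lik(sum(y), n)
    return 2.0 * (ll_model - ll_null)

def screen(y, candidates):
    """Keep candidate variables whose univariate LR statistic clears
    the p = 0.1 threshold."""
    return [name for name, x in candidates.items()
            if lr_statistic(y, x) > CHI2_1DF_P10]

# Made-up data: 'exposure' tracks the outcome closely, 'noise' does not.
y        = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
exposure = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
noise    = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(screen(y, {"exposure": exposure, "noise": noise}))  # → ['exposure']
```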
Almost inevitably, forward and backward selection will produce
different models. This is not necessarily a problem. The different
explanatory variables in each model may be correlated, and the
selection process is choosing one in preference to the other. Further
examination on clinical criteria may help you choose between them. In
fact, if you think one or more are clinically important, SAS will
allow you to force them into your model.
I hope this has been of some help to you. I can recommend the book by
Collett mentioned above.
---------------------------------------------------
-- Nicole Augustin --
---------------------------------------------------
One reason for getting extremely different models in your variable
selection could have to do with the so-called Hauck-Donner phenomenon
(Hauck, W.W., Jr. and A. Donner. Wald's test as applied to hypotheses
in logit analysis. Journal of the American Statistical Association,
1977, 72, pp. 851--853.)
In some software packages the Wald approximation to the log-likelihood
is used in the selection algorithms. For large estimated coefficients
the Wald approximation may underestimate the change in the
log-likelihood ratio, and this may yield non-significant tests. So if
SAS forward and backward selection do not use the same test (Wald
approximation or exact), this could explain why you are getting
different models.
This problem is quite well explained on p. 237 of:
Venables, W.N. and Ripley, B.D. Modern Applied Statistics with S-PLUS.
Springer Verlag, Heidelberg, 1997.
Ripley has also written a function called stepAIC() for S-Plus and R
(R is freely available software) which evaluates an exact likelihood
ratio test in the selection procedure.
---------------------------------------------------
-- David Cassell, CSC --
---------------------------------------------------
One can always find a dozen or so models with similarly-high R-squared
values [or in this case, ratios of likelihood functions]. But that
does not mean that any of these models is the "optimal" model. Let me
say as a statistician that the science is more important than the
stats when building these models. If you cannot use the science to
build and weed out models, then you probably should not be building
models like this, period.
Try something like PROC PLS, which does not do exactly the same thing.
Last week I attended a talk by statistician John van Sickle on a
model-building problem he worked on. He focused primarily on making
sure that the models represented the state of knowledge of the
ecological processes, and that the resulting formulas had coefficients
which made sense in terms of the processes. In other words, he focused
on the scientific side, rather than the statistical side. And it took
him weeks [maybe months] of serious investigation. That is the way
models should be built. The throw-everything-into-the-cauldron-and-stir
approach just asks for bad things to happen.
BTW, if you look up the proceedings of SUGI 25 in the stat section,
there is a paper by E.S. Shtatland et al on the subject of the
R-squared-like measures in PROC LOGISTIC and PROC GENMOD which might
be of interest. It is probably on the SAS website.
---------------------------------------------------
-- Peter L. Flom, PhD --
---------------------------------------------------
... if no one has suggested it yet, you might try looking at Frank
Harrell's Regression Modeling Strategies, published by Springer.
---------------------------------------------------
-- ORIGINAL QUERY: D. Alte --
---------------------------------------------------
We have analysed data from a cross-sectional survey (N=4310) that
includes an examination of dental health. We explored the association
of occlusal factors with craniomandibular dysfunction (the
'dependent': CMD prevalent yes/no). As not much is known about these
associations, we put a large number (~50) of explanatory variables
(sociodemographic + occlusal factors) into a logistic regression model
(using SAS PROC LOGISTIC) and used different variable selection
methods:
- stepwise forward or
- backward.
With backward selection we usually got a higher number (7-8) of
statistically significant variables in the different models than with
forward selection (2-4). The explained variation is comparatively
small (Nagelkerke R^2 about 0.1).
Q: Which selection methods would you propose, and what effects do they
have? How should significance levels for stay or entry (slstay,
slentry) be set?
-----------------------------------------------
Dietrich Alte (Statistician, Dipl.-Stat.)
University of Greifswald - Medical Faculty
Institute of Epidemiology and Social Medicine
Walther-Rathenau-Str. 48, D-17487 Greifswald, Germany
Phone +49 (0) 3834 - 86 77 13, fax +49 (0) 3834 - 86 66 84
Email [log in to unmask]
Institute http://www.medizin.uni-greifswald.de/epidem/
Study http://www.medizin.uni-greifswald.de/epidem/ship.htm
-----------------------------------------------------------------