I sent out a query on this topic in April, and have finally had some time
to assimilate the replies I received. The summary below is offered to
anyone still interested.
(1) an overall measure of the explanatory power of the model, such as
proportion of variance explained in linear regression:
The most popular suggestion was (L0-Lp)/L0 or (L0-Lp)/(L0-Ls), where:
L0 is the log likelihood of the model with only an intercept,
Lp is the log likelihood of the model whose explanatory power is
in question, and
Ls is the log likelihood of the saturated model.
If there are no replicate observations, or no variability within
replicates, then Ls=0 and the two expressions are equivalent.
This is discussed in Hosmer and Lemeshow, Applied Logistic Regression
(Wiley, 1989), p. 148. Several variations are possible; for example, the
log likelihood can be replaced by the likelihood-based chi-square,
-2*log likelihood, without changing the ratios.
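As a sketch, the first measure, (L0-Lp)/L0, can be computed directly from
the observed 0/1 outcomes and the fitted probabilities. The function names
below are my own; any logistic fit that returns fitted probabilities will do:

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log likelihood of 0/1 outcomes y given fitted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def likelihood_ratio_r2(y, p_fitted):
    """(L0 - Lp)/L0: L0 is the intercept-only log likelihood (constant
    fitted probability equal to the sample mean of y), Lp the model's."""
    y = np.asarray(y, dtype=float)
    L0 = log_likelihood(y, np.full_like(y, y.mean()))  # intercept-only model
    Lp = log_likelihood(y, np.asarray(p_fitted, dtype=float))
    return (L0 - Lp) / L0
```

Note that (L0-Lp)/L0 equals 1 - Lp/L0, and multiplying both log likelihoods
by -2 leaves the ratio unchanged, as stated above.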
Another suggestion was to use 1 - sum((y_i - y_i-hat)^2)/sum((y_i - y-bar)^2), where:
y_i is the value of the dependent variable for the i-th
observation (0 or 1),
y_i-hat is the probability that y_i=1 (equivalently, E(y_i)) from
the fitted model,
y-bar is the sample mean of the y_i's (equivalently the proportion
of 1's in the sample)
and the summations are over all observations.
This is the same formula as in linear regression, except that y_i-hat is
now the predicted value from a _logistic_ regression.
Equivalently (or almost equivalently?) one could calculate the square of
the correlation of y_i with y_i-hat.
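A minimal sketch of both sum-of-squares suggestions, written with the
familiar linear-regression quantities 1 - SSE/SST and the squared Pearson
correlation (function names are my own):

```python
import numpy as np

def sums_of_squares_r2(y, p_fitted):
    """Linear-regression-style R^2, 1 - SSE/SST, with the fitted logistic
    probabilities playing the role of y-hat."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p_fitted, dtype=float)
    sse = np.sum((y - p) ** 2)          # sum((y_i - y_i-hat)^2)
    sst = np.sum((y - y.mean()) ** 2)   # sum((y_i - y-bar)^2)
    return 1 - sse / sst

def squared_correlation(y, p_fitted):
    """Square of the correlation between y_i and y_i-hat."""
    r = np.corrcoef(np.asarray(y, float), np.asarray(p_fitted, float))[0, 1]
    return r ** 2
```

The two quantities agree exactly in ordinary least squares; with logistic
fitted values they can differ slightly, which is presumably the "almost
equivalently" above.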
A third collection of suggestions involved looking at the success of the
model in predicting y, using the classification table (confusion matrix,
sensitivity and specificity).
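The classification-table approach can be sketched as follows; the 0.5
cutoff is the conventional default, not part of any particular suggestion
I received:

```python
import numpy as np

def classification_summary(y, p_fitted, cutoff=0.5):
    """2x2 classification table (confusion matrix) plus sensitivity and
    specificity at a chosen probability cutoff."""
    y = np.asarray(y)
    pred = (np.asarray(p_fitted) >= cutoff).astype(int)
    tp = int(np.sum((y == 1) & (pred == 1)))  # true positives
    tn = int(np.sum((y == 0) & (pred == 0)))  # true negatives
    fp = int(np.sum((y == 0) & (pred == 1)))  # false positives
    fn = int(np.sum((y == 1) & (pred == 0)))  # false negatives
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "specificity": specificity}
```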
A fourth suggestion was 1-exp(-X/n)=1-(l_0/l_1)^(2/n), where:
X is the likelihood-ratio chi-square, -2*log(l_0/l_1), comparing
the model in question to the intercept-only model,
l_0 is the likelihood of the intercept model (no logs),
l_1 is the likelihood of the model in question.
This is discussed by N.J.D. Nagelkerke, A note on a general definition of
the coefficient of determination, Biometrika (1991), 78(3): 691-2 and by
L. Magee, R^2 measures based on Wald and likelihood ratio joint
significance tests, The American Statistician (1990) 44: 250-3.
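The fourth measure is straightforward to compute from the two log
likelihoods; the sketch below also includes the rescaling proposed by
Nagelkerke, which divides by the measure's maximum attainable value (the
function names are my own):

```python
import math

def cox_snell_r2(L0, L1, n):
    """1 - exp(-X/n) = 1 - (l_0/l_1)^(2/n), where L0 and L1 are the log
    likelihoods of the intercept-only model and the model in question."""
    X = -2.0 * (L0 - L1)  # likelihood-ratio chi-square
    return 1.0 - math.exp(-X / n)

def nagelkerke_r2(L0, L1, n):
    """The same measure rescaled by its maximum, 1 - exp(2*L0/n), so that
    a perfect fit gives 1 (Nagelkerke, Biometrika 1991)."""
    return cox_snell_r2(L0, L1, n) / (1.0 - math.exp(2.0 * L0 / n))
```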
Several persons stated there is no consensus on this topic. Some more
references are listed below.
(2) a way to compare the contributions of two independent variables when
both (and possibly other variables as well) are in the model, such as
incremental R square in linear regression.
Overall, my correspondents seemed less sure that a good solution existed.
Here are two suggestions that were made:
Change in the Hosmer-Lemeshow quantity due to adding a variable to the
model. Of course this depends on what other variables are in the model.
The amount of change in one independent variable needed to produce the
same change in odds as results from a unit change in another independent
variable. This, of course, is not unit-of-measurement free.
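To illustrate the second suggestion: if b1 and b2 are fitted logistic
coefficients, a unit change in x2 multiplies the odds by exp(b2), so the
change in x1 giving the same odds multiplier solves b1*d = b2. The
coefficient values here are hypothetical:

```python
import math

def equivalent_change(b_other, b_target):
    """Change d in the target variable reproducing the odds effect of a
    unit change in the other variable: solve b_target * d = b_other."""
    return b_other / b_target

# Hypothetical coefficients b1 = 0.25, b2 = 0.5: a unit change in x2
# multiplies the odds by exp(0.5), matched by a change of 2.0 in x1.
d = equivalent_change(0.5, 0.25)
```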
References:
Agresti, A. An Introduction to Categorical Data Analysis (Wiley, 1996),
p. 129.
van Houwelingen, J.C. and le Cessie, S. Predictive value of statistical
models. Statistics in Medicine (1990) 9: 1303-25.
Mittlböck, M. and Schemper, M. Explained variance for logistic regression.
Statistics in Medicine (1996) 15: 1987-97.
Agresti, A. Applying R^2-type measures to ordered categorical data.
Technometrics (1986) 28(2): 133ff.
Laitila, T. A pseudo-R^2 measure for limited and qualitative dependent
variables. Journal of Econometrics (1993) 56: 341-356.
Ash, A. and Shwartz, M. R^2: a useful measure of model performance when
predicting a dichotomous outcome. Statistics in Medicine (1999) 13(4):
375-84.
T. Robert Harris [log in to unmask]
Department of Mathematics
University of North Dakota
Grand Forks ND 58202-8376 701-777-2427