Dear Allstat,
In brief:
Does anyone have any advice as to the recommended minimum sample
size/event rate for ML log reg when all predictors are
binary/categorical (not continuous)?
Does anyone have experience of using Cytel's LogXact (or another
package) for analysing rare events log reg?
Given the existing cases per cell / events per variable considerations
for ML log reg, what, if any, are the sample size requirements for exact
log reg, given that according to Cytel it can be used for small/sparse
data sets? Just how 'rare' can the 'event' be?
In more detail:
When providing estimates of required sample size for logistic regression,
I find myself having to apply rules of thumb (see below for a brief
summary of these). I would prefer to perform formal sample size
calculations but it's my understanding that (according to Hosmer &
Lemeshow's Applied Log Reg 2nd Edn) currently "...the only sample size
results available are for multivariable models containing continuous
covariates that are assumed to be distributed normal, exponential or
Poisson." Due to the nature of the 'event' in question, or the
constraints of time and funds, our data sets generally contain only
between 200 and 500 patients. I therefore convert any continuous
predictors to categorical ones with 3 or 4 levels because, as far as I'm
aware, including them 'as is' would run the risk of rendering any
goodness-of-fit statistic meaningless in small data sets. It has also
generally been our experience here that ML log reg finds it difficult to
predict 'events' accurately when they form less than around 30 percent
of the overall sample (as is often the case). I have just come across
Cytel's LogXact brochure online (www.cytel.com) and was wondering
whether anyone has experience of using this software. It is apparently
capable of performing exact logistic regression, which Cytel say is
better suited to small or sparse data sets, or to those in which the
event of interest is rare. I am particularly interested to know what the
sample size requirements would be for exact log reg, given the existing
factors to be considered for ML log reg.
Considerations for sample size in ML log reg:
Glantz & Slinker (1990) state "one cannot associate a P value with a
goodness-of-fit statistic in logistic regression when the total sample
size is below about 80 individuals, and much larger sample sizes are
desirable" (their emphasis).
They also point out that goodness-of-fit tests require at least 5 cases
in every cell of the model. So a model with 6 binary predictors would
require 5 x 2 x 2 x 2 x 2 x 2 x 2 = 320 subjects to reasonably assess
goodness-of-fit.
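As a back-of-envelope check, that cell-count rule is easy to compute for any number of categorical predictors (a rough sketch only; the function name is my own, and the factor of 5 and the two-level assumption simply follow the rule as quoted above):

```python
# Minimum sample size implied by a "5 cases per cell" goodness-of-fit rule.
# Each binary predictor doubles the number of cells in the cross-tabulation.

def min_n_for_gof(n_predictors, levels_per_predictor=2, cases_per_cell=5):
    """Return cases_per_cell times the number of cells (covariate patterns)."""
    n_cells = levels_per_predictor ** n_predictors
    return cases_per_cell * n_cells

# 6 binary predictors: 5 * 2**6 = 320 subjects, matching the example above.
print(min_n_for_gof(6))  # 320
```

With 3- or 4-level categorical predictors the cell count, and hence the required n, grows even faster; two 4-level predictors alone already give 16 cells.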
In addition, Peduzzi et al. (1996) recommend that the number of 'events'
per predictor variable should be at least 10 to avoid problems of over-
and under-estimated variances.
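The events-per-variable rule can be sketched in the same way (again only an illustration; the function names are mine, with the threshold of 10 events per variable taken from Peduzzi et al.):

```python
import math

def max_predictors(n, event_rate, epv=10):
    """Largest number of predictors the EPV rule supports.

    'Events' is taken as the rarer outcome class, hence the min().
    """
    n_events = n * min(event_rate, 1 - event_rate)
    return int(n_events // epv)

def required_n(n_predictors, event_rate, epv=10):
    """Sample size needed to fit n_predictors at a given event rate."""
    return math.ceil(epv * n_predictors / min(event_rate, 1 - event_rate))

# With 300 patients and a 10% event rate there are only 30 events,
# enough for 3 predictors; a 6-predictor model would need 600 patients.
print(max_predictors(300, 0.10))  # 3
print(required_n(6, 0.10))        # 600
```

Note how quickly the rule bites at the event rates described above: in a 200-500 patient study with a rare event, only a handful of predictors are supportable.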
There are numerous instances in the literature where neither of these
requirements appears to have been satisfied, which makes me suspect that
the results from many such models may not be stable - and where the
'event' in question is rare it is difficult to test the model on new
data, because enough new data may not exist.
Any advice or comments on this subject would be most appreciated; I will
summarise any responses to the list.
Regards
Liz Hensor
Dr Elizabeth M A Hensor PhD
Data Analyst
Academic Unit of Musculoskeletal and Rehabilitation Medicine
36 Clarendon Road
Leeds
West Yorkshire
LS2 9NZ
Tel: +44 (0) 113 3434944
Fax: +44 (0) 113 2430366
[log in to unmask]