I am using a case-control approach to investigate geographical risk
factors. The data form a point dataset in which a source and a case
are separated by a distance /d/ (range 3 to 200). Controls are
matched to each case where the distance from the control to the source
(/dc/) is within 0.25 of /d/. This selection procedure has generated
multiple controls for some cases, with the number of controls per case
(/m/) ranging from 1 to 150. Furthermore, /m/ and /d/ are positively
correlated.
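To make the matching rule concrete, here is a minimal Python sketch of it. It assumes that "within 0.25 of /d/" means an absolute difference, |dc - d| <= 0.25; the function name `match_controls` and the toy distances are hypothetical and only for illustration, not my real data.

```python
def match_controls(case_distances, candidate_distances, tolerance=0.25):
    """Map each case index to the candidate controls whose distance to the
    source (dc) lies within `tolerance` of the case's distance (d).

    Assumption: "within 0.25" is read as an absolute difference.
    """
    matches = {}
    for i, d in enumerate(case_distances):
        matches[i] = [
            j for j, dc in enumerate(candidate_distances)
            if abs(dc - d) <= tolerance
        ]
    return matches


# Toy distances, made up for illustration only.
cases = [3.0, 50.0]
candidates = [2.9, 3.2, 10.0, 49.8, 50.1, 50.25]

matched = match_controls(cases, candidates)

# m, the number of controls per case, is the length of each list.
m_per_case = {case: len(controls) for case, controls in matched.items()}
```

In this form /m/ simply falls out as the number of candidates inside the tolerance band around each /d/.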
The intention of these analyses is to use case/control as the outcome
variable with a set of predictor variables (the geographical risk
factors). However, the geographical risk factors are also positively
correlated with /d/. Because both the number of controls and the
values of the predictors increase with /d/, this creates clear
problems for the analysis. Different numbers of controls for each case
further cloud the picture. Various solutions have been suggested, each
with its own problems or concerns:
1. Randomly select up to 5 controls per case and analyse the data using
conditional logistic regression. This would discard valuable control
data and would be heavily dependent on the random selection.
2. Randomly select 1 control per case, analyse the data using
conditional logistic regression, and bootstrap so that the procedure is
repeated many times. My concern is that, because the number of controls
differs between strata, the probability of selecting a particular
control ranges from 1 to 1/150 depending on the value of /m/ for an
individual stratum, and this will introduce its own bias.
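The random selection step in both options can be sketched as follows. This is only an illustration of the sampling and of the selection probabilities that worry me, not of the regression itself; the names `sample_controls` and `selection_probability` and the stratum labels are hypothetical.

```python
import random


def sample_controls(matched, k, rng):
    """Keep at most k randomly chosen controls per case (stratum)."""
    return {
        case: rng.sample(controls, min(k, len(controls)))
        for case, controls in matched.items()
    }


def selection_probability(m, k=1):
    """Probability that one particular control is kept when k controls
    are drawn without replacement from a stratum containing m controls."""
    return min(k, m) / m


# Two made-up strata at the extremes of m (1 and 150 controls).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
matched = {
    "case_a": ["c0"],
    "case_b": [f"c{i}" for i in range(150)],
}

up_to_five = sample_controls(matched, k=5, rng=rng)  # option 1
one_each = sample_controls(matched, k=1, rng=rng)    # option 2, one replicate

p_small = selection_probability(1)    # stratum with m = 1
p_large = selection_probability(150)  # stratum with m = 150
```

With one control drawn per case, a control in a stratum with /m/ = 1 is always selected, whereas one in a stratum with /m/ = 150 is selected with probability 1/150 in any given replicate.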
Any suggestions on this problem would be greatly appreciated.