Dear all,
Thank you for your useful responses.
The texts of the responses are given below.
Pamela McHugh wrote:
Missing Data has provided good material for on-going debates at our
company. I thought I would share with you one of the ascription
techniques we use that has been more widely accepted by our clientele:
First determine key independent variables. For our financial research
we create a grid of age by income by geographic region. For key missing
values we look to the mean (some think we should use the median) value
of that particular variable for their age x income x region cohort.
That is, if I am 25, making $200k and living in Florida (we are all
allowed to dream, aren't we?) but I didn't provide the balance in my
savings account, we would ascribe a balance which represents the mean
of balances reported by other households we interviewed in the
southeast USA region aged 20-30 with incomes of $200-$300k. There are 2
additional criteria: first, the sample of households (hhs) in that cell
that reported the variable we need to ascribe needs to be large enough
(we use a minimum of 15); second, the grid contains only hhs
interviewed during the same time period. As you note, the element of
time allows uncontrolled events to impact the data.
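[Editor's sketch] The cell-mean ascription described above could look
roughly like the following in Python. This is only an illustration under
stated assumptions, not the company's actual procedure: the function name
impute_cell_means, the record layout (one dict per household, None for a
missing value) and the field names age_band, income_band, region and
period are all hypothetical. The minimum cell size of 15 is taken from
the text.

```python
from collections import defaultdict
from statistics import mean

MIN_CELL_SIZE = 15  # minimum number of reporting households per cell (from the text)


def impute_cell_means(records, value_key, cell_keys, min_cell_size=MIN_CELL_SIZE):
    """Return copies of the records with missing value_key set to the cell mean.

    A value counts as missing when it is None. Cells with fewer than
    min_cell_size reported values are left unimputed.
    """
    # Collect the reported values in each cell of the grid.
    reported = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            cell = tuple(rec[k] for k in cell_keys)
            reported[cell].append(rec[value_key])

    # Keep a mean only for cells with enough reporting households.
    cell_means = {
        cell: mean(values)
        for cell, values in reported.items()
        if len(values) >= min_cell_size
    }

    imputed = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input records are not modified
        cell = tuple(rec[k] for k in cell_keys)
        if new_rec[value_key] is None and cell in cell_means:
            new_rec[value_key] = cell_means[cell]
        imputed.append(new_rec)
    return imputed


# Hypothetical data: 15 reporting households and one non-responder,
# all in the same age x income x region cell and the same time period.
cell = {"age_band": "20-30", "income_band": "200-300k",
        "region": "southeast", "period": "wave 1"}
households = [dict(cell, savings=1000.0 * i) for i in range(1, 16)]
households.append(dict(cell, savings=None))

grid = ("age_band", "income_band", "region", "period")
filled = impute_cell_means(households, "savings", grid)
print(filled[-1]["savings"])  # mean of the 15 reported balances
```

Including the interview period among the cell keys is one way to enforce
the "same time period" criterion; swapping mean for statistics.median
would give the median variant mentioned above.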
Hope this practical as opposed to statistical solution provides some
Adrian Mander wrote:
Stata's command impute is quite dangerous! It predicts the missing data
using regression but assumes that the data you observe are the true
population, and hence underestimates the variability of the data.
I do approve of multiple imputation though :) this does produce the
correct standard errors.
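[Editor's sketch] To illustrate why multiple imputation recovers the
variability that a single regression imputation hides, here is a minimal
Python version of the standard combining rules for multiply imputed
analyses (Rubin's rules). It is not taken from the reply above, and the
function name pool_rubin and the example numbers are hypothetical. The
inputs are the point estimate and squared standard error obtained from
each of the m completed data sets.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, one variance per estimate")
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from m = 3 imputed data sets.
q_bar, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
pooled_se = sqrt(total)
```

The between-imputation term is what a single imputation omits: with one
completed data set only the within-imputation variance is seen, so the
standard error comes out too small.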
And in response to a query about hotdeck imputation:
Hotdeck imputation is most efficient if the missing data pattern
does not leave you with an extreme amount of lines
of missing data. It works best when only one variable has missing
data. Additionally it presumes the parameter of
interest in the model command is normally distributed.
Ruth Pickering wrote:
If you used multiple imputation I don't think there is a problem using
future values to predict missing data, so I guess it's probably OK for
single imputation as well. Have you looked into multiple imputation?
Terence Iles wrote:
I think your imputation process may lead to misinterpretation. Both
factor analysis and Cox regression analyse the correlation structure
of the data. By imputing using multiple regression you have clearly
enhanced the inter-correlations. This will be even worse after the
second visit. As for the use of the imputation procedure for missing
insulin values, where clearly things might be different if measurement
were possible, your procedure is open to question.
I have two suggestions. One is to 'jitter' the imputed values (e.g. by
adding a random normal deviate with zero mean and an appropriate SD
from the multiple regression). This does not entirely get round the
enhanced correlation question, but it helps. The second is to compare
analyses from the complete data set with the set including the
missing values.
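[Editor's sketch] The 'jitter' suggestion can be illustrated in Python
as follows. This is a simplified sketch, assuming a single predictor
rather than a full multiple regression, and taking the residual standard
deviation of the fitted line as the "appropriate SD"; the function names
fit_line and impute_with_jitter are hypothetical. Missing values are
represented by None.

```python
import random
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    residual_sd = sqrt(rss / (n - 2))  # n - 2 residual degrees of freedom
    return a, b, residual_sd


def impute_with_jitter(x, y, rng):
    """Fill None entries of y with the regression prediction plus normal noise."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, residual_sd = fit_line([p[0] for p in observed],
                                 [p[1] for p in observed])
    return [
        yi if yi is not None else a + b * xi + rng.gauss(0.0, residual_sd)
        for xi, yi in zip(x, y)
    ]


# Hypothetical data: two subjects with the same x have y missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 3.5, 3.5]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, None, None]
completed = impute_with_jitter(x, y, random.Random(0))
```

Without the noise term the two subjects with x = 3.5 would receive
identical imputed values lying exactly on the regression line, which is
the source of the enhanced inter-correlation described above.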
It will be interesting to see what the rest of the list thinks.
The original query:
I am reanalysing a data set with 750 subjects and 15 variables, 2 of them
categorical (variables include age, smoking, alcohol, exercise,
cholesterol, triglycerides, and other metabolic markers). About 25% of
subjects are missing up to 6 of these variables at baseline. Sometimes
data are missing by design; for example, 10% of subjects did not have insulin
measured because of an incompatible test. Additionally I have data on 500
subjects at a second follow-up visit 2 years later. Subjects are then
followed up to the onset of coronary heart disease.
The data is observational, in that there was no intended
intervention other than being screened every 2 years. There may well have
been doctor's advice to lose weight, drink less or see a specialist, but the
advice given is not very well recorded.
The original analysis was to estimate missing values from the
remainder of the baseline variables using Stata's "impute" command
("impute" predicts the missing values using multiple regression for data
with different patterns of missing data),
then on the imputed data perform a factor analysis on some of the
variables to extract 2 factors, and then do Cox regression on these 2
factors and the remainder of the variables.
However I have since thought that one could use the data from the
second visit for the estimation of missing values, either by direct
substitution and adjusting the survival/censoring time, or by estimating
the baseline missing values using both the follow-up values and the
non-missing baseline values in "impute". However, after two years lifestyle
changes may have affected the levels of the variables.
* If one has follow-up data is it allowable to estimate baseline data
from the follow-up data in addition to other baseline data?
* Should one estimate the missing insulin?
I also wonder if it would be better to estimate the factors rather
than the missing components. This is suggested in the Stata manual.