Arash Rashidian writes:
>I know there are some questions and disagreements about the different ways
>of calculating sample size for a regression model. Taking the example of a
>linear model with k independent variables: some say n=3*k is the minimum
>necessary for good power, and some say k*k*k is enough. The difference is
>huge, and researchers approach the question very differently. Many
>published studies do not even mention sample size calculation.
>
>I've got two questions and your responses are very appreciated:
>
>1) Are you aware of any published material on sample size calculation for
>regression models? Most of what I've seen doesn't go beyond a linear model
>with one independent variable.
>
>2) Another possibility is to break the question down to a single
>coefficient (Bi), the one most important for the research objectives, and
>do a real calculation on that. Do you know of any reference for this, or a
>practical example?
If you plug k=1 into either of the above formulas, you will see pretty
clearly that neither 3*k nor k*k*k is a reasonable sample size formula: they
give n=3 and n=1 observations, respectively.
There are two issues here: stability and power. If the number of independent
variables is large relative to the number of observations, then the model is
unstable and likely to do poorly in a replication study.
In general, you need 10 to 15 observations per independent variable to get
good stability. I don't have a good reference for this in front of me, but
it is a rule commonly cited by statisticians.
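For what it's worth, the competing rules of thumb can be lined up side by
side. A quick sketch (purely illustrative; the 10-15 figure is the rule
cited above, the other two are the formulas from the original question):

```python
# Compare sample-size rules of thumb for a model with k predictors.
def rules_of_thumb(k):
    return {"3*k": 3 * k,
            "k**3": k ** 3,
            "10-15 per predictor": (10 * k, 15 * k)}

for k in (1, 3, 10):
    print(k, rules_of_thumb(k))
```

At k=1 the first two rules collapse to 3 and 1 observations, which is why
neither can be taken seriously as a general formula.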
Power is a much trickier issue. You need to specify the variability of the
residuals as well as the variability of the independent variable(s). If you
have more than one independent variable, then you also need to know something
about how the independent variables are related to one another.
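To make the dependence on those quantities concrete, here is a rough sketch
of power for testing a single slope in simple linear regression, using a
normal approximation. The values of sigma_x (SD of the predictor) and sigma
(residual SD) are made-up numbers you would have to specify in advance,
which is exactly the difficulty:

```python
import math
from statistics import NormalDist

# Approximate power for a two-sided test of a single regression slope.
# Normal approximation; sigma_x and sigma are assumed, not estimated.
def slope_power(beta, sigma_x, sigma, n, alpha=0.05):
    z = NormalDist()
    ncp = abs(beta) * sigma_x * math.sqrt(n) / sigma  # noncentrality
    return z.cdf(ncp - z.inv_cdf(1 - alpha / 2))

print(round(slope_power(beta=0.25, sigma_x=5.0, sigma=3.75, n=100), 2))
```

Notice that power depends on the slope only through the product
beta*sigma_x/sigma, so vague inputs for either SD translate directly into a
vague power estimate.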
There are some ways to simplify. One is to look at R-squared or partial
R-squared. The disadvantage of this approach is that R-squared is unitless,
and you should really specify power in a context that physicians can
understand. That context would almost always involve a measure of clinically
relevant change in the units of the outcome variable. For example, if you
are studying the effect of maternal age on the duration of breastfeeding,
you might want to be able to detect a slope of at least 0.25 weeks of
breastfeeding per year of mother's age. That might translate to an R-squared
value of 10%.
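The translation from a clinically meaningful slope to an R-squared is just
arithmetic once you assume an SD for the predictor and a residual SD. The
numbers below (SD of maternal age = 5 years, residual SD = 3.75 weeks) are
invented to make the example come out to 10%, not figures from any real
study:

```python
# Convert a slope in clinical units into an R-squared value.
# sd_x and sd_resid are assumed planning values.
def r_squared_from_slope(beta, sd_x, sd_resid):
    explained = (beta * sd_x) ** 2     # outcome variance explained by x
    total = explained + sd_resid ** 2  # total outcome variance
    return explained / total

r2 = r_squared_from_slope(beta=0.25,     # weeks of breastfeeding per year
                          sd_x=5.0,      # SD of maternal age (assumed)
                          sd_resid=3.75) # residual SD in weeks (assumed)
print(round(r2, 3))
```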
Sometimes, you can show equivalence between a regression model and a t-test,
especially if some of your independent variables are categorical. Then you
can use the standard formulas for power and sample size. Sometimes, you can
make conservative assessments to simplify the calculations. A lot depends on
the context of the particular problem.
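When the regression reduces to a two-group comparison (a single binary
independent variable), the standard two-sample formula applies. A sketch
using the usual normal approximation (a t-based correction would add a few
subjects at small n):

```python
import math
from statistics import NormalDist

# Sample size per group for a two-sided two-sample comparison,
# given a standardized effect size (difference in SD units).
def n_per_group(effect_size, alpha=0.05, power=0.80):
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 / effect_size ** 2)

print(n_per_group(0.5))  # medium effect, 5% alpha, 80% power
```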
If you are planning your own research study, you should invest in some power
calculation software like nQuery advisor, or pay a professional
statistician. If you are trying to assess sample size in published research,
then it is much easier. Just look at the width of the confidence intervals
for the regression slopes. If the intervals are narrow, that is a pretty
good indicator that the sample size was adequate; wide intervals suggest
that it was not.
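That check can be turned around for planning: the half-width of a 95%
confidence interval for a slope is roughly z * sigma / (sigma_x * sqrt(n)).
A sketch, again with assumed SDs, which you would compare against the
smallest clinically relevant slope:

```python
import math
from statistics import NormalDist

# Approximate half-width of a 95% CI for a regression slope.
# sigma (residual SD) and sigma_x (predictor SD) are assumed values.
def slope_ci_half_width(sigma, sigma_x, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / (sigma_x * math.sqrt(n))

hw = slope_ci_half_width(sigma=3.75, sigma_x=5.0, n=100)
print(round(hw, 3))  # compare against a clinically relevant slope
```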
Sorry that I don't have any good references. I have a few web pages that
talk about power, but none that address the more complex situation of a
regression model.
Steve Simon, [log in to unmask], Standard Disclaimer.
STATS: STeve's Attempt to Teach Statistics. http://www.cmh.edu/stats
Watch for a change in servers. On or around June 2001, this page will
move to http://www.childrens-mercy.org/stats