Dear All,
I'm starting the analysis of a longitudinal dataset (sample size = 686)
in which the raw outcome (Y) appears fairly linear for most individuals
but shows a faint concave shape over time. I'm trying to linearize Y by
raising it to a power (p), gradually increasing p to find the value that
best achieves linearity across all individuals.
For exploratory purposes, I fitted an OLS regression (Y^p) ~ time to each
individual separately, from which I can compute an R-squared. Doing this
for all individuals, I obtain 686 individual R-squared values for every
single value of p. Computing the average R-squared for each value of p
should then give a hint about the overall "goodness of fit" (and hence
linearity) of the transform. Obviously the individual R-squared values
are not normally distributed, but I guess that with a sample size of 686
their averages can reasonably be assumed normally distributed through the
CLT, so a paired t-test comparing the average R-squared at p = 1 (no
transformation) with the maximum average R-squared should be able to
detect whether the difference in "goodness of fit" is significant, and
hence help choose an appropriate value of p.
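The grid search described above can be sketched as follows. Everything
here is a placeholder: the data are simulated (a square-root-shaped
trajectory per individual, so the "true" best power is near 2), and the
variable names, power grid, and number of time points are assumptions,
not the actual dataset.

```python
# Sketch of the grid search described above: for each candidate power p,
# fit OLS (Y^p ~ time) to each individual, compute its R-squared, and
# average across individuals. Data and names are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_times = 686, 6
time = np.arange(n_times, dtype=float)

# Simulated concave trajectories: Y grows like sqrt(time), so Y^2 is
# (approximately) linear in time and the best power should be near 2.
base = rng.uniform(1.0, 3.0, size=(n_individuals, 1))
Y = base * np.sqrt(time + 1.0) + rng.normal(0.0, 0.05, size=(n_individuals, n_times))

def r_squared(y, t):
    """R-squared of a simple linear regression of y on t."""
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

powers = np.arange(1.0, 3.01, 0.05)
mean_r2 = [np.mean([r_squared(Y[i]**p, time) for i in range(n_individuals)])
           for p in powers]

best = float(powers[int(np.argmax(mean_r2))])
print(f"best power: {best:.2f}, mean R^2 there: {max(mean_r2):.4f}")
```

The averaging step is exactly the one described: one R-squared per
individual per power, then a mean over individuals.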
Now a few questions about this:
- Does this methodology make sense at all, or are there flaws/mistakes in
the assumptions, method, or conclusions? The R-squared values are
computed for each individual using the same model, so I guess they can be
assumed IID? They are not exactly normal, but the large sample size
should invoke the CLT and hence justify the use of hypothesis testing.
- The average R-squared on the raw data is 84.4%. The maximum average
R-squared is reached smoothly (concave function) at p = 1.8, but it is
only 84.7% (0.3 percentage points better). This might sound negligible,
but the paired t-test (equality of variances is satisfied) gives a
p-value <<< 0.01, i.e. a significant result. Should I then still
transform the data before going on with the longitudinal analysis? Is the
very low p-value due to the large sample size, so that although the
difference is significant, its magnitude can be neglected?
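For what it's worth, the paired comparison can be run with scipy. The two
R-squared vectors below are simulated stand-ins (beta-distributed values
with a mean near 84.4% and an artificial 0.3-point shift), not real
results; a Wilcoxon signed-rank test is included as a rank-based
alternative that avoids the normality assumption altogether.

```python
# Paired comparison of per-individual R-squared at p = 1 vs the best p.
# r2_raw / r2_best are simulated stand-ins, NOT real results: skewed
# values with mean ~0.844 plus an artificial mean shift of ~0.003 (0.3
# percentage points), mimicking the situation described in the post.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
r2_raw = rng.beta(84.4, 15.6, size=686)                  # mean ~0.844
r2_best = np.clip(r2_raw + 0.003 + 0.01 * rng.standard_normal(686), None, 1.0)

t_stat, p_t = stats.ttest_rel(r2_best, r2_raw)           # paired t-test
w_stat, p_w = stats.wilcoxon(r2_best, r2_raw)            # rank-based alternative

# Effect size: standardized mean of the paired differences. With n = 686
# even a tiny shift is "significant", so this is the number to look at.
diffs = r2_best - r2_raw
d = diffs.mean() / diffs.std(ddof=1)
print(f"t-test p = {p_t:.2e}, Wilcoxon p = {p_w:.2e}, effect size d = {d:.2f}")
```

With n = 686, even a mean difference of ~0.003 yields a tiny p-value; an
effect-size measure on the paired differences is more informative about
practical relevance than the p-value itself.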
- Is there another way to measure the quality of a transform function? I
tried to use Box-Cox but I'm unsure how to apply it correctly in a
longitudinal framework...
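As a rough illustration only: scipy's Box-Cox routine estimates the
lambda that makes the pooled outcome most normal-looking, ignoring the
longitudinal structure (individuals, repeated measures), so it can serve
only as a starting point. The data here are simulated (lognormal, for
which the true lambda is 0, i.e. a log transform).

```python
# Rough sketch: scipy's boxcox picks the lambda maximizing the marginal
# Box-Cox log-likelihood of the pooled outcome. It ignores the
# longitudinal structure, so treat the estimate as a starting point.
# Box-Cox requires a strictly positive outcome (y > 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=1.0, sigma=0.4, size=686)  # simulated outcome

y_transformed, lam = stats.boxcox(y)              # lmbda=None -> MLE of lambda
print(f"estimated marginal Box-Cox lambda: {lam:.2f}")
```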
Thanks for any input or comment!
Kind regards,
Aziz Chaouch
Now I'm trying to determine a "most suitable" transform for the outcome
and thought about trying various powers (from 1 to 3), applying an OLS
regression to each individual in the dataset. This allows computing an
R-squared statistic for each individual. All R-squared values are then
averaged across individuals to form a single "average R-squared" that is
supposed to be representative of the overall fit of a linear model under
the chosen transform. Doing this for various powers, you get a nice,
smooth curve with a clear maximum average R-squared at the power 1.95.
Now the "problem" is that the average R-squared on the raw data is
already 82.9% and the maximum average R-squared using outcome^1.8 is
83.3%. I guess the difference is not worth the transform, but I'm not
sure, and I don't know how to test a difference in R-squared (the
R-squared values are not normally distributed).
--
You may leave the list at any time by sending the command
SIGNOFF allstat
to [log in to unmask], leaving the subject line blank.