Dear Sir/Madam,
I have a few technical queries about the FSL Randomise algorithm. I am
a mathematician by training and although statistics is not my main
area of expertise, I know the basics fairly well. I have read the user
manual on your website and studied the FSL statistics summer course
notes, and I have gained some understanding of how Randomise should be
used. Some things are still not clear to me, however, and I would
really appreciate it if you could clarify them for me.
1) Which distribution-free equivalents of parametric tests are
actually available? I apologize for asking such a primitive question,
but even after reading the manual pages this remained unclear to me.
This is how I understand it: Randomise recognizes the design matrices
that correspond to 2-sample unpaired t-tests with and without nuisance
variables, 2-sample paired t-test and repeated measures ANOVA. All the
rest of the design matrices are recognized as not fitting the criteria
for the above tests and are treated in the same way by fitting a
linear regression model to the supplied data and then performing
individual t-tests to identify the factors with coefficients
significantly different from zero. Could you confirm that this is
correct? Also, the outputs of Randomise are the TFCE-corrected p-values
for all contrasts specified in the contrast matrix. What is not clear
to me is how to obtain a goodness-of-fit statistic (R-squared,
perhaps) which would tell me whether it is acceptable to rely on the
corresponding t-test results.
2) The next question I have is whether Randomise distinguishes between
categorical, ordinal and continuous data. If so, I would like to find
out how to construct the design matrices in such a way that a column
with values {1,2,3} would be interpreted and treated as {'drinks tea',
'drinks coffee', 'does not drink caffeinated drinks'} rather than
{'small', 'medium', 'large'}. The type of data is very important for
the choice of statistical test and it would be useful if I got a
confirmation that ordinal data is not treated as categorical, and
discrete is not treated as ordinal. It would be great if you could
give me a set of instructions on how this should be reflected in the
design matrix.
For example, I have the following data set: TBSS skeletons for
patients with motor neuron disease (MND) and healthy controls. I also
have data for the MND patients on where the disease started: hands,
feet or bulbar. Suppose I have 6 patients with MND (2 with each
possible disease initiation type) and 2 controls. Here are the
possible ways to code this data set into a design matrix:
1 0 1
1 0 1
1 0 2
1 0 2
1 0 3
1 0 3
0 1 0
0 1 0
where the first and second columns indicate whether the participant
has MND (1,0) or is a healthy control (0,1). The layout also suggests
that the data are ordinal and that all participants' data are
exchangeable for permutation purposes; this question is addressed in
the next example. The third column indicates the disease initiation
site: 1-hands, 2-legs, 3-bulbar, 0-data unavailable, since healthy
controls do not have MND. My question is: does the zero indicate
missing data, or will it be treated as a further group type? If it is
treated as a group type, this invalidates the model.
Another way to code the data is to specify that the MND and control
groups are not exchangeable for permutation reasons and all
permutations should be done within a group:
1 1
1 1
1 2
1 2
1 3
1 3
2 0
2 0
where 1 stands for MND and 2 stands for control.
Alternatively we could code the "MND vs controls" variable as
categorical: 'a'-MND patient, 'b'-control. Once again it would be
useful if you could confirm that I understand the encoding correctly!
a 1
a 1
a 2
a 2
a 3
a 3
b 0
b 0
The data could also be coded as described in "Two-Sample Paired T-test
(Paired Two-Group Difference)" section of the user manual, i.e.
treating each possible value of disease initiation site as a separate
variable:
a 1 0 0
a 1 0 0
a 0 1 0
a 0 1 0
a 0 0 1
a 0 0 1
b 0 0 0
b 0 0 0
With this layout we no longer have the problem of falsely coding a
disease initiation site for healthy controls. However, please correct
me if I am wrong, we now have the problem of interdependency of the
factors. Multiple linear regression analysis is only available for
variables that are correlated, but not for variables that are strictly
dependent on each other (e.g., the probability of disease initiation in legs is zero
if it initiated in hands). Could you please confirm that Randomise is
not treating the last 3 columns as separate variables and is somehow
combining them into one when it performs the statistical analysis?
Needless to say, when I tried all of these methods I got very
different results.
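To make the interdependency concern concrete, here is a minimal sketch in plain Python (it is not FSL code and says nothing about what Randomise does internally). I am assuming that the 'a'/'b' column would be expanded into two indicator columns, MND and control, as in my first layout; the helper name matrix_rank is my own. The sketch only checks whether the columns are linearly dependent.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Columns (assumed expansion): MND, control, hands, legs, bulbar
design = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

# Same design without the MND column: control, hands, legs, bulbar
reduced = [row[1:] for row in design]

print(len(design[0]), matrix_rank(design))    # number of columns vs rank
print(len(reduced[0]), matrix_rank(reduced))  # number of columns vs rank
```

In this toy layout the five-column matrix has rank 4, because the MND column equals the sum of the three site columns, whereas the four-column version has full column rank. This is the kind of strict dependency I am worried about.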
3) From the way Randomise is set up it follows that the voxels of TBSS
skeletons are treated as response variables, while the columns of the
design matrix are the explanatory variables. For a regression model
with only 1 explanatory variable the statistical significance would be
the same if the response and explanatory variables were swapped.
However, the same is not true for multiple regression models with several
explanatory variables. Now if I wanted to test whether changes in
brain structure affected cognitive ability, I would want to define FA
in a voxel as a predictor variable and cognitive scores as multiple
responses. For that I would like to perform a MANOVA test. Do I
understand correctly that this option is not available in FSL? If I am
wrong, could you please point me to where I could find more
information on this test?
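The single-regressor statement in point 3 can be checked with a short sketch in plain Python. The numbers are made up purely for illustration (they are not from my data), and the function name slope_t is my own. It computes the ordinary least-squares t-statistic for the slope with an intercept in the model, once in each direction.

```python
import math

def slope_t(x, y):
    """t-statistic for the slope b in the least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    # Residual sum of squares around the fitted line.
    rss = sum((yi - my - b * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return b / standard_error

# Hypothetical values, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0]      # stand-in for a cognitive score
fa = [10.0, 14.0, 12.0, 18.0]      # stand-in for FA at one voxel

t_fa_on_scores = slope_t(scores, fa)
t_scores_on_fa = slope_t(fa, scores)
print(t_fa_on_scores, t_scores_on_fa)
```

The two t-statistics agree because, with one regressor and an intercept, both depend only on the correlation between the two variables and the sample size. With several regressors this symmetry no longer holds, which is why I am asking about MANOVA.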
4) I am currently working with a large data set of 639 normal ageing
subjects and am using Randomise to analyze the TBSS data and its
association with cognitive test scores. My first target was to analyze
the associations of white matter integrity and a range of cognitive
tests individually. The way I constructed my design matrices was as
follows: I included only one column containing the scores from a
cognitive test and ran Randomise (without demeaning), and obtained no
significant voxels. A colleague then suggested that I include a vector
of ones in front of the cognitive scores in the design matrix. We both
thought that it would not make any difference, but surprisingly (to
us) it did and I then got a number of significant voxels. However, I
cannot find anywhere what the column of ones actually does and whether
it makes the statistical test more or less valid for my purposes. I
would be extremely grateful if you could provide me with some
guidelines on the role of this column and on the cases in which it
needs to be included.
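For reference, this is how I would expect the two designs to differ in ordinary least squares, shown as a minimal sketch in plain Python. It is not FSL code, the numbers are invented for illustration, and the function names are my own. Without the column of ones the fitted line is forced through the origin; with it, an intercept is estimated as well.

```python
# Hypothetical values, for illustration only (not from my study).
scores = [1.0, 2.0, 3.0, 4.0]      # design column: cognitive score
fa = [10.0, 12.0, 11.0, 13.0]      # data: FA at one voxel

def slope_through_origin(x, y):
    """Least-squares slope for y = b*x (design holds the score column only)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def slope_with_intercept(x, y):
    """Least-squares slope for y = a + b*x (design holds ones and the scores)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

b_origin = slope_through_origin(scores, fa)
b_intercept = slope_with_intercept(scores, fa)
print(b_origin, b_intercept)
```

In this toy case the slope through the origin is about 3.97, while the slope with an intercept is 0.8, so the two designs are clearly testing different things. What I cannot tell is which of them corresponds to the question I actually want to ask.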
5) Is demeaning the data equivalent to adding a column of ones in the
design matrix?
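My own attempt at answering point 5 for ordinary least squares is sketched below in plain Python, again with invented numbers and without any claim about what Randomise does. Model A has a column of ones plus the scores. Model B has only the demeaned scores, and the data themselves are left as they are.

```python
# Hypothetical values, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0]
fa = [10.0, 12.0, 11.0, 13.0]

n = len(scores)
mean_score = sum(scores) / n
mean_fa = sum(fa) / n
centred = [s - mean_score for s in scores]
sxx = sum(c * c for c in centred)

# Model A: ones + scores (the intercept absorbs the mean of the data).
slope_a = sum(c * (f - mean_fa) for c, f in zip(centred, fa)) / sxx
rss_a = sum((f - mean_fa - slope_a * c) ** 2 for c, f in zip(centred, fa))

# Model B: demeaned scores only, no column of ones, data not demeaned.
slope_b = sum(c * f for c, f in zip(centred, fa)) / sxx
rss_b = sum((f - slope_b * c) ** 2 for c, f in zip(centred, fa))

print(slope_a, slope_b)  # slope estimates
print(rss_a, rss_b)      # residual sums of squares
```

Here the slope estimate is the same in both models, but the residual sum of squares in model B is much larger because the mean of the data is left unmodelled, so the t-statistics would differ. If the data were demeaned as well, the residuals of the two models would coincide. This is why I would like to know exactly what the demeaning option in Randomise is applied to.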
6) The user manual indicates that the data needs to be demeaned
whenever we are not testing for the design matrix mean. Does this
apply to categorical and ordinal data? What happens if one demeans a
design matrix that contains both categorical/ordinal and continuous
data? Does it subtract the mean of each column from the corresponding
values, thus potentially making ordinal data continuous (e.g.
{1,1,1,2,2} becoming {-0.4, -0.4, -0.4, 0.6, 0.6})?
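The arithmetic behind the example in point 6 is sketched below in plain Python. The function demean is my own helper and only illustrates column-wise mean subtraction; I do not know whether Randomise does exactly this.

```python
def demean(column):
    """Subtract the column mean from every entry."""
    mean = sum(column) / len(column)
    return [value - mean for value in column]

ordinal = [1, 1, 1, 2, 2]
centred = demean(ordinal)

# Count distinct levels before and after (rounding guards against float noise).
levels_before = len(set(ordinal))
levels_after = len({round(value, 10) for value in centred})

print(centred)
print(levels_before, levels_after)
```

The values are shifted by the mean (1.4), but the column still takes only two distinct values with the same spacing, so it is not obvious to me whether the demeaned column should still be regarded as ordinal.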
Thank you very much for your patience and for taking the time to
consider these questions.
Best wishes,
Ksenia (Kate) Andreyeva.
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.