Hello,
I will cast my late penny into this topic. I believe three questions are
being asked:
1) What threshold should I use (0.05, 0.01, etc.)?
2) Should I correct for multiple comparisons or not?
3) If yes, which correction should I apply (FWER, FDR) and why?
I think these are important questions which are quite difficult to answer
in a clear manner while avoiding the usual statistical mumbo-jumbo. Given
the very high recurrence of this and related questions (p-values corrected
vs. uncorrected, FDR, SVC, multiple comparisons and Bayes, etc.) on this
list and in my mailbox, I have deposited my understanding of the topic in a
paper that was submitted recently and can be found on my home page at
http://www.neurogenetics.net/Department.html.
(Title: "On the Logic of Hypothesis Testing in Functional Imaging." It is
directed to practitioners and does not contain maths. All comments and
feedback are welcome.)
As far as the questions go, here are some short answers.
Question 1:
There is no agreement on what threshold should be selected. Historically,
the first value used was 0.01, by Laplace in a study of how the moon
affected barometric pressure. Sir Ronald Fisher went on to suggest 0.05 (in
the famous "Lady tasting tea" experiment), although in his books you may
find 0.1 as well. As a further example, in "Statistics for Experimenters"
the authors (Box, Hunter and Hunter) say "one begins to be slightly
suspicious of a discrepancy at the 0.20 level".
Sorry.
Question 2:
Formally, multiple comparison procedures (MCPs) are meant to test the
Global Null Hypothesis (no signal in the brain, or in the area you are
testing) with some localization power, because they control the type I
error voxel-wise. It makes sense to apply them if the Global Null
Hypothesis is a reasonable alternative. However, in most cases the question
asked is not about the presence or absence of signal in the brain, but
about its size and location. This is a harder question to answer, and the
reason is that all that statistics can do is test the compatibility of the
data with a certain model. Change the model and the answer is different
(this is called the No Free Lunch Theorem).
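To make the voxel-wise issue concrete, here is a minimal sketch (Python
with NumPy; the voxel count and the threshold are hypothetical numbers of
my own choosing, not from the paper) of what an uncorrected threshold does
under the Global Null:

import numpy as np

rng = np.random.default_rng(0)
n_voxels = 50000   # hypothetical size of a brain mask
alpha = 0.05

# Under the Global Null there is no signal anywhere, so p-values are
# uniformly distributed on [0, 1].
p = rng.uniform(size=n_voxels)

print("uncorrected 'activations' from pure noise:", np.sum(p < alpha))
# Expect about n_voxels * alpha = 2500 false positives.

# Bonferroni controls the family-wise error rate across all voxels:
print("after Bonferroni correction:", np.sum(p < alpha / n_voxels))

Pure noise produces on the order of 2,500 "active" voxels at an
uncorrected 0.05; this is the problem that MCPs are built to address.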
A way around the problem is to put aside the Null Hypothesis viewpoint and
look at thresholding in a different way. By changing thresholds, what you
actually do is penalize your statistics according to some a priori
assumptions. One can show that MCPs are the right penalizations if you
expect sparse signal with a high signal-to-noise ratio. If these are your
expectations, you may want higher degrees of freedom to feel more
comfortable about the quality of the results. However, this may not always
be appropriate. If your expectations are different, the best way is to set
up a Monte-Carlo simulation and see which penalty (threshold) works best.
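For illustration, a minimal Monte-Carlo sketch along these lines (Python
with NumPy; the sparsity, SNR and candidate z-cutoffs are assumptions made
for the example, not recommendations):

import numpy as np

rng = np.random.default_rng(1)
n_voxels = 10000
n_active = 100                 # assumed sparse signal: 1% of voxels
snr = 3.0                      # assumed effect size, in noise-SD units
n_sims = 200
z_cutoffs = [3.09, 3.72, 4.26] # roughly p = 1e-3, 1e-4, 1e-5 one-sided

for z_thr in z_cutoffs:
    hits = false_pos = 0.0
    for _ in range(n_sims):
        z = rng.standard_normal(n_voxels)
        z[:n_active] += snr    # inject signal into the "active" voxels
        detected = z > z_thr
        hits += detected[:n_active].sum()
        false_pos += detected[n_active:].sum()
    print("z > %.2f: mean hits %.1f/%d, mean false positives %.1f"
          % (z_thr, hits / n_sims, n_active, false_pos / n_sims))

Raising the cutoff trades sensitivity for fewer false positives; the
simulation makes that trade-off explicit for whatever sparsity and SNR you
actually expect.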
It is, however, required that all this is done before you start testing, at
the experimental design stage. Indeed, at least in the UK, Ethics
Committees will require you to do so, and they have good reasons to ask.
The alternative is not to use any fixed penalty, but to use p-values
individually "à la Fisher" and juggle your way through the map, using
whatever knowledge you have, plus clues from the map, to state whether the
effect you see is real, probable, unlikely or totally false. This can be
justified because the MCPs' basic assumption is that all voxels are equal
events with the same (null) distribution. If one has information that
allows differentiation, then MCPs should not be applied, because that
information would constrain the probability to that voxel alone.
Most people on this list would argue that this approach is too subjective
and prone to concocting explanations. However, as explained above, there is
no "objective" way of analyzing a map, because everything depends on the
assumptions. Therefore I sometimes prefer to see a good argument instead of
some funny arrangement of thresholds at different levels (voxel, cluster,
etc.), SVC and so forth.
Question 3: FDR or FWER?
FDR and FWER can be compared within the hypothesis-testing framework. As
explained above, FWER protection implies very conservative corrections
because of its expectations about the signal. However, if a score passes
this threshold, then one can be pretty sure that the null hypothesis is
false. FDR is different: if one selects 0.05, then on average 5% or less of
the reported results may be false positives. This is a problem because one
is then not able to distinguish which results in the significant set are
true findings and which are not. If one has information that allows this
discrimination, then FDR and FWER should not be used in the first place.
FDR is therefore an exploratory tool. One of FDR's most telling
applications is in genomics: one can scan the genetic expression of the
entire human genome using microarrays, and FDR can then be used to select a
smaller set of genes to be checked with more accurate ancillary techniques
(RT-PCR, etc.). Alternatively, the experiment can be repeated, focusing now
on the subset that FDR has selected. Such a two-stage approach may be less
feasible in imaging.
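For concreteness, a minimal sketch contrasting the two corrections on
simulated tests (Python with NumPy/SciPy; the number of tests and the
proportion of true effects are hypothetical):

import numpy as np
from scipy.stats import norm

def benjamini_hochberg(p, q=0.05):
    # Step-up BH procedure: reject all hypotheses with p <= p_(k), where
    # k is the largest index with p_(k) <= (k/m) * q.
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject

rng = np.random.default_rng(2)
m, m_true = 5000, 50                  # hypothetical: 1% of tests carry signal
z = rng.standard_normal(m)
z[:m_true] += 4.0                     # assumed strong effects
pvals = norm.sf(z)                    # one-sided p-values

bonf = pvals < 0.05 / m               # FWER control (Bonferroni)
bh = benjamini_hochberg(pvals, 0.05)  # FDR control (Benjamini-Hochberg)
print("Bonferroni discoveries:", bonf.sum())
print("BH discoveries:", bh.sum(), "of which false:", bh[m_true:].sum())

With settings like these, Bonferroni keeps only the strongest scores, while
BH recovers more of the true effects at the cost of a few admixed false
positives: exactly the trade-off described above.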
Regards
Federico
***********************************
Federico E. Turkheimer, PhD
BioInformatics Group (Head)
Neuropathology Dept., Imperial College London
Charing Cross Campus, Fulham Palace Road
London, W6 8RF, UK
Tel: +44 208 846 1174
Fax: +44 208 846 7794
Email: [log in to unmask]
URL: http://www.neurogenetics.net/Department.html