Okay, the "neuropsychological test battery" was a really bad example. If the battery is large enough and you correct for multiple comparisons, you are very unlikely to detect any true effects. So I would not have corrected at the F-test level (I would have reported both corrected and uncorrected p-values).
Anyway, I had thought that I would have to take the number of post-hoc tests into account as well, but that seems to be wrong.
Back to the fMRI data. Imagine a purely within-subject 3x3 ANOVA, which should be reasonable nowadays, e.g. with the factors "face" (happy, sad, fearful) and "sex" (male, female, morph). Maybe I have specific hypotheses, but maybe I do not (at least for some levels, e.g. concerning "morph"). In the latter case, I would run F-tests for "face", "sex" and the interaction. Imagine I get some clusters surpassing an otherwise defined cluster-size threshold (in voxels). What should I do then?
Or should I run lots of t-tests right from the beginning? I would already have to conduct 12 one-sided tests for "face" and "sex" alone (3 pairs of levels x 2 directions x 2 factors). And to ensure that the results make sense, I would have to check all the interactions as well.
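To make the count of 12 concrete, here is a minimal Python sketch that just enumerates the pairwise one-sided comparisons for the two factors. It is only an illustration, not SPM syntax: the function name `one_sided_contrasts` is made up, and the contrast weights are written over the three levels of one factor, which assumes each comparison is collapsed over the other factor.

```python
from itertools import permutations

# The two within-subject factors from the example design.
factors = {
    "face": ["happy", "sad", "fearful"],
    "sex": ["male", "female", "morph"],
}

def one_sided_contrasts(levels):
    """Return one weight vector per ordered pair (a > b) of levels.

    Hypothetical helper: weights are +1 for a, -1 for b, 0 otherwise,
    defined over the levels of a single factor only.
    """
    contrasts = {}
    for a, b in permutations(levels, 2):
        weights = [1 if lev == a else -1 if lev == b else 0 for lev in levels]
        contrasts[f"{a} > {b}"] = weights
    return contrasts

all_contrasts = {name: one_sided_contrasts(levels)
                 for name, levels in factors.items()}

# 3 levels give 6 ordered pairs per factor, so 12 tests for two factors.
n_tests = sum(len(c) for c in all_contrasts.values())
print(n_tests)
for name, contrasts in all_contrasts.items():
    for label, weights in contrasts.items():
        print(name, label, weights)
```

This only covers the main-effect comparisons; the interaction contrasts would come on top of these 12.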