Dear Vladimir,
 
Oh!  I hadn't realized that the situation for M/EEG was as you described it.  I clearly see your point(s) now and believe we're in perfect agreement.
 
Thanks for pointing this out,
 
Sherif

On Sat, Oct 30, 2010 at 7:44 PM, Vladimir Litvak <[log in to unmask]> wrote:
Dear Sherif,

You are making interesting points, and I would say that these kinds of issues are the reason papers are reviewed by peer researchers and not by computer programs that check that p<0.05 for all your tests.

I've never said that all statistical images should be FWE-corrected, but I think that the key tests that pertain to your main hypothesis or the main novel finding should be (not necessarily for the whole brain). If, for instance, you just want to show that you reproduced some well-known results from the literature, then I'd be happy with an uncorrected test. Similarly, if you have done a corrected test at the group level and then want to demonstrate that the effect is there in every subject, I wouldn't mind either. If you think about it just from the probabilistic point of view, reproducing an effect with a conservative threshold in each and every subject is unlikely.

Also, as was mentioned before, there are some cases where false negatives have a high cost, and in those cases sensitivity should be increased at the expense of specificity.

The technical point I was trying to make is that for p<0.001 uncorrected you don't really know your false positive rate. In some cases, when the data are very smooth, it might actually be more conservative than an FWE-corrected test at p<0.05, and in other cases unacceptably lenient. So the question I would ask is why there is a tradition of doing p<0.001 uncorrected rather than, let's say, p<0.2 FWE-corrected. The way that I usually interpret a p<0.05 uncorrected image is that I can't say much about what is there, but for things which are not there, there is not much evidence in the data. But this interpretation is not founded mathematically because there is no way to accept the null hypothesis.
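To make this concrete, here is a toy Monte Carlo in Python (nothing to do with SPM's own machinery; the 100 x 100 white-noise image, the smoothing levels and the one-sided test are just assumptions of the sketch) that estimates how often at least one voxel of pure noise crosses a voxel-wise p<0.001 threshold at different smoothness levels:

import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.stats import norm

rng = np.random.default_rng(0)
z_crit = norm.isf(0.001)          # voxel-wise p < 0.001, one-sided
n_sim = 500

for fwhm_vox in (0, 2, 6, 12):    # smoothness in voxels
    sigma = fwhm_vox / 2.355      # FWHM -> Gaussian sigma
    fw_errors = 0
    for _ in range(n_sim):
        noise = rng.standard_normal((100, 100))   # pure-noise "image"
        if sigma > 0:
            noise = gaussian_filter(noise, sigma)
            noise /= noise.std()                  # re-standardise to unit variance
        if (noise > z_crit).any():                # any voxel "significant"?
            fw_errors += 1
    print(f"FWHM {fwhm_vox:2d} voxels: family-wise false positive rate ~ {fw_errors / n_sim:.2f}")

The family-wise false positive rate this prints should vary substantially with the FWHM, which is exactly the problem: the uncorrected threshold by itself doesn't tell you what that rate is.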

Regarding your idea of effect sizes with confidence intervals, that's very similar to a PPM (posterior probability map), which is available in SPM but not hugely popular as far as I know.

Finally, I should note that this discussion has drifted slightly away from the original questions asked by Sun in the context of ERP analysis. I'm not an fMRI expert and I've never done, published or reviewed an fMRI study, so all I know about community standards and the like is from what I hear as a member of the FIL methods group. But it is my impression that the fMRI field is mature enough that it can afford the kind of discussion Sherif initiated.

In the M/EEG field, which I'm more familiar with, things look rather different and we are struggling to convince people that correcting for multiple comparisons is the proper thing to do in the first place. It is still common practice to look at the data, select the electrode and time window with the largest effect and then test for it in SPSS (or a more sophisticated variant of that: using SPM as an 'exploratory' tool, as suggested by Sun). This is clearly invalid and wrong, and that's the main point of my message. Another common pitfall in the M/EEG community, which Sun's question exemplified, is that people invent their own statistical criteria based on the idea that something sounds unlikely, without actually quantifying how unlikely it is. For instance, they can say that if a given voxel is in the 90th percentile of all voxels for 5 consecutive time frames then it's significant, and build a whole theory about cortical networks on this very dubious criterion without ever asking themselves what their null hypothesis is and how their statistic is distributed under the null. So these are the kinds of things we are trying to educate people about at the moment, and it is very different from the discussions among fMRI people about when it can be OK to deviate from the widely accepted FWE correction.
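Just to illustrate what I mean by quantifying it, here is a toy simulation in Python (the AR(1) noise model and all the numbers are assumptions of the sketch, not anything taken from a real study) that asks how often pure noise satisfies the "90th percentile for 5 consecutive time frames" rule:

import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_frames, run_len = 2000, 100, 5

def flagged_fraction(rho):
    """Fraction of pure-noise voxels that pass the ad hoc criterion."""
    data = np.empty((n_voxels, n_frames))
    data[:, 0] = rng.standard_normal(n_voxels)
    for t in range(1, n_frames):                   # AR(1) temporal autocorrelation
        data[:, t] = rho * data[:, t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n_voxels)
    top = data >= np.percentile(data, 90, axis=0)  # in the top 10% at each frame
    # any run of `run_len` consecutive frames spent in the top 10%?
    runs = np.ones((n_voxels, n_frames - run_len + 1), dtype=bool)
    for k in range(run_len):
        runs &= top[:, k:k + n_frames - run_len + 1]
    return runs.any(axis=1).mean()

for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}: {flagged_fraction(rho):.3f} of null voxels declared 'significant'")

Depending on how much temporal autocorrelation one assumes, a non-trivial fraction of pure-noise voxels can pass such a criterion, which is why the null distribution has to be worked out (analytically, or by simulation or permutation) rather than guessed.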

Best,

Vladimir    


Sent from my iPad

On 30 Oct 2010, at 23:37, Sherif Karama <[log in to unmask]> wrote:

Thank you for the reference; interesting read indeed.
 
The example you provide clearly emphasizes the need for sound judgment calls.  I would proceed exactly as you have.  A similar 'statistical leniency' is observed in testing new drugs for potentially detrimental or even lethal effects.  This being said, the danger of playing with thresholds, of course, which is alluded to in your reference, would be to have each researcher use various thresholds to suit his or her purposes.  This could quickly lead to the term 'statistical significance' becoming meaningless.
 
 
Sherif



On Sat, Oct 30, 2010 at 6:03 PM, Watson, Christopher <[log in to unmask]> wrote:
In regards to your comment that the 0.05 cutoff is arbitrary, I found this document an interesting read: http://tinyurl.com/334jmyh

I think the choice to correct or not depends on what you're doing. For example, when we do a pre-surgical fMRI, we will often send the uncorrected results to the surgeon, as I wouldn't want to risk a region that is involved in the function of interest failing to survive multiple comparison correction. It certainly wouldn't be good for the patient...
________________________________________
From: SPM (Statistical Parametric Mapping) [[log in to unmask]] On Behalf Of Sherif Karama [[log in to unmask]]
Sent: Saturday, October 30, 2010 12:02 PM
To: [log in to unmask]
Subject: Re: [SPM] [ERP] Significance level and correction for multiple comparison

Dear Vladimir,
Thank you for taking the time to respond.  We seem to share a very similar philosophy here and I will add that, to date, I have only published findings using corrected thresholds (whether whole brain-corrected or using small volume corrections).   With this in mind, I would nonetheless want to pursue this interesting and, I believe, worthwhile exchange of points of view a little further if you don't mind.   I have been wanting to discuss this for a long time and hope that this is the proper venue to do so.
I'll grant you that, obviously, statistics is a way of making decisions under uncertainty, but ultimately its aim is nonetheless, as you yourself point out, to make the decision that leads to the best balance between, say, type I and type II errors.  As such, stating that it's "NOT about the truth" (which could be defined as 'true negatives' and 'true positives'), while conceptually correct, is stretching it a little as I see it.  Anyway, while relevant to the discussion, I don't think we need to let this issue interfere with the points we are each trying to make.
In the last few years, I have tended to defend a thesis that echoed very closely your position that using too lenient thresholds would allow for too many false positives in the literature and therefore lead to a large amount of noise, making the building of theories rather difficult.  However, are we not here implicitly saying that type I errors are worse than type II errors?  I'm not sure we could defend this easily.
Before I go on, I'll emphasize that, as you know, the 0.05 cutoff that is a standard criterion in many fields (not all) is, in the end, an arbitrary cutoff.
This said, I do tend to believe that, in most instances, an uncorrected 0.001 threshold is too lenient and that we should, in the vast majority of cases, be using corrected thresholds.  However, in a hypothetical situation where 20 independent fMRI papers (or perhaps even a good meta-analysis) have looked at a given cognitive or other process using 'appropriately' corrected thresholds and reported, say, 12 regions being systematically activated, I would tend to view these as true positives.  In light of this, if I were to conduct a study and find 15 regions/clusters of activation using an uncorrected 0.001 threshold, with 11 of these being essentially the same as the 12 that were systematically reported in the literature, I would be very uncomfortable not considering them true positives even if they did not survive a whole-brain correction.  That said, I would very likely not consider the remaining 4 regions out of the 15 as true positives if they did not survive a whole-brain correction, and would therefore be using priors in my decision process.  Now, I'll restate that I believe that in most instances we should be using corrected thresholds, but in the end I'll contend that it comes down to a judgment call made on a case-by-case basis that cannot easily be reduced to what appears to me to be a somewhat Procrustean solution of exclusively using corrected thresholds for all studies.
You state that it essentially comes down to a community standard.  From what I can observe, many fMRI papers have been and are being published in HBM, NeuroImage, Brain, and Nature Neuroscience using uncorrected thresholds, so what, exactly, is the community standard?
Ultimately, I think we are tripping on an issue of statistical power.  I tend to believe that a rather significant percentage of individual brain imaging studies are underpowered (optimal and powerful designs are, at times, prohibitive due to psychological or other constraints).  Perhaps a solution might be to devise a scheme to report effect size brain maps with confidence intervals (I know this is impractical but I wanted to put it out there).
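To make this slightly more concrete, here is the kind of thing I have in mind as a toy Python sketch (the array shapes, the Cohen's d measure and the bootstrap are just illustrative assumptions on made-up data, not an existing SPM facility as far as I know):

import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_voxels = 20, 5000
con = rng.standard_normal((n_subjects, n_voxels)) + 0.3   # subject-level contrast values (toy data)

def cohens_d(x):
    return x.mean(axis=0) / x.std(axis=0, ddof=1)

d_map = cohens_d(con)                                     # effect-size estimate per voxel

# percentile bootstrap over subjects for a 95% CI on d at every voxel
n_boot = 2000
boot_d = np.empty((n_boot, n_voxels))
for b in range(n_boot):
    idx = rng.integers(0, n_subjects, n_subjects)         # resample subjects with replacement
    boot_d[b] = cohens_d(con[idx])
ci_low, ci_high = np.percentile(boot_d, [2.5, 97.5], axis=0)

# e.g. report voxels whose whole CI lies above a minimally interesting effect size
print((ci_low > 0.2).sum(), "voxels with d reliably above 0.2")

One could then report maps of the effect size together with its lower and upper bounds, or the set of voxels whose entire interval clears some minimally interesting effect size, rather than a bare p-value.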
I'll admit that the idea of adding another layer of correction that would take into account all the tests implemented in a paper, or all the different variants of the attempted analyses, has frequently crossed my mind.  However, I can't stop myself from pushing this further and imagining corrections that would take into account all the published papers using similar analyses, with the very likely impact of having nothing surviving... ever ; ).
I'll finish with a question which pertains to a current situation I am struggling with.  I have recently conducted a study in order to examine a certain process and used different methods in different runs aimed at eliciting this process.  My aim is now to use a conjunction-null analysis to look at areas that are commonly activated in each of the, let's say, 3 methods/runs.  To me, using an FWE-corrected 0.05 threshold for a conjunction-null analysis across all three conditions is much too stringent.  As I have strong a priori hypotheses based on a large number of studies as well as corroborating results from a meta-analysis, I decided to explore the data using an uncorrected 0.001 threshold for the conjunction null (which, by the way, gives me almost identical results to the global conjunction analysis using an FWE-corrected 0.05 threshold).  Now, for simplicity's sake, I felt that presenting results from the individual runs using the same threshold (i.e. uncorrected 0.001) made the most sense, given that using a 0.05 FWE correction for the individual methods and then an uncorrected 0.001 threshold for the conjunction null would be confusing: we would observe regions not activated in the individual runs that would nonetheless show up in the conjunction null.  I am considering presenting the uncorrected 0.001 results of the individual runs as trends for those foci that do not reach the FWE-corrected threshold within the a priori determined ROIs, as the vast majority (about 90%) of observed foci fit well with the findings of the meta-analysis, with few findings outside of these a priori ROIs.  Obviously, the regions observed outside the a priori ROIs would be identified as such, with the caveat that they are likely false positives.  What would you do?
Best,
Sherif


On Fri, Oct 29, 2010 at 2:04 PM, Vladimir Litvak <[log in to unmask]> wrote:
On Fri, Oct 29, 2010 at 1:53 AM, Sherif Karama <[log in to unmask]> wrote:

> I agree with almost everything you wrote but I do have a comment.
>
> In a situation where I am expecting, with a very high degree of probability,
> activation of the amygdala (for example) and yet expect (although with
> lesser conviction) activations in many regions throughout the brain, the
> situation becomes rapidly complex.
>
> If one is looking only at the amygdala, one would be justified in using a
> small volume correction perhaps.  But if one is looking at the whole brain
> including the amygdala, then it can perhaps be argued that whole brain
> corrections are needed.  However, this last correction would not take into
> account the increased expectancy of amygdala activation.  So an alternative
> may be to use modulated/different thresholds, which would likely be viewed as
> very inelegant.  Although somewhat of a Bayesian approach, here again one
> would be faced with quantifying regional expectancy (which can be
> a very tricky business).  It is for such reasons that I do consider findings
> from uncorrected thresholds sometimes meaningful when well justified.  Here
> I am thinking of 0.001 or something like this, which provides a certain
> degree of protection against false positives but also allows weak but
> real signals to emerge.  Perhaps it's this kind of thinking that has led the SPM
> creators to use a 0.001 threshold as default when one presses on
> uncorrected?
>
> Is any of this making sense to you?
>


I understand your problem but I don't think using uncorrected
thresholds is really the solution to it. For the specific example you
give I think doing small volume correction for the amygdala and then
normal FWE correction for the rest of the brain is a valid and elegant
enough solution.  If you have varying degrees of prior confidence that
would indeed require a Bayesian approach, but I don't think many
people can really quantify their degree of prior belief for different
areas, unless it is done with some kind of empirical Bayesian
formulation.

Statistics is not about the truth but it is a way of decision making
under uncertainty. And the optimal way to make such decisions depends
on what degree of error of each type we are willing to tolerate. I
would argue that although in the short term one is eager to publish a
paper with some significant finding, using very liberal thresholds is
damaging in the long term. You will eventually have to reconcile your
findings with the previous literature which might be very difficult if
this literature is full of false positives. Also building any theories
is made difficult by the high level of 'noise'. Eventually not being
conservative enough can ruin the credibility of the whole field.

The problem with uncorrected thresholds is that you can't even
immediately quantify your false positive rate because it depends on
things like the number of voxels and the degree of smoothing. I think the
reason the uncorrected option is there is that some people use it
for display and for diagnostics. Also there are many ways to define
significance, and if one were only allowed to see an image after
specifying exactly the small volume or the cluster-level threshold,
it'd make the user interface more complicated.

Try adding random regressors to your design and testing for them with
an uncorrected threshold to convince yourself that there is a problem
there. With that said, it's all a matter of community standard. For
instance, a purist would also do a Bonferroni correction between all
the tests reported in a paper or even between all the different
variants of the analysis attempted. But I don't know many people who
do it ;-)
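If you want to try the random-regressor suggestion outside of SPM, here is a minimal sketch of the same idea in Python (the array shapes, the noise model and the plain per-voxel OLS t-test are all just assumptions of the toy example, not SPM's estimation machinery):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_scans, n_voxels = 200, 20_000
Y = rng.standard_normal((n_scans, n_voxels))        # noise-only "voxel" data
x = rng.standard_normal(n_scans)                    # the random regressor
X = np.column_stack([np.ones(n_scans), x])          # intercept + random regressor

beta, rss, *_ = np.linalg.lstsq(X, Y, rcond=None)   # per-voxel OLS fit
dof = n_scans - X.shape[1]
se = np.sqrt(rss / dof * np.linalg.inv(X.T @ X)[1, 1])
t = beta[1] / se                                    # t statistic for the random regressor
p = 2 * stats.t.sf(np.abs(t), dof)                  # two-sided uncorrected p-values

print("voxels with p < 0.001 uncorrected:", (p < 0.001).sum())             # roughly 0.001 * n_voxels
print("any voxel after Bonferroni at 0.05:", (p < 0.05 / n_voxels).any())  # usually False

With a regressor that is pure noise you should still see on the order of 0.001 times the number of voxels declared 'active' at the uncorrected threshold, whereas a corrected threshold keeps the chance of any false positive at roughly the nominal level.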

Best,

Vladimir





>
> On Thu, Oct 28, 2010 at 5:37 PM, Vladimir Litvak <[log in to unmask]>
> wrote:
>>
>> Just to add something to my previous answer, you can look up in the
>> 'cluster-level' part of the table what is the size of the smallest
>> significant cluster and then press 'Results' again and use that number
>> as your extent threshold. Then you'll get a MIP image with just the
>> significant clusters which is what you want.
>>
>> Vladimir
>>
>> On Thu, Oct 28, 2010 at 3:51 PM, Vladimir Litvak
>> <[log in to unmask]> wrote:
>> > Dear Sun,
>> >
>> > On Thu, Oct 28, 2010 at 3:32 PM, Sun Delin <[log in to unmask]> wrote:
>> >> Dear Vladimir,
>> >>
>> >>    Thank you so much for the detailed reply. Could I summarize your
>> >> replies as follows?
>> >> 1. Try to do correction for multiple comparisons to avoid false
>> >> positive.
>> >> 2. If there is no hypothesis IN ADVANCE, SPM is better than SPSS
>> >> because the former can provide a significant map with both temporal and
>> >> spatial information.
>> >> 3. Use small time window of interest to do analysis.
>> >
>> > This is all correct.
>> >
>> >
>> >> 4. Cluster-level inference is welcome, so large extent threshold is
>> >> good.
>> >>
>> >
>> > You don't need to put any extent threshold to do cluster-level
>> > inference. What you should do is present the results uncorrected, let's
>> > say at 0.05. Then press 'whole brain' to get the stats table and look
>> > under where it says 'cluster-level'. You will see a column with title
>> > 'p FWE-corr' (third column from the left of the table). This is the
>> > column you should look at and if there is something below p = 0.05
>> > there you can report it saying that it was significant FWE-corrected
>> > at the cluster level. You can use a higher extent threshold if you get
>> > many small clusters that you want to get rid of.
>> >
>> >>    However, I would still like to ask more clearly
>> >> 1. If there is no significance left (I am often unlucky to meet such
>> >> results) after correction for multiple comparisons (FWE or FDR), could I use
>> >> an uncorrected p value (p < 0.05) with a large extent threshold such as k > 400?
>> >> Because it seems impossible that more than 400 adjacent voxels are all false
>> >> positives. If you were the reviewer, would you accept that result?
>> >
>> > No. You can't do it like that because although it is improbable you
>> > can't put a number on how improbable it is. What you should do is look
>> > in the stats table as I explained above.
>> >
>> >> 2. You said that an "absolutely statistically invalid thing to do is
>> >> to find an uncorrected effect in SPM and then go and
>> >> test the same channel and time window in SPSS." However, I found that
>> >> if the uncorrected effect (e.g. p < 0.05 uncorrected, k > 400) appeared at
>> >> some sites in SPM, an SPSS analysis involving the same channel and time window
>> >> would show a more significant result. Because most ERP researchers now
>> >> accept results from SPSS, would it be acceptable to use SPM as a guide to show the
>> >> possible significant ROIs (temporally and spatially) and use SPSS to get the
>> >> statistical significance?
>> >
>> > No, that's exactly the thing that is wrong. You can only use SPSS if
>> > you have an a priori hypothesis. As I explained, you will get more
>> > significant results in SPSS than in SPM because SPSS assumes
>> > (incorrectly, in your case) that you are only doing a single-point test
>> > and it doesn't know about all the other points you tried to test in
>> > SPM, whereas SPM does know about them and corrects for this.
>> >
>> >> 3. If the small time window of interest is more sensitive, could I use
>> >> several consecutive small time windows (e.g. 50 ms) of interest to analyse a
>> >> long component such as the LPC (I know some researchers use consecutive time
>> >> windows to analyse the LPC component in SPSS), or as an exploratory tool to
>> >> investigate possible significant results on a dataset without a hypothesis IN
>> >> ADVANCE?
>> >
>> > If the windows are consecutive (i.e. there are no gaps between them)
>> > then you should just take one long window. If there are gaps you can
>> > use a mask image that will mask those gaps out and SPM will
>> > automatically account for the multiple windows.
>> >
>> >> 4. Because of the head shape and some other reasons, the 2D projection
>> >> map of each individual's sensors on the scalp is somewhat different from the standard
>> >> template provided by SPM. Is it correct to put each subject's images based
>> >> on their own 2D sensor map into the GLM model for specification, or to use
>> >> images based on the standard 2D sensor map instead? I have tested both ways
>> >> and found that the former method may lead to some stripe-like significance
>> >> at the border of the mask. I do not know why.
>> >
>> > Both ways are possible. You can either mask out the borders if you
>> > know there is a problem there or use standard locations for all
>> > subjects.
>> >
>> > Best,
>> >
>> > Vladimir
>> >
>> >
>> >>
>> >>    Sorry for asking some basic questions; however, I really like the
>> >> EEG/MEG module of SPM8.
>> >>
>> >> Bests,
>> >> Sun Delin
>> >>
>> >>
>> >
>
>