Dear Vladimir,

Oh!  I hadn't realized that the situation for M/EEG was as you described
it.  I clearly see your point(s) now and believe we're in perfect agreement.

Thanks for pointing this out,

Sherif

On Sat, Oct 30, 2010 at 7:44 PM, Vladimir Litvak <[log in to unmask]> wrote:

>  Dear Sherif,
>
> You are making interesting points, and I would say that this kind of issue
> is the reason papers are reviewed by your peers and not by computer
> programs that check that p<0.05 for all your tests.
>
> I've never said that all statistical images should be FWE corrected but I
> think that the key tests that pertain to your main hypothesis or the main
> novel finding should be (not necessarily for the whole brain). But if for
> instance you just want to show that you reproduced some well known results
> from the literature then I'd be happy with an uncorrected test. Also if you
> have done a corrected test at the group level and then want to demonstrate
> that the effect is there in every subject, I also wouldn't mind. If you
> think about it just from a probabilistic point of view, reproducing an
> effect at a conservative threshold in each and every subject is unlikely.
>
> Also, as was mentioned before, there are some cases where false negatives
> have a high cost, and then sensitivity should be increased at the expense
> of specificity.
>
> The technical point I was trying to make is that for p<0.001 uncorrected
> you don't really know your false positive rate. In some cases, when the data
> are very smooth, it might actually be more conservative than a FWE-corrected
> test at p<0.05, and in other cases unacceptably lenient. So the question I
> would ask is why there is a tradition of doing p<0.001 uncorrected rather
> than, let's say, p<0.2 FWE-corrected. The way that I usually interpret a
> p<0.05 uncorrected image is that I can't say much about what is there, but
> for things which are not there, there is not much evidence in the data. But
> this interpretation is not founded mathematically because there is no way to
> accept the null hypothesis.
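>
> (A quick way to convince yourself of this is to simulate pure-noise images
> at different smoothness levels and look at what p<0.001 uncorrected gives
> you. The sketch below is plain Python with numpy/scipy, nothing to do with
> SPM's own code, and the volume size, smoothing levels and simulation count
> are arbitrary.)
>
>   import numpy as np
>   from scipy.ndimage import gaussian_filter, label
>   from scipy.stats import norm
>
>   rng = np.random.default_rng(0)
>   shape = (40, 48, 40)            # an arbitrary volume of pure noise
>   z = norm.isf(0.001)             # voxelwise p < 0.001, one-sided
>   n_sim = 100
>
>   for fwhm in (0, 2, 4, 8):       # smoothing kernel FWHM in voxels
>       sigma = fwhm / 2.3548       # FWHM -> Gaussian sigma
>       counts = []
>       for _ in range(n_sim):
>           x = rng.standard_normal(shape)
>           if sigma:
>               x = gaussian_filter(x, sigma)
>               x /= x.std()        # rescale back to unit variance
>           _, n = label(x > z)     # connected suprathreshold clusters
>           counts.append(n)
>       print(f"FWHM {fwhm} voxels: {np.mean(counts):.1f} false-positive "
>             "clusters per null image on average")
>
> (The number of spurious blobs you get from identical noise changes by an
> order of magnitude with the smoothing alone, which is exactly the point.)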
>
> Regarding your idea of effect sizes with confidence intervals, that's very
> similar to a PPM (posterior probability map), which is available in SPM but
> not hugely popular as far as I know.
>
> Finally I should note that this discussion drifted away slightly from the
> original questions asked by Sun in the context of ERP analysis. I'm not an
> fMRI expert and I've never done, published or reviewed an fMRI study. So all
> I know about community standards and the like is from what I hear as a
> member of the FIL methods group. But it is my impression that the fMRI
> field is mature enough that it can afford the kind of discussion Sherif
> initiated. In the M/EEG field, which I'm more familiar with, things look
> rather different and we are struggling to convince people that correcting
> for multiple comparisons is the proper thing to do in the first place. It is
> still common practice to look at the data, select the electrode and time
> window with the largest effect and then test for it in SPSS (or, in a more
> sophisticated variant of that, use SPM as an 'exploratory' tool as suggested
> by Sun). This is clearly invalid and wrong, and that's the main point of my
> message. Another common pitfall in the M/EEG community that Sun's question
> exemplified is that people invent their own statistical criteria based on
> the idea that something sounds unlikely, without actually quantifying how
> unlikely it is. For instance, they might say that if a given voxel is in the
> 90th percentile of all voxels for 5 consecutive time frames then it's
> significant, and build a whole theory about cortical networks on this very
> dubious criterion without ever asking themselves what their null hypothesis
> is and how their statistic is distributed under the null. So these are the
> kinds of things we are trying to educate people about at the moment, and it
> is very different from the discussions of fMRI people about when it can be
> OK to deviate from the widely accepted FWE correction.
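>
> (To put a number on that particular example: the sketch below, in plain
> Python/numpy with invented dimensions, simulates pure noise and asks how
> often at least one 'significant' voxel appears under exactly that
> 90th-percentile-for-5-frames rule.)
>
>   import numpy as np
>
>   rng = np.random.default_rng(0)
>   n_vox, n_times, n_sims = 500, 200, 200    # made-up numbers
>   hits = 0
>   for _ in range(n_sims):
>       data = rng.standard_normal((n_vox, n_times))     # pure noise
>       # top 10% of voxels at each time frame = "90th percentile"
>       top = data >= np.quantile(data, 0.9, axis=0, keepdims=True)
>       run = np.zeros(n_vox, dtype=int)
>       for t in range(n_times):
>           # length of the current run of consecutive top-10% frames
>           run = np.where(top[:, t], run + 1, 0)
>           if (run >= 5).any():              # the ad-hoc 'significance' rule
>               hits += 1
>               break
>   print(f"P(at least one 'significant' voxel under the null) = {hits/n_sims:.2f}")
>
> (With these numbers the rule fires somewhere in more than half of the
> pure-noise datasets, i.e. the 'unlikely' event is in fact quite likely once
> you search over all voxels and time points.)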
>
> Best,
>
> Vladimir
>
>
> Sent from my iPad
>
> On 30 Oct 2010, at 23:37, Sherif Karama <[log in to unmask]> wrote:
>
>   Thank you for the reference; interesting read indeed.
>
> The example you provide clearly emphasizes the need for sound judgment
> calls.   I would proceed exactly as you have.  A similar 'statistical
> leniency' is observed in testing new drugs for potential detrimental or even
> lethal effects.  This being said, the danger of playing with thresholds, of
> course, which is alluded to in your reference, would be to have
> each researcher use various thresholds to suit his or her purposes.
> This could quickly lead to having the term 'statistical significance' become
> meaningless.
>
>
> Sherif
>
>
>
> On Sat, Oct 30, 2010 at 6:03 PM, Watson, Christopher <
> [log in to unmask]> wrote:
>
>> In regard to your comment that the 0.05 cutoff is arbitrary, I found this
>> document an interesting read: http://tinyurl.com/334jmyh
>>
>> I think the choice to correct or not depends on what you're doing. For
>> example, when we do a pre-surgical fMRI, we will often send the uncorrected
>> results to the surgeon, as I wouldn't want to risk a region that is involved
>> in the function of interest failing to survive multiple comparison
>> correction. It certainly wouldn't be good for the patient...
>> ________________________________________
>> From: SPM (Statistical Parametric Mapping) [[log in to unmask]] On Behalf
>> Of Sherif Karama [[log in to unmask]]
>> Sent: Saturday, October 30, 2010 12:02 PM
>> To: [log in to unmask]
>> Subject: Re: [SPM] [ERP] Significance level and correction for multiple
>> comparison
>>
>> Dear Vladimir,
>>
>> Thank you for taking the time to respond.  We seem to share a very similar
>> philosophy here and I will add that, to date, I have only published findings
>> using corrected thresholds (whether whole brain-corrected or using small
>> volume corrections).   With this in mind, I would nonetheless want to pursue
>> this interesting and, I believe, worthwhile exchange of points of view a
>> little further if you don't mind.   I have been wanting to discuss this for
>> a long time and hope that this is the proper venue to do so.
>>
>> I'll grant you that, obviously, statistics is a way of decision making
>> under uncertainty, but ultimately its aim is nonetheless, as you yourself
>> point out, to make the decision that leads to the best balance between,
>> say, type I and type II errors. As such, stating that it's "NOT about the
>> truth" (where truth could be defined as 'true negatives' and 'true
>> positives'), while conceptually correct, is stretching it a little as I see
>> it. Anyway, while relevant to the discussion, I don't think we need to let
>> this issue interfere with the points we are each trying to make.
>>
>> In the last few years, I have tended to defend a thesis that echoed very
>> closely your position that using too lenient thresholds would allow for too
>> many false positives in the literature and therefore lead to a large amount
>> of noise, making the building of theories rather difficult.  However, are we
>> not here implicitly saying that type I errors are worse than type II errors?
>> I'm not sure we could defend this easily.
>>
>> Before I go on, I'll emphasize that, as you know, the 0.05 cutoff that is
>> a standard criterion in many fields (not all) is, in the end, arbitrary.
>>
>> This said, I do tend to believe that, in most instances, an uncorrected
>> 0.001 threshold is too lenient and that we should, in the vast majority of
>> cases, be using corrected thresholds.  However, in a hypothetical situation
>> where 20 independent fMRI papers (or perhaps even a good meta-analysis) have
>> looked at a given cognitive or other process using 'appropriately' corrected
>> thresholds and reported, say, 12 regions being systematically activated, I
>> would tend to view these as true positives.  In light of this, if I were to
>> conduct a study and find 15 regions/clusters of activation using an
>> uncorrected 0.001 threshold with 11 of these being essentially the same as
>> the 12 that were systematically reported in the literature, I would be very
>> uncomfortable not to consider them true positives even if they did not
>> survive a whole-brain correction.   This said, I would very likely not
>> consider the remaining 4 regions out of the 15 as true positives if they did
>> not survive a whole-brain correction and would therefore be using priors in
>> my decision process. Now, I'll restate that I believe that in most instances
>> we should be using corrected thresholds but in the end, I'll contend that it
>> comes down to a judgment call made on a case by case basis that cannot
>> easily be reduced to what appears to me to be a somewhat Procrustean
>> solution of exclusively using corrected thresholds for all studies.
>>
>> You state that it essentially trickles down to a community standard. From
>> what I can observe, many fMRI papers have been and are being published in HBM,
>> NeuroImage, Brain, and Nature Neuroscience using uncorrected thresholds, so
>> what, exactly, is the community standard?
>>
>> Ultimately, I think we are tripping on an issue of statistical power.  I
>> tend to believe that a rather significant percentage of individual brain
>> imaging studies are underpowered (optimal and powerful designs are, at
>> times, prohibitive due to psychological or other constraints).  Perhaps a
>> solution might be to devise a scheme to report effect size brain maps with
>> confidence intervals (I know this is impractical but I wanted to put it out
>> there).
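>>
>> (Just to make that slightly more concrete: a rough sketch in plain
>> Python/numpy, not anything SPM provides, of a voxelwise effect-size map
>> with an approximate confidence interval; the data array and the
>> normal-approximation interval are placeholders only.)
>>
>>   import numpy as np
>>
>>   rng = np.random.default_rng(0)
>>   n_subj, n_vox = 20, 10000
>>   con = rng.standard_normal((n_subj, n_vox)) + 0.1    # fake contrast values
>>
>>   d = con.mean(axis=0) / con.std(axis=0, ddof=1)      # voxelwise Cohen's d
>>   # crude large-sample standard error of d for a one-sample design
>>   se = np.sqrt(1.0 / n_subj + d**2 / (2 * (n_subj - 1)))
>>   ci_low, ci_high = d - 1.96 * se, d + 1.96 * se      # ~95% CI per voxel
>>   # d, ci_low and ci_high could then be written out as three images
>>   # instead of (or alongside) a thresholded statistical map
>>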
>> I'll admit that the idea of adding another layer of correction, taking into
>> account all the tests implemented in a paper or even the different variants
>> of the attempted analyses, has frequently crossed my mind. However, I can't
>> stop myself from pushing this further and imagining corrections that would
>> take into account all the published papers using similar analyses, with the
>> very likely impact of having nothing surviving... ever ; ).
>>
>> I'll finish with a question which pertains to a current situation I am
>> struggling with.  I have recently conducted a study in order to examine a
>> certain process and used different methods in different runs that aimed at
>> eliciting this process.  My aim is now to use a conjunction-null analysis to
>> look at areas that are commonly activated in each of the, let's say, 3
>> methods/runs.   To me, using a FWE-corrected 0.05 threshold for a
>> conjunction null analysis across all three conditions is much too stringent.
>>  As I have strong a priori hypotheses based on a large number of studies as
>> well as corroborating results from a meta-analysis, I decided to explore the
>> data using an uncorrected 0.001 threshold for the conjunction null (which,
>> by the way, gives me almost identical results to the global conjunction
>> analysis using a FWE-corrected 0.05 threshold).  Now, for simplicity's sake,
>> I felt that presenting results from the individual runs using the same
>> threshold (i.e. uncorrected 0.001) made the most sense, given that using a 0.05 FWE
>> correction for the individual methods and then an uncorrected 0.001
>> threshold for the conjunction null would be confusing as we would observe
>> regions not activated for the individual studies that would nonetheless be
>> observed for the conjunction null. I am considering presenting the
>> uncorrected 0.001 results of the individual runs as trends for those foci
>> that do not survive the FWE-corrected threshold for the a priori determined
>> ROIs, as the vast majority (about 90%) of observed foci fit well with the
>> findings of the meta-analysis, with few findings outside of these a priori
>> ROIs. Obviously, the regions observed outside the a priori ROIs would be
>> identified as such, with the caveat that they are likely false positives.
>> What would you do?
>>
>> Best,
>> Sherif
>>
>>
>>  On Fri, Oct 29, 2010 at 2:04 PM, Vladimir Litvak <
>> [log in to unmask]> wrote:
>> On Fri, Oct 29, 2010 at 1:53 AM, Sherif Karama <[log in to unmask]> wrote:
>>
>> > I agree with almost everything you wrote but I do have a comment.
>> >
>> > In a situation where I am expecting, with a very high degree of
>> > probability, activation of the amygdala (for example) and yet expect
>> > (although with lesser conviction) activations in many regions throughout
>> > the brain, the situation rapidly becomes complex.
>> >
>> > If one is looking only at the amygdala, one would perhaps be justified in
>> > using a small volume correction. But if one is looking at the whole brain
>> > including the amygdala, then it can perhaps be argued that whole-brain
>> > corrections are needed. However, this last correction would not take into
>> > account the increased expectancy of amygdala activation. So an alternative
>> > may be to use modulated/different thresholds, which would likely be viewed
>> > as very inelegant. Although somewhat of a Bayesian approach, here again one
>> > would be faced with quantifying regional expectancy (which can be a very
>> > tricky business). It is for such reasons that I do consider findings from
>> > uncorrected thresholds sometimes meaningful when well justified. Here I am
>> > thinking of 0.001 or something like this, which provides a certain degree
>> > of protection against false positives while also allowing weak but real
>> > signals to emerge. Perhaps it's this kind of thinking that led the SPM
>> > creators to use 0.001 as the default threshold when one presses
>> > 'uncorrected'?
>> >
>> > Is any of this making sense to you?
>> >
>>
>>
>> I understand your problem, but I don't think using uncorrected thresholds
>> is really the solution to it. For the specific example you give, I think
>> doing a small volume correction for the amygdala and then a normal FWE
>> correction for the rest of the brain is a valid and elegant enough solution.
>> If you have varying degrees of prior confidence, that would indeed require a
>> Bayesian approach, but I don't think many people can really quantify their
>> degree of prior belief for different areas, unless it is done with some kind
>> of empirical Bayesian formulation.
>>
>> Statistics is not about the truth but it is a way of decision making
>> under uncertainty. And the optimal way to make such decisions depends
>> on what degree of error of each type we are willing to tolerate. I
>> would argue that although in the short term one is eager to publish a
>> paper with some significant finding, using very liberal thresholds is
>> damaging in the long term. You will eventually have to reconcile your
>> findings with the previous literature which might be very difficult if
>> this literature is full of false positives. Also building any theories
>> is made difficult by the high level of 'noise'. Eventually not being
>> conservative enough can ruin the credibility of the whole field.
>>
>> The problem with uncorrected thresholds is that you can't even
>> immediately quantify your false positive rate because it depends on
>> things like the number of voxels and degree of smoothing. I think the
>> reason the uncorrected option is there is because some people use it
>> for display and for diagnostics. Also there are many ways to define
>> significance and if one was only allowed to see an image after
>> specifying exactly the small volume or the cluster-level threshold
>> it'd make the user interface more complicated.
>>
>> Try adding random regressors to your design and testing for them with an
>> uncorrected threshold to convince yourself that there is a problem there.
>> With that said, it's all a matter of community standards. For instance, a
>> purist would also do a Bonferroni correction between all the tests reported
>> in a paper, or even between all the different variants of the analysis
>> attempted. But I don't know many people who do it ;-)
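>>
>> (A toy version of that check, sketched in plain Python/numpy/scipy outside
>> SPM itself, with all sizes arbitrary: fit a random regressor to pure-noise
>> data and see how many voxels come out 'significant' at p<0.001 uncorrected.)
>>
>>   import numpy as np
>>   from scipy import stats
>>
>>   rng = np.random.default_rng(0)
>>   n_scans, n_vox = 200, 50000
>>   Y = rng.standard_normal((n_scans, n_vox))     # null data, no effect at all
>>   x = rng.standard_normal(n_scans)              # a random regressor of interest
>>   X = np.column_stack([x, np.ones(n_scans)])    # design: regressor + constant
>>
>>   beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
>>   resid = Y - X @ beta
>>   df = n_scans - X.shape[1]
>>   sigma2 = (resid ** 2).sum(axis=0) / df
>>   c = np.array([1.0, 0.0])                      # contrast on the random regressor
>>   var_c = c @ np.linalg.inv(X.T @ X) @ c
>>   t = (c @ beta) / np.sqrt(sigma2 * var_c)
>>   p = 2 * stats.t.sf(np.abs(t), df)
>>
>>   print("voxels 'significant' at p<0.001 uncorrected:",
>>         int((p < 0.001).sum()), "out of", n_vox)   # roughly 50 by chance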
>>
>> Best,
>>
>> Vladimir
>>
>>
>>
>>
>>
>> >
>> > On Thu, Oct 28, 2010 at 5:37 PM, Vladimir Litvak <[log in to unmask]> wrote:
>> >>
>> >> Just to add something to my previous answer: you can look up the size of
>> >> the smallest significant cluster in the 'cluster-level' part of the table,
>> >> then press 'Results' again and use that number as your extent threshold.
>> >> Then you'll get a MIP image with just the significant clusters, which is
>> >> what you want.
>> >>
>> >> Vladimir
>> >>
>> >> On Thu, Oct 28, 2010 at 3:51 PM, Vladimir Litvak <[log in to unmask]> wrote:
>> >> > Dear Sun,
>> >> >
>> >> > On Thu, Oct 28, 2010 at 3:32 PM, Sun Delin <[log in to unmask]> wrote:
>> >> >> Dear Vladimir,
>> >> >>
>> >> >>    Thank you so much for the detailed reply. Could I summarize your
>> >> >> replies as follows?
>> >> >> 1. Try to do correction for multiple comparisons to avoid false
>> >> >> positives.
>> >> >> 2. If there is no hypothesis IN ADVANCE, SPM is better than SPSS
>> >> >> because the former can provide a significance map with both temporal
>> >> >> and spatial information.
>> >> >> 3. Use a small time window of interest for the analysis.
>> >> >
>> >> > This is all correct.
>> >> >
>> >> >
>> >> >> 4. Cluster-level inference is welcome, so a large extent threshold is
>> >> >> good.
>> >> >>
>> >> >
>> >> > You don't need to put any extent threshold to do cluster-level
>> >> > inference. What you should do is present the results uncorrected, let's
>> >> > say at 0.05. Then press 'whole brain' to get the stats table and look
>> >> > under where it says 'cluster-level'. You will see a column with the
>> >> > title 'p FWE-corr' (third column from the left of the table). This is
>> >> > the column you should look at, and if there is something below p = 0.05
>> >> > there you can report it, saying that it was significant FWE-corrected
>> >> > at the cluster level. You can use a higher extent threshold if you get
>> >> > many small clusters that you want to get rid of.
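>> >> >
>> >> > (For intuition only: SPM computes these cluster-level p-values
>> >> > analytically via random field theory, but the idea can be illustrated
>> >> > with a crude Monte-Carlo sketch in plain Python/numpy/scipy, with all
>> >> > numbers invented: compare your observed cluster size against the
>> >> > distribution of the largest cluster found in smooth null data.)
>> >> >
>> >> >   import numpy as np
>> >> >   from scipy.ndimage import gaussian_filter, label
>> >> >   from scipy.stats import norm
>> >> >
>> >> >   rng = np.random.default_rng(0)
>> >> >   shape, sigma = (64, 64), 2.0     # toy 2D image and smoothness
>> >> >   z = norm.isf(0.05)               # cluster-forming threshold, p<0.05
>> >> >
>> >> >   def max_cluster(img):
>> >> >       blobs, n = label(img > z)    # connected suprathreshold clusters
>> >> >       return max(((blobs == i).sum() for i in range(1, n + 1)), default=0)
>> >> >
>> >> >   null_max = []
>> >> >   for _ in range(500):             # largest cluster under the null
>> >> >       x = gaussian_filter(rng.standard_normal(shape), sigma)
>> >> >       null_max.append(max_cluster(x / x.std()))
>> >> >
>> >> >   observed = 120                   # cluster size from real data (made up)
>> >> >   p_cluster_fwe = (np.array(null_max) >= observed).mean()
>> >> >   print("cluster-level FWE-corrected p ~", p_cluster_fwe)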
>> >> >
>> >> >>    However, I would still like to ask more clearly:
>> >> >> 1. If there is no significance left (I am often unlucky to meet such
>> >> >> results) after correction for multiple comparisons (FWE or FDR), could
>> >> >> I use an uncorrected p value (p < 0.05) with a large extent threshold
>> >> >> such as k > 400? Because it seems impossible that more than 400
>> >> >> adjacent voxels are all false positives. If you were the reviewer,
>> >> >> could you accept that result?
>> >> >
>> >> > No. You can't do it like that because, although it is improbable, you
>> >> > can't put a number on how improbable it is. What you should do is look
>> >> > in the stats table as I explained above.
>> >> >
>> >> >> 2. You said that "the absolutely statistically invalid thing to do is
>> >> >> to find an uncorrected effect in SPM and then go and test the same
>> >> >> channel and time window in SPSS." However, I found that if the
>> >> >> uncorrected effect (e.g. p < 0.05 uncorrected, k > 400) appeared at
>> >> >> some sites in SPM, an SPSS analysis involving the same channel and time
>> >> >> window would show a more significant result. Because most ERP
>> >> >> researchers now accept results from SPSS, would it be acceptable to use
>> >> >> SPM as a guide to show the possible significant ROI (temporally and
>> >> >> spatially) and then use SPSS to get the statistical significance?
>> >> >
>> >> > No, that's exactly the thing that is wrong. You can only use SPSS if
>> >> > you have an a priori hypothesis. As I explained, you will get more
>> >> > significant results in SPSS than in SPM because SPSS assumes
>> >> > (incorrectly in your case) that you are only doing a single point test
>> >> > and it doesn't know about all the other points you tried to test in
>> >> > SPM, whereas SPM does know about them and corrects for this.
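>> >> >
>> >> > (A toy demonstration of how badly this inflates significance, in plain
>> >> > Python/numpy/scipy with invented dimensions: pick the channel/time
>> >> > point with the largest group effect in pure noise and then 'confirm'
>> >> > it with a single t-test, which is what the SPM-then-SPSS procedure
>> >> > effectively does.)
>> >> >
>> >> >   import numpy as np
>> >> >   from scipy import stats
>> >> >
>> >> >   rng = np.random.default_rng(0)
>> >> >   n_subj, n_chan, n_time = 16, 64, 100      # made-up ERP-like data
>> >> >   n_sims, false_pos = 500, 0
>> >> >   for _ in range(n_sims):
>> >> >       data = rng.standard_normal((n_subj, n_chan, n_time))  # no effect
>> >> >       # group t-value at every channel/time point
>> >> >       t = data.mean(0) / (data.std(0, ddof=1) / np.sqrt(n_subj))
>> >> >       ch, tp = np.unravel_index(np.abs(t).argmax(), t.shape)
>> >> >       # the invalid step: re-test the selected point on the same data
>> >> >       p = stats.ttest_1samp(data[:, ch, tp], 0).pvalue
>> >> >       false_pos += p < 0.05
>> >> >   print(f"false positive rate of select-then-test: {false_pos/n_sims:.2f}")
>> >> >
>> >> > (With pure noise this comes out close to 1.0 rather than 0.05.)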
>> >> >
>> >> >> 3. If the small time window of interest is more sensitive, could I
>> >> >> use several consecutive small time windows of interest (e.g. 50 ms) to
>> >> >> analyse a long component such as the LPC (I know some researchers use
>> >> >> consecutive time windows to analyse the LPC component in SPSS), or use
>> >> >> them as an exploratory tool to investigate possible significant results
>> >> >> on a dataset without a hypothesis IN ADVANCE?
>> >> >
>> >> > If the windows are consecutive (i.e. there are no gaps between them)
>> >> > then you should just take one long window. If there are gaps you can
>> >> > use a mask image that will mask those gaps out and SPM will
>> >> > automatically account for the multiple windows.
>> >> >
>> >> >> 4. Because of the head shape and some other reasons, the 2D
>> >> >> projection map of each individual's sensors on the scalp is somewhat
>> >> >> different from the standard template provided by SPM. Is it correct to
>> >> >> put each subject's images based on their own 2D sensor map into the
>> >> >> GLM model for specification, or to use images based on the standard 2D
>> >> >> sensor map instead? I have tested both ways and found that the former
>> >> >> method may lead to some stripe-like significance at the border of the
>> >> >> mask. I do not know why.
>> >> >
>> >> > Both ways are possible. You can either mask out the borders if you
>> >> > know there is a problem there or use standard locations for all
>> >> > subjects.
>> >> >
>> >> > Best,
>> >> >
>> >> > Vladimir
>> >> >
>> >> >
>> >> >>
>> >> >>    Sorry for asking some basic questions; however, I really like the
>> >> >> EEG/MEG module of SPM8.
>> >> >>
>> >> >> Bests,
>> >> >> Sun Delin
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>>
>>
>