Dear Nick,

I think I will start from the end of your mail.

> Ideally, we would want a registration algorithm which explicitly aligns anatomy.  Unfortunately, that is a really hard problem so we depend on heuristics (e.g., dark intensities should correspond to dark intensities) as surrogate information in the form of intensity similarity metrics.  Our contention is that the SSD and Demons metrics are particularly problematic in that they align voxels in a biased (or contaminated) way by explicitly decreasing the average voxelwise variance (which does not necessarily reflect increased anatomical correspondence).   That is why we advocate a data selection strategy in which normalization is driven via images which are independent of the images used for statistical analysis.  In response, one might suggest as an alternative hypothesis, as one of our later reviewers did, that the SSD metric is much more sensitive to clinically relevant differences.  We addressed this in Footnote 4:
> 
>> One reviewer suggested the possibility that the increase in statistical significance produced 
>> using the SSD and Demons metrics (vs. MI or CC) 

I think this is where we might have a misunderstanding. You don’t show an increase in statistical significance in your paper. What you show is that you see more large t-values, and that is not the same thing. We need to distinguish between calculating the statistic and testing the statistic. By “testing the statistic” I mean the t->p transform. And it is the p-values that we are really interested in, since those determine whether we reject the null hypothesis (i.e. report a difference).

You assume that the null distribution is the same for all cases (SSD, Demons, etc.) and therefore that a larger t-value implies a smaller p-value. You should not make that assumption.

Importantly, permutation testing does not make that assumption. It is, in principle, possible that a t-value of 4.5 achieved by MI corresponds to a (corrected) p=0.05 but that a t-value of 5.5 achieved by SSD also corresponds to p=0.05. Hence, one cannot draw any conclusions from just looking at the t-values. And, to be extra clear, nor can one by comparing those t-values to the parametric t-distribution.
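
To make this concrete, here is a toy numerical sketch (the distributions and degrees of freedom below are invented purely for illustration; they are not taken from your data or ours). Two different null distributions of the maximum t-value yield two different t-thresholds for the same corrected p=0.05:

import numpy as np

rng = np.random.default_rng(0)
n_perm, n_vox = 5000, 2000

def max_abs_t_null(df):
    # Stand-in for the null distribution of the maximum |t| across voxels
    # that permutation testing would estimate from the data.
    return np.abs(rng.standard_t(df, size=(n_perm, n_vox))).max(axis=1)

# Two hypothetical pipelines: the heavy-tailed one plays the role of
# SSD/Demons in this discussion, the light-tailed one the role of MI/CC.
# The degrees of freedom are arbitrary choices for illustration only.
null_light = max_abs_t_null(df=60)
null_heavy = max_abs_t_null(df=6)

# The FWE-corrected p=0.05 threshold is the 95th percentile of each null.
thr_light = np.percentile(null_light, 95)
thr_heavy = np.percentile(null_heavy, 95)
print(f"corrected p=0.05 threshold, light-tailed null: t = {thr_light:.2f}")
print(f"corrected p=0.05 threshold, heavy-tailed null: t = {thr_heavy:.2f}")
# The same t-value can therefore sit above one threshold and below the
# other: the t-value alone does not determine the p-value.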

> I like what you said:  "If the “problem" is control of false positives, then permutation testing _is_ the remedy."  I agree that permutation testing is a standard mechanism for controlling false positives during the statistical analysis of one's experiment.  If a scientist collects samples from two groups (e.g., A & B) and is looking to determine if there are differences between those two groups, all the considerations you mention are important (including permutation testing) to avoid false positives. However, and I think this is crucial for understanding what we are saying in our paper, *the statistical analysis is not the only possible source of false positives.  False positives can also occur if data selection is performed in a biased way* (e.g., the scientist unknowingly collects the 'A' samples from a contaminated source).  Obviously, this type of bias is not going to be corrected via permutation testing.  
> 
> So, going back to your example, it is not that one is "nudging samples" from the sets A and B in a way that is corrected with permutation testing.

Just so we are on the same page: permutation testing does _not_ correct your statistic (the t-values). It builds a null distribution that allows you to do the t->p transformation. The way it is typically used is to calculate a threshold above which one can reject the null hypothesis at some specified false positive rate.
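
For concreteness, this is the kind of procedure I mean, written as a minimal Python sketch with made-up data (a max-statistic permutation test over voxels; the variable names and numbers are mine, not from any real analysis):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_a, n_b, n_vox = 20, 20, 1000
data = rng.normal(size=(n_a + n_b, n_vox))     # made-up subjects x voxels
labels = np.array([0] * n_a + [1] * n_b)       # group membership

def max_abs_t(data, labels):
    # Voxelwise two-sample t-test; the maximum |t| over voxels, compared
    # against its permutation distribution, gives familywise control.
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    return np.abs(t).max()

observed = max_abs_t(data, labels)

# Build the null distribution by permuting the group labels.
n_perm = 1000
null_max_t = np.array([max_abs_t(data, rng.permutation(labels))
                       for _ in range(n_perm)])

# Both the t -> p transformation and the rejection threshold come from
# this null distribution, not from the parametric t-distribution.
threshold = np.percentile(null_max_t, 95)            # corrected p = 0.05
p_corrected = (1 + np.sum(null_max_t >= observed)) / (1 + n_perm)
print(f"max |t| = {observed:.2f}, threshold = {threshold:.2f}, "
      f"corrected p = {p_corrected:.3f}")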

> Rather, depending on how one does the spatial normalization, one is going to get unique sets A and B at each voxel for each normalization configuration.  For example, in our paper, we look at the metrics SSD, Demons, MI, and CC and end up with the voxelwise sets:
> 
> A_{SSD}       vs. B_{SSD}
> A_{Demons} vs. B_{Demons}
> A_{MI}         vs. B_{MI}
> A_{CC}         vs. B_{CC}

I am sorry, but I can’t really do anything other than repeat myself here. If the nudging, registration, etc. is not informed of the design, it _cannot_ cause a bias. It _can_ change the distribution of your statistic, which can in turn lead to heavier tails (which you will observe as more large t-values among your voxels). But this does not lead to a loss of control of false positives. So none of this poses any problem for users of TBSS, and false positives will be controlled.
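
If it helps, here is a toy Monte Carlo along the same lines (everything in it, the heavy-tailed noise, the sample sizes, the number of permutations, is an assumption chosen for illustration): the noise makes large t-values more common than under Gaussian noise, yet the permutation-derived threshold adapts and the familywise error rate stays at its nominal level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_a = n_b = 15
n_vox, n_perm, n_sim, alpha = 200, 200, 200, 0.05
labels = np.array([0] * n_a + [1] * n_b)

def max_abs_t(data, labels):
    t, _ = stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0)
    return np.abs(t).max()

false_positives = 0
for _ in range(n_sim):
    # No true group difference; the heavy-tailed noise (standing in for the
    # effect of design-blind processing) produces more large t-values than
    # Gaussian noise would.
    data = rng.standard_t(3, size=(n_a + n_b, n_vox))
    observed = max_abs_t(data, labels)
    null = np.array([max_abs_t(data, rng.permutation(labels))
                     for _ in range(n_perm)])
    if observed > np.percentile(null, 100 * (1 - alpha)):
        false_positives += 1

print(f"empirical familywise error rate: {false_positives / n_sim:.2f} "
      f"(nominal {alpha})")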

I think I am running out of ways of explaining this now and may need to leave this for someone else to explain more clearly.

Jesper and Steve