Sincere thanks to everyone who responded. I deliberately simulated data from two normal populations with the same variance, with sample sizes n1 = 5 and n2 = 10. The ratio of the larger IQR to the smaller IQR exceeded 2 in around one in three cases. Thus, in a scenario where a Mann-Whitney test would be appropriate (albeit inferior to a t-test), the guideline would "advise" against its use. This led me to doubt its validity and post the query. Responses are given below.
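For anyone wishing to repeat the simulation described above, a minimal sketch in Python (standard library only) might look like the following. The function names, the seed, and the number of replications are illustrative assumptions, not the code originally used. The quartile definition is also an assumption (the default "exclusive" method of statistics.quantiles), and with samples this small the proportion obtained depends noticeably on that choice.

```python
import random
import statistics


def iqr(sample):
    """Inter-quartile range using the default ('exclusive') quartile method."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return q3 - q1


def prop_iqr_ratio_exceeds(n1=5, n2=10, threshold=2.0, n_sims=10000, seed=1):
    """Proportion of simulated pairs of N(0, 1) samples whose larger IQR
    is more than `threshold` times the smaller IQR."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        r1, r2 = iqr(a), iqr(b)
        if max(r1, r2) > threshold * min(r1, r2):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print(prop_iqr_ratio_exceeds())
```

Both populations are identical here, so any ratio above 2 arises purely from sampling variation in the two IQRs.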
I suspect that the guideline is not very good. However, it probably
has to do with heteroscedasticity, and you have programmed
homoscedasticity into your simulation.
It is well-known that the two-sample t-test is not robust if the
population variances are different unless the sample sizes are the
same. If the smaller sample has the larger variance there is a
problem. I suspect that what is being done here is to give some
guidance to deal with possible differences in variances but I also
suspect that it is a pretty useless rule of thumb.
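The point about the equal-variance t-test can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not anything the respondent supplied: the sample sizes (5 and 10), the standard deviations, and the function names are hypothetical choices. To stay within the standard library it hard-codes 2.160, the two-sided 5% critical value of Student's t with 13 degrees of freedom, so it only applies when n1 + n2 - 2 = 13.

```python
import random
import statistics

T_CRIT_DF13 = 2.160  # two-sided 5% critical value, Student's t, 13 df


def pooled_t(x, y):
    """Equal-variance (pooled) two-sample t statistic."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = (sp2 * (1.0 / n1 + 1.0 / n2)) ** 0.5
    return (statistics.fmean(x) - statistics.fmean(y)) / se


def rejection_rate(sd1, sd2, n1=5, n2=10, n_sims=4000, seed=1):
    """Share of simulated data sets, with equal population means, in which
    the pooled t-test rejects at the nominal 5% level."""
    if n1 + n2 - 2 != 13:
        raise ValueError("the hard-coded critical value assumes 13 df")
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if abs(pooled_t(x, y)) > T_CRIT_DF13:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("equal SDs:          ", rejection_rate(1.0, 1.0))
    print("small sample, big SD:", rejection_rate(3.0, 1.0))
```

Comparing the two rates shows how far the actual size of the test drifts from the nominal 5% when the smaller sample comes from the more variable population.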
It might be worth looking at Gerald van Belle's book
(http://www.vanbelle.org/) to see if there is a mention.
I have, and ignored it, for the good of the science.
I have never heard of this restriction on the Mann-Whitney.
Please let me know what the general opinion of your respondents is. I
would hazard a guess that whoever penned the "guideline" was trying to
be clever and "transfer" the guideline that a t-test should not be used
if the standard deviation of one sample is more than twice that of the
other, but I could be wrong!
From a theoretical perspective, I can see why this recommendation was made.
The Mann-Whitney test is a member of the class of permutation tests. This
class of tests has the property that, under the null hypothesis, all the
rearrangements of the data performed by the test must be equally likely.
This condition is met if, and only if, the data from the different groups
is drawn from the same error distribution. (I mean by this that the
distributions of scores within each group/condition should not differ in
any respect, except that of location). In the scenario where the spread
differs obviously (and substantially) between the groups, this condition is
clearly violated and the test degenerates into one that merely tests for
significant differences (OF ANY SORT) between the groups/conditions. Thus,
under these circumstances, the test ceases to be a simple test for differences
in location (medians).
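To make the permutation argument concrete, here is a minimal sketch of an exact two-sided rank-sum test in Python. It enumerates every way of assigning the pooled ranks to the first group and treats all assignments as equally likely, which is exactly the assumption discussed above. The function names are hypothetical, and the two-sided p-value is defined here (as an assumption) by distance of the rank sum from its null expectation. It is only practical for small samples such as 5 and 10, where there are 3003 assignments.

```python
from itertools import combinations


def midranks(values):
    """Ranks 1..N of the pooled data, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def exact_rank_sum_pvalue(x, y):
    """Two-sided exact p-value for the rank sum of x, assuming every
    assignment of the pooled ranks to the two groups is equally likely."""
    ranks = midranks(list(x) + list(y))
    n1 = len(x)
    expected = n1 * (len(ranks) + 1) / 2  # null mean of the rank sum
    observed_dev = abs(sum(ranks[:n1]) - expected)
    extreme = total = 0
    for subset in combinations(ranks, n1):
        total += 1
        if abs(sum(subset) - expected) >= observed_dev - 1e-9:
            extreme += 1
    return extreme / total
```

When the groups differ in spread, the assignments are no longer equally likely even if the medians agree, so the p-value returned here no longer refers to a pure location hypothesis.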
I have not seen this guideline as such. However, I have long been aware
that the Mann-Whitney U-statistic, and the associated confidence
interval for the Hodges-Lehmann median difference, are robust to
non-Normality and non-robust to unequal variability. I have developed a
package (somersd) in the Stata statistical language to calculate
confidence intervals for rank statistics that are robust to unequal
variability. The theory is written up in Newson (2002), Newson (2006a)
and Newson (2006b), and also in some manuals distributed with the
package. All of these can be downloaded from my website (see my
signature below). If you have Stata, then you can download the package
by typing in Stata
ssc describe somersd
ssc install somersd, replace
I have done some simulations, and submitted the results for publication
in Computational Statistics and Data Analysis, on the performance, under
a wide range of scenarios, of various confidence intervals for median
differences (my package, the Lehmann formula, and the equal-variance and
unequal-variance t-tests). The message of these simulations is that the
method implemented in the somersd package is robust to non-Normality and
to unequal variability, at the price of being non-robust to tiny sample
numbers, under which conditions the confidence intervals may extend from
minus infinity to plus infinity. This is because, under those
conditions, if we are not allowed to assume Normality and/or equal
variability, then the median difference really could be anywhere.
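For readers unfamiliar with the Hodges-Lehmann median difference mentioned above, the point estimate is simply the median of all pairwise differences between the two samples. The short Python sketch below computes only that point estimate. It is not the somersd method and does not attempt the confidence intervals, and the function name is a hypothetical one chosen for illustration.

```python
import statistics


def hodges_lehmann_difference(x, y):
    """Median of all pairwise differences x_i - y_j (point estimate only)."""
    return statistics.median(xi - yj for xi in x for yj in y)
```

For example, hodges_lehmann_difference([6, 7, 8], [1, 2, 3]) takes the median of the nine differences 3, 4, 4, 5, 5, 5, 6, 6, 7.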
Top of my head - MW is appropriate for a SHIFT model. For testing, the
null hypothesis is identical distributions i.e. shift zero. I did some
simulations years ago that convinced me that it was not robust to scale
differences - the size of the test can be pretty badly compromised,
especially when one population is contaminated with skew (holding the
But the twice-scale rule. Never heard of it.
I have not come across this guideline. However I have read a paper* which looks
at the performance of the Mann-Whitney test in a range of scenarios using
simulations and shows that it performs poorly when the variances in the two
groups are unequal.
I am curious about the simulation you did. Firstly, if the data are normal then
the t-test would be preferable to the Mann-Whitney test. Secondly, the guideline
you mentioned relates to unequal inter-quartile ranges and yet you simulated
data with equal variances (perhaps this was a typo?). I think it would be more
relevant to simulate non-normal data with a range of differences in the
inter-quartile range between groups to judge whether the guideline is
appropriate (depending on how much time you want to spend investigating this!).
* Skovlund E & Fenstad G (2001). Should we always choose a nonparametric test
when comparing two apparently nonnormal distributions? Journal of Clinical
This is nonsense, as the Mann-Whitney test can be used for ordinal
data. The notion of interquartile range involves subtraction, so can
apply only to interval data. However, there is a condition that if you
want to use the test as testing the null hypothesis that the medians are
the same, the two distributions must differ only in location. This
clearly applies to interval data only as ordinal data do not have a
shape. Under these circumstances the test also tests the null
hypothesis that the means are equal. As the standard deviations must be
identical, you might as well do a t test and get a confidence interval.
In theory, the bootstrap is the only technique that should be used to
compare the means of two populations that have quite different variances
(that is, the Behrens-Fisher problem). Student's t, a permutation test
using the original observations, and the permutation using ranks
(Mann-Whitney, Wilcoxon) are all likely to yield inexact significance
levels. Still, simulations have shown that permutation tests are almost
exact even when the variance of one population is twice that of the
other. See http://statisticsonline.info/application.htm.
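As an illustration of the bootstrap approach mentioned above, the following Python sketch builds a percentile bootstrap confidence interval for the difference in means. Each group is resampled separately, so no equal-variance assumption is made. The percentile method, the number of resamples, the seed, and the function name are all assumptions made for this example, since the respondent did not specify a particular bootstrap variant.

```python
import random
import statistics


def bootstrap_mean_diff_ci(x, y, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for mean(x) - mean(y).

    Each group is resampled with replacement on its own, so the two
    groups are allowed to have different variances."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        bx = rng.choices(x, k=len(x))
        by = rng.choices(y, k=len(y))
        diffs.append(statistics.fmean(bx) - statistics.fmean(by))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

An interval that excludes zero would be read as evidence of a difference in means. With samples as small as 5 and 10, percentile intervals can be too narrow, so this should be treated as a starting point only.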
G Robin Henderson
Scottish National Stroke Audit
Royal Infirmary of Edinburgh
0131 242 6934