Thanks to all the allstat members who replied to my earlier query. Your
responses were very helpful in preparing my reply on the proposal.
The original question was:
>I have just been reviewing a study proposal.
>
>Two (medical) treatments are to be compared, and the primary efficacy
>measure is a continuous variable expected to have a non-normal
>distribution.
>
>The primary analysis is to use a Wilcoxon test to compare the
>treatments.
>
>Sample size was calculated, correctly, to have 80% power when p1 =
>0.75, where p1 is the probability that an observation in group 1 will
>be less than an observation in group 2. (This is the nomenclature and
>definition used by
>nQuery)
>
>
>
>I have two questions for the list
>
>1. Is there a short name for p1 defined as above, or some variant on
>it?
>
>2. Do others share my concern that p1 is not the way most people will
>think about differences between treatments (although it is the obvious
>parameter to use to power a Wilcoxon test), which means it is almost
>impossible to assess whether p1 = 0.75 is an appropriate effect size to
>design for?
1. Names for p1:
'Individual exceedance probability' and 'Dominance probability' have been
suggested.
Roger Newson pointed out that p1 is closely related to Somers' D (= p1 -
p2), and Robert Newcombe pointed out that p1 is the same as the AUROC (area
under the receiver operating characteristic curve).
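For readers who want to see the equivalence concretely, here is a small Python sketch (simulated data; the group distributions are hypothetical, not from the proposal) estimating p1 both directly from its definition and as U/(mn) from the Mann-Whitney U statistic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical continuous outcomes; group 2 is shifted upward
x1 = rng.normal(0.0, 1.0, size=200)
x2 = rng.normal(1.0, 1.0, size=250)

# Direct estimate of p1 = P(X1 < X2), counting ties as half
p1_direct = (np.mean(x1[:, None] < x2[None, :])
             + 0.5 * np.mean(x1[:, None] == x2[None, :]))

# The same quantity via the Mann-Whitney U statistic: p1_hat = U / (m*n)
U = mannwhitneyu(x2, x1, alternative="two-sided").statistic
p1_from_u = U / (len(x1) * len(x2))

print(p1_direct, p1_from_u)  # the two estimates coincide
```

For these Normal(0, 1) and Normal(1, 1) populations, the true value is PHI(1/sqrt(2)), roughly 0.76.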
2. Concerns about use of p1.
Most respondents affirmed my view that treatment differences are usually
more relevant, and Stephen Senn pointed out that p1 effectively divides
the treatment difference by the error standard deviation. However, this
concern is less compelling if the outcome scale is unfamiliar or ad hoc.
Detailed responses are given below:
Best wishes
Tim Auton
The views, opinions and judgements expressed in this message are solely
those of the author. The message contents have not been reviewed or
approved by Protherics.
T R Auton PhD MSc C.Math
Head of Biomedical Statistics
Protherics Molecular Design Ltd
The Heath Business and Technical Park
Runcorn
Cheshire
WA7 4QF
UK
email: [log in to unmask]
From: Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom
Email: [log in to unmask]
Hello Tim
A possible name for your "p1" is a dominance probability. However, I
personally tend to think in terms of Somers' D, which (in the case of a
2-sample Wilcoxon test) can be defined as p1-p2, where p1 is the
probability that a randomly-chosen member of Population A has a greater
value than a randomly-chosen member of Population B and p2 is the
probability that a randomly-chosen member of Population B has a greater
value than a randomly-chosen member of Population A. Somers' D has the
advantage that it is still a measure of dominance, even if there are tied
values.
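Somers' D as defined here is easy to compute directly; the sketch below (toy data, chosen to include ties) implements the p1 - p2 definition:

```python
import numpy as np

def somers_d(a, b):
    """Somers' D for two samples: P(A > B) - P(A < B).
    Tied pairs contribute to neither probability, so the measure
    remains well-defined in the presence of ties."""
    a = np.asarray(a)[:, None]
    b = np.asarray(b)[None, :]
    return np.mean(a > b) - np.mean(a < b)

# Discrete, tied data still yield a dominance measure
print(somers_d([3, 3, 4, 5], [1, 2, 3, 3]))  # 0.75
```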
An alternative parameter tested by the Wilcoxon test is the median
difference (or the median ratio in the case of positive-valued outcome
variables). However, median differences and ratios are defined in terms of
Somers' D. Also, power calculations are more easily defined in terms of
detectable levels of Somers' D than in terms of detectable levels of median
difference or ratio, because the Central Limit Theorem typically works a
lot quicker for Somers' D than for the median difference or ratio. (I
suspect that the nQuery package assumes that the 2 population distributions
differ only in location. This assumption will underestimate the standard
error of Somers' D when the larger of the 2 samples is from the less
variable of the 2 populations, and will overestimate the standard error of
Somers' D when the larger of the 2 samples is from the more variable of the
2 populations.) If you do not want to talk about either Somers' D or median
differences or ratios, then your best bet is probably to transform the data
and to measure ratios between geometric means (using log-transformed data)
or ratios or differences between algebraic means (a less common option,
using power-transformed data).
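The transform-then-compare option can be sketched as follows (simulated positive-valued outcomes; the true geometric-mean ratio here is exp(0.3), about 1.35):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical positive-valued outcomes from two groups
a = rng.lognormal(mean=1.0, sigma=0.4, size=40)
b = rng.lognormal(mean=1.3, sigma=0.4, size=40)

# Work on the log scale, then exponentiate: a ratio of geometric means
la, lb = np.log(a), np.log(b)
diff = lb.mean() - la.mean()
se = np.sqrt(la.var(ddof=1) / len(la) + lb.var(ddof=1) / len(lb))
t = stats.t.ppf(0.975, df=len(a) + len(b) - 2)

ratio = np.exp(diff)                                 # geometric-mean ratio
ci = (np.exp(diff - t * se), np.exp(diff + t * se))  # 95% CI for the ratio
print(ratio, ci)
```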
You might like to read an article of mine about the parameters behind
so-called "non-parametric" statistics (Newson, 2002), for which a
pre-publication draft may be downloaded from my website at
http://www.kcl-phs.org.uk/rogernewson/
where you can also download some papers, and a presentation, about
calculating confidence intervals for these parameters using the Stata
statistical package.
I hope this helps.
Best wishes
Roger
References
Newson R. Parameters behind "nonparametric" statistics: Kendall's tau,
Somers' D and median differences. The Stata Journal 2002; 2(1): 45-64.
From: Ted Harding <[log in to unmask]>
This is quite likely to be true. What people will want to know about the
difference between two treatments is, well, their difference! That is to say
(since apparently there are quantitative data here) what is (X2-X1) likely
to be where X2 is from Group 2 and X1 from Group 1.
One way to get a feel for this would be to see what, for different types of
quantitative distribution (Normal, log-Normal, ... ), the parameter
difference corresponding to P(X1 < X2) = 0.75 is. In the case of a Normal
N(mu, sigma^2) this will be in terms of mu/sigma, for instance, so would
still leave some questions hanging in the air. In the case of a log-Normal,
the parameter difference would be effectively the same, but it corresponds
to the difference log(X2) - log(X1). It could be worked out what
difference (X2-X1) this corresponded to, but it would be even more sensitive
to sigma than (X2-X1) for a Normal distribution. You would need to make a
judgement about which scale (raw or log) was most meaningful in real life.
And so on.
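This exercise can be carried out numerically; the sketch below (sigma_log is a hypothetical value, not from the proposal) translates p1 = 0.75 into a standardised mean difference under a Normal model and into a median ratio under a log-Normal model:

```python
import numpy as np
from scipy.stats import norm

p1 = 0.75
# For two Normals with common SD sigma, P(X1 < X2) = PHI((mu2 - mu1)/(sigma*sqrt(2))),
# so the standardised mean difference implied by p1 is:
delta = np.sqrt(2) * norm.ppf(p1)   # (mu2 - mu1) / sigma, about 0.95

# Under a log-Normal model the same shift applies on the log scale, so it
# corresponds to a ratio of medians of exp(delta * sigma_log).
sigma_log = 0.5                     # hypothetical SD of log(X)
median_ratio = np.exp(delta * sigma_log)
print(delta, median_ratio)
```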
I suspect the study designers may have got cold feet about facing up to the
quantitative issues, preferring to hide behind a distribution free test
which sweeps most quantitative aspects under the carpet!
Best wishes,
Ted.
From: Dr Dennis O. Chanter
Director
Statisfaction Statistical Consultancy Ltd
Tel: +44 (0)1424 219202
Fax: +44 (0)7005 982219
Mobile: +44 (0)7904 101470
E-mail: [log in to unmask]
Hello Tim,
Good question!
However, if p1 (no I don't know a succinct name for it) is not the way
most people think about differences between treatments, then I assume you
mean that most people think in terms of a difference in treatment means
(or maybe medians if the distributions are skewed). But by electing to
use a Wilcoxon test based on ranks, aren't you (or they) already saying
that such methods of thinking about treatment differences are not
appropriate in this case? If you want to think about treatment
differences in terms of estimates of location parameters, then (if there
are distributional difficulties) the obvious choice of test would surely
be a randomisation test. OK it might mean that the power calculations
have to be done by simulation, but at least the characterisation of the
difference and the choice of test statistics are compatible.
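As a sketch of this suggestion, the following simulates power for a two-sample randomisation test on the difference in means (the distributions and effect size are hypothetical, not from the proposal):

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_test_pvalue(x, y, n_perm=200):
    """Randomisation test: p-value for the absolute difference in means."""
    obs = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= obs:
            hits += 1
    return hits / n_perm

def power(n, shift, n_sim=100, alpha=0.05):
    """Power by simulation under a skewed (log-Normal) alternative."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.lognormal(0.0, 0.5, n)
        y = rng.lognormal(shift, 0.5, n)
        if perm_test_pvalue(x, y) < alpha:
            rejections += 1
    return rejections / n_sim

pw = power(30, 0.5)
print(pw)  # high power for this (hypothetical) effect size
```

The simulation counts are kept small for speed; a real calculation would use far more permutations and replicates.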
Regards
Dennis
From: Robert G. Newcombe, PhD, CStat, Hon MFPH
Reader in Medical Statistics
University of Wales College of Medicine
Heath Park
Cardiff CF14 4XN, UK.
Phone 029 2074 2329 or 2311
Fax 029 2074 3664
Email [log in to unmask]
I don't think anyone has come up with a snappy name for p1 - which
after all strictly isn't a parameter of any distribution. At the
analysis stage, the corresponding quantity is U/mn, the Mann-Whitney-
Wilcoxon U statistic divided by the product of the two sample sizes.
It is the same as the AUROC (area under receiver operating
characteristic curve). Care is needed when distributions are
discrete and hence ties are likely - the definition must include half
the probability that the two observations are equal, and also, even
when this is done, the observed value is then a biased estimate of
the true value - see Hanley JA, McNeil BJ (1982), The meaning and
use of the area under a receiver operating characteristic (ROC)
curve, Radiology 143, 29-36. Even though you refer to a continuous
distribution above, this may still be an issue as in practice all
continuous variables other than those captured digitally are recorded
in discrete form.
I agree that this pseudo-parameter is an obvious choice to base a MWW
power calculation on - and indeed also to use as a sample-size-free
summary measure for summarising the results. Unfortunately it isn't
very familiar to most statisticians, let alone users - a situation I
would very much like to alter. If the outcome scale is obscure or
developed ad hoc, then there is little point in summarising the
results by giving the estimated median difference with a CI, as
others will have little idea of what size of difference on this scale
is important. In this situation, a relative measure seems more
appropriate, and p1 alias U/mn alias AUROC seems the obvious choice.
The best way to visualise what a p1 value amounts to is to draw two
Gaussian distributions each with SD 1, and with peaks d units apart,
i.e. d is the standardised difference. Then the p1 corresponding to
any value of d is PHI(d/sqrt(2)), where PHI is the cdf of the
standard Normal distribution. For p1 = 0.75, d/sqrt(2) is 0.674, so
this choice of p1 corresponds to two Gaussian distributions with
peaks 0.674 * sqrt(2) = 0.95 SDs apart. This admittedly only
converts one relative measure, p1, into another, d, but it does
facilitate an interpretable visual display.
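This conversion is easy to verify numerically; the sketch below computes d from p1 = 0.75 and then confirms by simulation that two unit-SD Gaussians that far apart do give p1 close to 0.75:

```python
import numpy as np
from scipy.stats import norm

# d such that PHI(d / sqrt(2)) = 0.75
d = np.sqrt(2) * norm.ppf(0.75)
print(round(d, 3))  # 0.954

# Monte Carlo check: independent draws from N(0, 1) and N(d, 1)
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 100_000)
x2 = rng.normal(d, 1.0, 100_000)
p1_hat = np.mean(x1 < x2)  # each elementwise pair is an independent (X1, X2) draw
print(round(p1_hat, 3))    # close to 0.75
```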
Hope this helps.
From: Stephen Senn
Professor of Statistics
Department of Statistics
15 University Gardens
<http://www.gla.ac.uk>University of Glasgow
G12 8QQ
email [log in to unmask]
Dear Tim,
This is related in my view to the issue of sigma-divided measures. I have
called p1 the 'individual exceedance probability' in
S. J. Senn (1997). Testing for individual and population equivalence based
on the proportion of similar responses. Statistics in Medicine, 16,
1303-1306.
In my view as a concept it makes no sense except in the context of random
sampling and hence is not really relevant to clinical trials.
Regards
Stephen