Dear Allstat,
I posted a query yesterday to the list.
Thank you very much to Robert Newcombe, Paul Seed and
Francois Harel.
I provide their responses below for those interested.
Please contact me directly if you want details of the SAS macro
mentioned in the final email.
Original query
--------------
We've created a simple prognostic index for sudden cardiac
death. Patients can score a minimum of 0 and a maximum of 11.
We've then classified patients into low or high risk based
on possible combinations of low/high score (i.e. low=0/high=1-11,
low=0-1/high=2-11, ..., low=0-10/high=11) and calculated
sensitivity and specificity at each of these points. We've then
produced a receiver operating characteristic (ROC) curve and
have calculated the area under the curve (AUC).
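As an illustration of the procedure just described, here is a minimal Python sketch. The scores are made up for the example, and the function names roc_points and auc_trapezoid are hypothetical, not taken from any package. It assumes the rule "high risk if score >= cutoff" for cutoffs 1 to 11 and joins the resulting points with straight lines to get the AUC.

```python
def roc_points(case_scores, control_scores, max_score=11):
    """Return (cutoff, sensitivity, specificity) for each rule
    'high risk if score >= cutoff', cutoff = 1..max_score."""
    points = []
    for cutoff in range(1, max_score + 1):
        sens = sum(s >= cutoff for s in case_scores) / len(case_scores)
        spec = sum(s < cutoff for s in control_scores) / len(control_scores)
        points.append((cutoff, sens, spec))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    # Work in (1 - specificity, sensitivity) coordinates and add the
    # two trivial end points (0, 0) and (1, 1).
    curve = sorted([(0.0, 0.0), (1.0, 1.0)]
                   + [(1 - spec, sens) for _, sens, spec in points])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores, for illustration only.
deaths = [8, 9, 5, 7, 10, 3]
survivors = [1, 2, 3, 4, 0, 6]

points = roc_points(deaths, survivors)
auc = auc_trapezoid(points)
print(round(auc, 4))
```

With tied scores the trapezoidal area counts a tie between a death and a survivor as half a concordant pair.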
We are interested in different types of death and it would
be useful to apply the same prognostic index to patients
who die of progressive heart failure and (hopefully) make a
statement that the prognostic index is significantly "better" for
sudden cardiac death (as it was designed) than for progressive
heart failure (in terms of AUC).
Does anyone have any suggestions of how we can go about this?
The standard error of the AUCs can be calculated and we could
obtain a measure of the standard error of the difference of the
AUCs, but this would result in a conservative test as cause of
death is probably not independent.
I am aware that there are references and methods available when
comparing two different indices with regard to the same cause of
death.
Thank you in advance,
Mandy
Response from Robert Newcombe
-----------------------------
A very interesting problem! I think the key to it is to realise that
the AUROC is equivalent to the Mann-Whitney statistic relating to the
separation/overlap of the distributions of scores in the deaths and
the non-deaths. From what you say above, your results might look
something like this.
Group 1: Survivors: n1 subjects, median score 3
Group 2: Progressive heart failure: n2 subjects, median score 5
Group 3: Sudden cardiac death: n3 subjects, median score 8.
This would correspond to better prediction of sudden death than of
progressive heart failure. We would characterise this by calculating
the AUROCs for group 2 v. group 1, and for group 3 v. group 1. Or
equivalently, calculate Mann-Whitney U for group 2 v. group 1, and
divide by n1*n2 to make it sample size free; calculate M-W U for
group 3 v. group 1, and divide by n1*n3. Each of these AUROC or
U/(ni*nj) values can have confidence intervals calculated by the
delta method (Hanley & McNeil 1982) or by a more refined one (Mee
1990).
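The equivalence between the AUROC and U/(ni*nj) can be sketched as follows. This is only an illustration: the function names and the scores are hypothetical, and the standard error uses the approximation given in Hanley & McNeil (1982), which rests on their distributional assumption for Q1 and Q2. The interval is a plain normal-approximation one, not the Mee (1990) method.

```python
import math


def auc_mann_whitney(case_scores, control_scores):
    """U/(m*n): proportion of (case, control) pairs in which the case
    scores higher, counting ties as one half."""
    u = 0.0
    for c in case_scores:
        for s in control_scores:
            if c > s:
                u += 1.0
            elif c == s:
                u += 0.5
    return u / (len(case_scores) * len(control_scores))


def hanley_mcneil_se(auc, n_cases, n_controls):
    """Approximate standard error of the AUC (Hanley & McNeil 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_cases - 1) * (q1 - auc ** 2)
           + (n_controls - 1) * (q2 - auc ** 2)) / (n_cases * n_controls)
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding


# Hypothetical scores, for illustration only.
sudden_death = [8, 9, 5, 7, 10, 3]
survivors = [1, 2, 3, 4, 0, 6]

auc = auc_mann_whitney(sudden_death, survivors)
se = hanley_mcneil_se(auc, len(sudden_death), len(survivors))
ci = (auc - 1.96 * se, auc + 1.96 * se)  # approximate 95% interval
print(round(auc, 4), round(se, 4))
```

Note that an interval of this form is not constrained to lie within 0 to 1.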
Then, to show that prediction of sudden cardiac death is better than
prediction of progressive heart failure, we could simply do a M-W
test comparing groups 2 and 3. If group 3's scores are shifted to
the right of those for group 2, this (virtually) implies that
prediction is better for outcome 3 than for outcome 2.
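A minimal sketch of such a comparison of groups 2 and 3, using the usual large-sample normal approximation with a correction for ties and no continuity correction. The function name and the scores are hypothetical; for small samples an exact test from a statistics package would be preferable.

```python
import math
from collections import Counter


def mann_whitney_z(group_a, group_b):
    """Return (U/mn, z, two-sided p), where U counts pairs in which
    the group_b score exceeds the group_a score (ties count one half)."""
    m, n = len(group_a), len(group_b)
    u = sum(1.0 if b > a else 0.5 if b == a else 0.0
            for a in group_a for b in group_b)
    total = m + n
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = Counter(list(group_a) + list(group_b)).values()
    tie_term = sum(t ** 3 - t for t in counts)
    variance = m * n / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    z = (u - m * n / 2.0) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return u / (m * n), z, p


# Hypothetical scores, for illustration only.
heart_failure = [3, 5, 4, 6, 5, 7, 2, 5]   # group 2
sudden_death = [8, 7, 9, 6, 10, 8, 5, 9]   # group 3

u_over_mn, z, p = mann_whitney_z(heart_failure, sudden_death)
print(round(u_over_mn, 3), round(z, 2), round(p, 4))
```

A value of U/mn above one half with a small p-value would indicate that group 3's scores tend to exceed group 2's.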
My only reservation in saying this is that applying M-W tests and
related methods to all pairwise comparisons of 3 groups is precisely
the situation in which the Condorcet (or Escher staircase) paradox
can arise. That is, it is possible to construct data for which
group 1 < group 2 < group 3 < group 1.
For example, consider 3 samples each of 3 observations:
Group 1: 1, 6, 8
Group 2: 2, 4, 9
Group 3: 3, 5, 7
If we omit group 3 and just compare groups 1 and 2, then by the M-W
criterion group 2 is shifted to the right relative to group 1.
Similarly for groups 3 and 2 and for groups 1 and 3. Obviously for
these data, all these differences are very small and nowhere near
significant, but if we replicated each of the observations k times,
we could make each of the M-W tests arbitrarily highly significant.
Thus we would have to ensure that the data didn't do something
paradoxical like this. But if it looks well-behaved, then I think it
is reasonable to seek to show that scores in group 3 are higher than
those in group 2, and to take this as the best evidence that the
scoring system is better for predicting sudden death than progressive
heart failure.
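The three-sample example above can be checked directly. The sketch below (function names are hypothetical) computes U/mn for each ordered pair of groups and flags a cycle; the same check could be run on real data before interpreting the pairwise comparisons.

```python
def prob_greater(a, b):
    """U/mn: proportion of pairs (x in a, y in b) with x > y,
    counting ties as one half."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
    return wins / (len(a) * len(b))


def has_cycle(g1, g2, g3):
    """True if the pairwise comparisons form a cycle in either direction."""
    steps = [prob_greater(g2, g1), prob_greater(g3, g2), prob_greater(g1, g3)]
    return all(s > 0.5 for s in steps) or all(s < 0.5 for s in steps)


group1 = [1, 6, 8]
group2 = [2, 4, 9]
group3 = [3, 5, 7]

print(prob_greater(group2, group1))  # group 2 v. group 1
print(prob_greater(group3, group2))  # group 3 v. group 2
print(prob_greater(group1, group3))  # group 1 v. group 3
print(has_cycle(group1, group2, group3))
```

Each comparison gives 5/9, so every group is "shifted to the right" of the one before it. Replicating every observation k times leaves U/mn unchanged while the sample sizes grow, which is why the tests can be made arbitrarily significant.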
My interest in this issue is particularly high at the moment because
I'm developing an improved confidence interval method for the index
that I would generally refer to as U/mn - the Mann-Whitney statistic
divided by the product of the 2 sample sizes. The Hanley-McNeil and
Mee methods don't produce an interval for the extreme case of no
overlap (U/mn = 0 or 1), an important case in many laboratory
applications, and I suspect they wouldn't work well in near-extreme
cases also. My method is designed to cope with either very small
sample sizes or non-overlap. While I'm sure that you're likely to be
in the situation in which the existing methods are adequate,
nevertheless I would be very interested to see your data. I hope to
develop a program to implement the Mee method shortly, and if you
would like me to run your data through this program, I would be very
happy to do so.
References.
Hanley JA, McNeil BJ (1982). The meaning and use of the area under a
receiver operating characteristic (ROC) curve. Radiology 143, 29-36.
Mee RW (1990). Confidence intervals for probabilities and tolerance
regions based on a generalisation of the Mann-Whitney statistic.
Journal of the American Statistical Association 85, 793-800.
Best wishes.
Robert Newcombe.
Response from Paul Seed
-----------------------
All of this is taken care of within Stata by the
roc commands: roctab, roccomp etc.
References to the statistical literature are given in the Stata manuals.
See http://www.stata.com for details on price etc.
Response from Francois Harel
----------------------------
The attached file provides ROC.SAS (contact me directly, as noted
above, if you wish to be sent this file)
Hi,
Maybe you can use a nonparametric comparison test of areas under correlated
ROC curves with the macro roc.sas available from the SAS web site.
REFERENCES:
E.R. DeLong, D.M. DeLong, and D.L. Clarke-Pearson (1988),
"Comparing the Areas Under Two or More Correlated Receiver
Operating Characteristic Curves: A Nonparametric Approach,"
Biometrics 44, p. 837-845.
But your predictors (combinations of scores) seem to be more than
correlated; I think that they are nested.
However, it may be a solution.
I hope that it helps.
If you get other answers, please let me know. I'm very interested in the
solution because I may eventually face similar problems myself.
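For readers without SAS, the idea behind the DeLong, DeLong and Clarke-Pearson (1988) test can be sketched in a few lines of Python. This is not the roc.sas macro; the function names and data are hypothetical. It assumes the setting of that paper, namely two indices (a and b) measured on the same cases and the same controls, which, as noted in the response above, is not quite the situation in the original query.

```python
import math


def _psi(x, y):
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(cases, controls):
    """Per-case and per-control placement values (V10 and V01)."""
    v10 = [sum(_psi(x, y) for y in controls) / len(controls) for x in cases]
    v01 = [sum(_psi(x, y) for x in cases) / len(cases) for y in controls]
    return v10, v01


def _cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def delong_compare(cases_a, controls_a, cases_b, controls_b):
    """Return (AUC_a, AUC_b, z, two-sided p). cases_a[i] and cases_b[i]
    must belong to the same patient, and likewise for controls."""
    m, n = len(cases_a), len(controls_a)
    v10_a, v01_a = _components(cases_a, controls_a)
    v10_b, v01_b = _components(cases_b, controls_b)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    # Variance of AUC_a - AUC_b, allowing for the correlation
    # induced by using the same patients for both indices.
    var_diff = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b)
                 - 2 * _cov(v10_a, v10_b)) / m
                + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b)
                   - 2 * _cov(v01_a, v01_b)) / n)
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    return auc_a, auc_b, z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical paired scores, for illustration only.
cases_a, controls_a = [7, 9, 6, 8, 10], [2, 4, 3, 6, 1]
cases_b, controls_b = [5, 3, 6, 2, 7], [4, 5, 3, 6, 2]

auc_a, auc_b, z, p = delong_compare(cases_a, controls_a, cases_b, controls_b)
print(round(auc_a, 2), round(auc_b, 2), round(z, 2))
```

The sketch fails with a division by zero if the two indices give identical placement values for every patient, since the estimated variance of the difference is then zero.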