> Date: Mon, 31 May 1999 17:58:37 +0100
> Subject: Inconsistency in the use of Kappa
> From: sb143 <[log in to unmask]> (Stephen Brealey,
> Department of Health Sciences, University of York)
> To: [log in to unmask]
>
> Even in my limited experience I have seen Kappa used in several
> different ways when measuring inter-observer variation in the
> context of radiology. I would be extremely grateful if somebody
> could clarify the following few points for me:
>
> (a) usually Kappa is used to measure reliability in the
> interpretation of films by two radiologists i.e. independent
> individuals within the same profession. Would it be possible
> to use Kappa to measure reliability between two independent
> individuals but from different professional groups (i.e.
> radiographer vs radiologist)?
No reason why not. You would expect closer agreement between two
radiologists than between a radiographer and a radiologist. In
either situation, though, there could be consistent as well as random
differences between observers. It is important to look at what is
called marginal homogeneity - i.e. whether both individuals classify
similar percentages of subjects as positive - as well as the kind of agreement
that is measured by kappa. Often, "kappa" is calculated using
expected cell frequencies derived from homogenised marginal
frequencies. That is, p is taken as (a + (b+c)/2)/(a + b + c + d),
and expected proportional agreement is p**2 + (1-p)**2. Strictly,
this is Scott's pi (1955), which preceded the unhomogenised version
that Cohen called kappa in 1960. Kappa has acquired the reputation of
the gold standard, but often it is pi that is calculated and then
incorrectly called kappa. Moreover, the argument usually offered to
show that kappa is optimal rests on a misunderstanding: it is pi, not
kappa, that encompasses both types of disagreement. See Zwick, Psychol
Bull 1988, 103, 374-378 and also my letter to J Clin Epidemiol, 1996, 49, 1325.
An excellent CI method for kappa (really pi) for the 2 by 2 table is
given by Donner & Eliasziw, Stats in Medicine 1992, 11, 1511-1519.
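To make the distinction concrete, here is a minimal sketch in Python
(the cell counts are invented purely for illustration) of both
calculations for a 2 by 2 table:

# Sketch: Cohen's kappa vs Scott's pi for a 2 by 2 agreement table.
# Cells: a = both positive, b = observer 1 positive only,
# c = observer 2 positive only, d = both negative. Counts are invented.
a, b, c, d = 40, 10, 5, 45
n = a + b + c + d

po = (a + d) / n                          # observed proportional agreement

# Cohen's kappa: expected agreement from each observer's own marginals
p1 = (a + b) / n                          # proportion positive, observer 1
p2 = (a + c) / n                          # proportion positive, observer 2
pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
kappa = (po - pe_kappa) / (1 - pe_kappa)

# Scott's pi: expected agreement from the homogenised (averaged) marginals,
# p = (a + (b + c)/2) / n and expected agreement p**2 + (1-p)**2 as above
p = (a + (b + c) / 2) / n
pe_pi = p ** 2 + (1 - p) ** 2
pi = (po - pe_pi) / (1 - pe_pi)

print(f"kappa = {kappa:.3f}, pi = {pi:.3f}")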
All this assumes the classification is inherently binary. Estimation
carries across straightforwardly if the assessment scale has more
than 2 points but isn't ordinal, though CIs become much more
complicated. If it's ordinal, then weighted versions are available,
which penalise slight disagreements less heavily than larger ones.
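For the ordinal case, a minimal sketch of a linearly weighted kappa,
again with invented counts (quadratic weights would use the squared
category distance instead):

# Sketch: linearly weighted kappa for an ordinal m-point scale.
# table[i][j] = films rated category i by observer 1 and j by observer 2;
# the counts are hypothetical.
table = [[30,  5,  1],
         [ 4, 20,  6],
         [ 0,  3, 31]]
m = len(table)
n = sum(sum(row) for row in table)

row_tot = [sum(table[i]) for i in range(m)]
col_tot = [sum(table[i][j] for i in range(m)) for j in range(m)]

# Linear weights: full credit on the diagonal, partial credit for near misses,
# so slight disagreements are penalised less heavily than large ones.
w = [[1 - abs(i - j) / (m - 1) for j in range(m)] for i in range(m)]

po = sum(w[i][j] * table[i][j] for i in range(m) for j in range(m)) / n
pe = sum(w[i][j] * row_tot[i] * col_tot[j]
         for i in range(m) for j in range(m)) / n ** 2
kappa_w = (po - pe) / (1 - pe)
print(f"weighted kappa = {kappa_w:.3f}")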
> (b) Kappa is used to measure reliability between two individuals
> who each report on the same batch of films. Would it be possible
> to use Kappa to measure consistency between two independent groups
> of people from different professions who shared the burden of reporting
> on all the films?
Yes, there are several ways to do this. If each film is seen by just
one radiographer (chosen randomly from a pool) and by one radiologist
(chosen randomly from a pool), then just calculate kappa (or pi) in
the usual way - though I wouldn't expect the CI calculation to carry
across. If each of n films is seen by each of k1 radiographers
and each of k2 radiologists, each of whom classifies the film into
one of m categories, the simplest approach is to populate an m by m
agreement table, giving agreement between an arbitrarily chosen
radiographer and an arbitrarily chosen radiologist. The table total
is then n * k1 * k2. Kappa (or pi) can be estimated in the usual
way, weighted if ordinal - again, CIs are problematic.
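As a sketch of that pooled-table approach (films, pool sizes and
ratings all invented for illustration):

# Sketch: pooled m by m agreement table when each of n films is seen by
# all k1 radiographers and all k2 radiologists; ratings (0..m-1) invented.
radiographer_ratings = [[0, 0], [1, 0], [1, 1], [0, 0]]   # n = 4 films, k1 = 2
radiologist_ratings  = [[0, 1], [1, 1], [1, 1], [0, 0]]   # k2 = 2
m = 2

table = [[0] * m for _ in range(m)]
for rg_votes, rl_votes in zip(radiographer_ratings, radiologist_ratings):
    for i in rg_votes:              # an arbitrarily chosen radiographer
        for j in rl_votes:          # against an arbitrarily chosen radiologist
            table[i][j] += 1        # table total works out to n * k1 * k2

n_tot = sum(sum(row) for row in table)
po = sum(table[i][i] for i in range(m)) / n_tot

# pi-style expected agreement from the homogenised (averaged) marginals
marg = [(sum(table[i]) + sum(row[i] for row in table)) / (2 * n_tot)
        for i in range(m)]
pe = sum(p ** 2 for p in marg)
pi = (po - pe) / (1 - pe)
print(f"pooled pi = {pi:.3f}, table total = {n_tot}")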
> (c) Finally, if an observer's performance is compared with a "gold"
> standard (e.g. consensus report) then this is a measure of accuracy and
> Sn and Sp are calculated, rather than a measure of reliability. Is it
> possible to use Kappa by measuring reliability between the gold standard
> and an independent observer (i.e. consistently accurate), or is this not
> possible because you are changing your assumption regarding whether the
> gold standard is providing the "true" diagnosis or not?
This is really a different problem, as the sampling imprecision of
the "gold standard" doesn't have the same form as that of the
individual measurement. Nonetheless kappa is often used. If the
"gold standard" is a consensus (mode if categorical, median
if ordinal) of the k observers, then the agreement of a randomly
chosen observer with the gold standard, calculated from an m by m
table with table total n * k, is of course artefactually high,
because each of the observers has contributed to the consensus - the
bias is greatest when k=3 (you can't have a consensus of 2!), and falls
off as k increases. Sensitivity and specificity calculated against such
a consensus have the same
problem.
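To illustrate where the inflation comes from, a small sketch with
invented ratings, taking the consensus as the mode of k=3 observers:

# Sketch: agreement of an individual observer with a consensus "gold standard"
# taken as the mode of the k observers' ratings; the ratings are invented.
from statistics import mode

ratings = [[1, 1, 0],          # ratings[f][o]: binary rating of film f
           [0, 0, 0],          # by observer o; n = 4 films, k = 3 observers
           [1, 0, 1],
           [0, 1, 0]]

consensus = [mode(film) for film in ratings]       # consensus per film

# 2 by 2 table of observer rating vs consensus; the table total is n * k.
table = [[0, 0], [0, 0]]
for film, gold in zip(ratings, consensus):
    for r in film:
        table[r][gold] += 1

n_tot = sum(sum(row) for row in table)
po = (table[0][0] + table[1][1]) / n_tot
# po (and any kappa or pi derived from it) is artefactually high: each
# observer's own rating is one of the votes that produced the consensus.
print(f"agreement with own-group consensus = {po:.3f}, table total = {n_tot}")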
Hope this helps.
Robert Newcombe.
..........................................
Robert G. Newcombe, PhD, CStat, Hon MFPHM
Senior Lecturer in Medical Statistics
University of Wales College of Medicine
Heath Park
Cardiff CF14 4XN, UK.
Phone 01222 742329 or 742311
Fax 01222 743664
Email [log in to unmask]
Macros for good methods for confidence intervals
for proportions and their differences available at
http://www.uwcm.ac.uk/uwcm/ms/Robert.html