Thanks for your comments. I intend to measure inter-observer
variation both within and between professions, AND to measure
intra-observer variation for observers within the two professions
(i.e. radiographers vs radiologists). I am also repeating the
process of comparing reports for a random sub-sample to measure
inter-arbiter and intra-arbiter variation, where "arbiter" refers
to the person responsible for comparing reports. The reason for
doing this is that inter-observer variation may be partly an effect
of inconsistency in the process of comparing reports to judge
whether they agree or not. However, I don't know how you could
use this information to calculate the "true" inter-observer
variation. Any ideas?
Stephen Brealey
Department of Health Sciences
And Clinical Evaluation
University of York
SKIP LANTZ wrote:
>
> To Stephen Brealey (& Robert Newcombe)
>
> I just got into the loop of this discussion via Stephen Perle and I find Kappa fascinating!! I particularly like Dr Newcombe's approach to Kappa analysis and would like to hear more about the procedures for managing multiple 2x2 matrices.
>
> Regarding the design for the radiologist vs radiographer comparison, I have some additional recommendations. A much stronger design would ensue if you added a second rater of each category. You could then compare within-specialty and between-specialty reliabilities and perform a reliability mapping similar to that described in Spine 1999; 24(11):1082-1089 for inter- and intra-examiner reliabilities. That way you could compare the inter- and intra-examiner reliabilities for a better perspective. It would also be interesting to determine the intra-examiner reliabilities for the individual examiners...perhaps on a smaller subset of subjects. Such a simple design would provide for substantial power in the analysis, it would seem to me (power taken in a more generic, rather than statistical, sense...but the latter could apply as well).
>
> >>> "Dr. Robert Newcombe" <[log in to unmask]> 06/11/99 11:47AM >>>
> > Date: Mon, 31 May 1999 17:58:37 +0100
> > Subject: Inconsistency in the use of Kappa
> > From: sb143 <[log in to unmask]> (Stephen Brealey,
> > Department of Health Sciences, University of York)
> > To: [log in to unmask]
> >
> > Even in my limited experience I have seen Kappa used in several
> > different ways when measuring inter-observer variation in the
> > context of radiology. I would be extremely grateful if somebody
> > could clarify the following few points for me:
> >
> > (a) usually Kappa is used to measure reliability in the
> > interpretation of films by two radiologists i.e. independent
> > individuals within the same profession. Would it be possible
> > to use Kappa to measure reliability between two independent
> > individuals but from different professional groups (i.e.
> > radiographer vs radiologist)?
>
> No reason why not. You would expect closer agreement between two
> radiologists than between a radiographer and a radiologist. In
> either situation, though, there could be consistent as well as random
> differences between observers. It is important to look at what is
> called marginal homogeneity - i.e. whether both individuals classify
> similar %s of subjects as positive - as well as the kind of agreement
> that is measured by kappa. Often, "kappa" is calculated using
> expected cell frequencies derived from homogenised marginal
> frequencies. That is, p is taken as (a + (b+c)/2)/(a + b + c + d),
> and expected proportional agreement is p**2 + (1-p)**2. Strictly,
> this is Scott's pi (1955) which preceded the unhomogenised version
> which Cohen called kappa in 1960. Kappa has acquired the
> reputation of the gold standard, but often pi is calculated and
> incorrectly called kappa. And the argument used to say that kappa is
> optimal is really a misunderstanding: it's pi, not kappa, that
> encompasses both types of disagreement. See Zwick, Psychol Bull
> 1988, 103, 374-378 and also my letter to J Clin Epi, 1996, 49, 1325.
> An excellent CI method for kappa (really pi) for the 2 by 2 table is
> given by Donner & Eliasziw, Stats in Medicine 1992, 11, 1511-1519.
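>
> As a minimal sketch of the distinction (in Python, assuming the
> cells of the 2 by 2 table are labelled a = both observers positive,
> d = both negative, and b, c the two kinds of disagreement, matching
> the formula above), the two coefficients can be computed side by
> side:
>
>     def kappa_and_pi(a, b, c, d):
>         """Cohen's kappa and Scott's pi for a 2x2 agreement table."""
>         n = a + b + c + d
>         po = (a + d) / n                      # observed agreement
>         # Cohen's kappa: chance agreement from each observer's own margins
>         pe_kappa = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
>         # Scott's pi: chance agreement from the homogenised (averaged) margin
>         p = (a + (b + c) / 2) / n
>         pe_pi = p ** 2 + (1 - p) ** 2
>         return (po - pe_kappa) / (1 - pe_kappa), (po - pe_pi) / (1 - pe_pi)
>
>     # e.g. kappa_and_pi(40, 10, 4, 46) gives roughly (0.72, 0.72); the two
>     # coefficients only drift apart as the marginal %s positive diverge.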
>
> All this assumes the classification is inherently binary. Estimation
> carries across straightforwardly if the assessment scale has more
> than 2 points but isn't ordinal, though CIs become much more
> complicated. If it's ordinal, then weighted versions are available,
> which penalise slight disagreements less heavily than larger ones.
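>
> For the ordinal case, a short sketch of one common weighting scheme
> (linear disagreement weights, so a one-category difference is
> penalised less than a two-category one); the m by m table layout,
> rows for one observer and columns for the other, is assumed:
>
>     def weighted_kappa(table):
>         """Linearly weighted kappa for an m x m table of counts."""
>         m = len(table)
>         n = sum(sum(row) for row in table)
>         row_tot = [sum(row) for row in table]
>         col_tot = [sum(table[i][j] for i in range(m)) for j in range(m)]
>         obs = exp = 0.0
>         for i in range(m):
>             for j in range(m):
>                 w = abs(i - j) / (m - 1)   # 0 on the diagonal, 1 in the corners
>                 obs += w * table[i][j]                  # observed weighted disagreement
>                 exp += w * row_tot[i] * col_tot[j] / n  # chance-expected disagreement
>         return 1 - obs / exp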
>
> > (b) Kappa is used to measure reliability between two individuals
> > who each report on the same batch of films. Would it be possible
> > to use Kappa to measure consistency between two independent groups
> > of people from different professions who shared the burden of reporting
> > on all the films?
>
> Yes, there are several ways to do this. If each film is seen by just
> one radiographer (chosen randomly from a pool) and by one radiologist
> (chosen randomly from a pool), then just calculate kappa (or pi) in
> the usual way - though I wouldn't expect the CI calculation to carry
> across. If each of n films is seen by each of k1 radiographers
> and each of k2 radiologists, each of whom classifies the film into
> one of m categories, the simplest approach is to populate an m by m
> agreement table, giving agreement between an arbitrarily chosen
> radiographer and an arbitrarily chosen radiologist. The table total
> is then n * k1 * k2. Kappa (or pi) can be estimated in the usual
> way, weighted if ordinal - again, CIs are problematic.
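>
> A minimal sketch of populating that m by m table (assuming each
> film's ratings are held as two lists of category indices 0..m-1, one
> per profession); kappa or pi can then be computed from the pooled
> table in the usual way:
>
>     def pooled_agreement_table(radiographer_ratings, radiologist_ratings, m):
>         """Cross-tabulate every (radiographer, radiologist) pairing per film.
>         radiographer_ratings[f] and radiologist_ratings[f] hold the categories
>         assigned to film f by the k1 and k2 observers; the table total is
>         n * k1 * k2 as described above."""
>         table = [[0] * m for _ in range(m)]
>         for rgs, rls in zip(radiographer_ratings, radiologist_ratings):
>             for i in rgs:
>                 for j in rls:
>                     table[i][j] += 1
>         return table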
>
> > (c) Finally, if an observer's performance is compared with a "gold"
> > standard (e.g. consensus report) then this is a measure of accuracy and
> > Sn and Sp are calculated, rather than a measure of reliability. Is it
> > possible to use Kappa by measuring reliability between the gold standard
> > and an independent observer (i.e. consistently accurate), or is this not
> > possible because you are changing your assumption regarding whether the
> > gold standard is providing the "true" diagnosis or not?
>
> This is really a different problem, as the sampling imprecision of
> the "gold standard" doesn't have the same form as that of the
> individual measurement. Nonetheless kappa is often used. If the
> "gold standard" is a consensus (mode if categorical, median
> if ordinal) of the k observers, then the agreement of a randomly
> chosen observer with the gold standard, calculated from an m by m
> table with table total n * k, is of course artefactually high,
> because each of the observers has contributed to the consensus - the
> bias is greatest when k=3 (you can't have a consensus of 2!), and falls
> off as k increases. Sensitivity and specificity have the same
> problem.
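>
> A small sketch of the calculation being described, which also makes
> the source of the bias visible - each observer's own rating is part
> of the consensus it is compared against (categorical ratings and a
> simple per-subject list layout are assumed):
>
>     from statistics import mode
>
>     def observer_vs_consensus_table(ratings, m):
>         """ratings[s] = the k observers' categories for subject s.
>         Returns the m x m (observer category x consensus category) table
>         pooled over all observers, with table total n * k."""
>         table = [[0] * m for _ in range(m)]
>         for subject in ratings:
>             consensus = mode(subject)   # the rated observer is included here;
>                                         # ties would need an explicit rule
>             for r in subject:
>                 table[r][consensus] += 1
>         return table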
>
> Hope this helps.
>
> Robert Newcombe.
>
> ..........................................
> Robert G. Newcombe, PhD, CStat, Hon MFPHM
> Senior Lecturer in Medical Statistics
> University of Wales College of Medicine
> Heath Park
> Cardiff CF14 4XN, UK.
> Phone 01222 742329 or 742311
> Fax 01222 743664
> Email [log in to unmask]
>
> Macros for good methods for confidence intervals
> for proportions and their differences available at
> http://www.uwcm.ac.uk/uwcm/ms/Robert.html