Dear all,
a few days ago I posted a query on intra-rater reliability. There were a number of very valuable responses, which I paste below (I cut out the address details of the respondents). Thanks go out to all responders. I think I will follow the suggestion to use G-theory, a promising ANOVA-based variance-partitioning approach. Though I have some literature and examples, I am still having problems defining my crossed and nested facets, and I cannot yet figure out how to do the calculations in Stata or SPSS. So if there are any experienced "G-theory experts" out there, I would be very thankful if someone could pass a worked example to me or contact me off-list.
Thank you all for your help!
Best regards
David
My original query:
I'm having a problem finding an appropriate measure of intra-observer reliability. My data are as follows: I have data from roughly 700 judges. Each judge rated the same 7 items on two different occasions on a 1-9 scale. I'm looking for some measure that gives me an impression of the reliability of the judges.
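For concreteness, here is the design laid out as a long-format table, a minimal sketch in Python (pandas); the column names are my own choice:

import pandas as pd

rows = [(judge, item, occasion)
        for judge in range(1, 701)
        for item in range(1, 8)
        for occasion in (1, 2)]
df = pd.DataFrame(rows, columns=["judge", "item", "occasion"])
df["score"] = pd.NA      # each cell will hold one rating on the 1-9 scale
print(df.shape)          # (9800, 4): 700 judges x 7 items x 2 occasions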
Responses:
Could I point you in the direction of
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Rhiannon Whitaker
You use the word 'reliability', which can mean several specific things: confidence interval, alpha risk, or power (beta risk). What I did once upon a time was to perform an AoV in which raters were a specific factor, removing the differences between raters from the rated objects. Then I looked at the differences between raters and found that I could see these quite easily. I also found no interactions between raters and rated item levels. Maybe that would help you, maybe not. Shall I expand on the procedure?
Jay
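A minimal sketch of Jay's AoV idea in Python (statsmodels), on simulated placeholder scores for a handful of judges; all names are my own, and the two occasions supply the replicates against which the judge-by-item interaction is tested:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
# Simulated scores for 10 of the 700 judges, to keep the model small.
df = pd.DataFrame([(j, i, o) for j in range(10) for i in range(7) for o in (1, 2)],
                  columns=["judge", "item", "occasion"])
df["score"] = rng.integers(1, 10, len(df)).astype(float)

# Raters (judges) as a factor; the replicate occasions let the
# judge-by-item interaction be tested against pure error.
fit = smf.ols("score ~ C(judge) + C(item) + C(judge):C(item)", data=df).fit()
print(anova_lm(fit, typ=2))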
You have two measurement facets, assuming items are the object of assessment: one being rater, the other being time. To do both (or to merely do intra-observer reliability with 700 raters) I would suggest a generalizability analysis. It can be managed with any program that will give variance components, plus a few simple computations. Some references:
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT Publications.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley & Sons.
Paul R. Swank, Ph.D.
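Since this is the route I intend to take, here is my attempt at a worked sketch in Python: a two-facet, fully crossed G-study (items x judges x occasions) that estimates the variance components from the ANOVA mean squares and forms a generalizability coefficient for relative decisions about items. The scores are simulated placeholders and the expected-mean-squares algebra is the standard one for this design (Brennan, 1983); corrections are welcome:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_i, n_j, n_o = 7, 700, 2    # items (objects of measurement) x judges x occasions

d = pd.DataFrame([(i, j, o)
                  for i in range(n_i)
                  for j in range(n_j)
                  for o in range(n_o)],
                 columns=["item", "judge", "occ"])
d["score"] = rng.integers(1, 10, len(d)).astype(float)   # placeholder 1-9 ratings

g = d["score"].mean()
for cols, name in [(["item"], "mi"), (["judge"], "mj"), (["occ"], "mo"),
                   (["item", "judge"], "mij"), (["item", "occ"], "mio"),
                   (["judge", "occ"], "mjo")]:
    d[name] = d.groupby(cols)["score"].transform("mean")

# Sums of squares for the seven ANOVA terms; with one observation per cell
# the three-way interaction is confounded with error ("res").
ss = {"i":   ((d.mi - g) ** 2).sum(),
      "j":   ((d.mj - g) ** 2).sum(),
      "o":   ((d.mo - g) ** 2).sum(),
      "ij":  ((d.mij - d.mi - d.mj + g) ** 2).sum(),
      "io":  ((d.mio - d.mi - d.mo + g) ** 2).sum(),
      "jo":  ((d.mjo - d.mj - d.mo + g) ** 2).sum(),
      "res": ((d.score - d.mij - d.mio - d.mjo
               + d.mi + d.mj + d.mo - g) ** 2).sum()}
dof = {"i": n_i - 1, "j": n_j - 1, "o": n_o - 1,
       "ij": (n_i - 1) * (n_j - 1), "io": (n_i - 1) * (n_o - 1),
       "jo": (n_j - 1) * (n_o - 1),
       "res": (n_i - 1) * (n_j - 1) * (n_o - 1)}
ms = {k: ss[k] / dof[k] for k in ss}

# Variance components by solving the expected-mean-squares equations
# (negative estimates truncated to zero, as is conventional).
var = {"res": ms["res"]}
var["ij"] = max((ms["ij"] - ms["res"]) / n_o, 0)
var["io"] = max((ms["io"] - ms["res"]) / n_j, 0)
var["jo"] = max((ms["jo"] - ms["res"]) / n_i, 0)
var["i"]  = max((ms["i"] - ms["ij"] - ms["io"] + ms["res"]) / (n_j * n_o), 0)
var["j"]  = max((ms["j"] - ms["ij"] - ms["jo"] + ms["res"]) / (n_i * n_o), 0)
var["o"]  = max((ms["o"] - ms["io"] - ms["jo"] + ms["res"]) / (n_i * n_j), 0)

# Generalizability coefficient for relative decisions about items,
# generalizing over n_j judges and n_o occasions.
g_rel = var["i"] / (var["i"] + var["ij"] / n_j + var["io"] / n_o
                    + var["res"] / (n_j * n_o))
print(var)
print("G (relative):", g_rel)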
The way I would go about this problem is as follows: your response variable is the number of judges giving each score of 1 to 9. Your independent variables are: item (1-7), score (1-9), occasion (1, 2), and judge (1-700). You could take "judge" as a random effect and fit a Poisson regression of the dependent variable on the independent ones. A simpler approach would be to take, for each judge, the difference between the two occasions and then fit a simple ANOVA model using the difference as your response variable, taking "judge" as a random effect. Hope I did not confuse you too much ...
Best wishes
Dr Dimitris N Lambrou
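A sketch of the simpler difference-based suggestion in Python, using statsmodels' MixedLM with "judge" as the random effect; the data and column names are placeholders:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame([(j, i, o) for j in range(700) for i in range(7) for o in (1, 2)],
                  columns=["judge", "item", "occasion"])
df["score"] = rng.integers(1, 10, len(df)).astype(float)

# Occasion-2 minus occasion-1 difference for every judge-item pair.
wide = df.pivot_table(index=["judge", "item"], columns="occasion",
                      values="score").reset_index()
wide["diff"] = wide[2] - wide[1]

# The judge variance component in the output reflects how much judges
# differ in their drift between the two occasions.
fit = smf.mixedlm("diff ~ C(item)", data=wide, groups="judge").fit()
print(fit.summary())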
You could try a kappa (or weighted kappa) test.
Shelley
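A quick sketch of the kappa suggestion with scikit-learn's cohen_kappa_score, one quadratic-weighted kappa per judge on simulated placeholder ratings (with only 7 paired ratings per judge, the individual kappas will be unstable):

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(3)
ratings = rng.integers(1, 10, size=(700, 2, 7))   # judges x occasions x items

# One weighted kappa per judge, comparing occasion 1 against occasion 2.
kappas = [cohen_kappa_score(r[0], r[1], labels=list(range(1, 10)),
                            weights="quadratic") for r in ratings]
print(np.nanmean(kappas))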
This sounds a little tricky. What is your 9-point scale? Is it ordinal or quantitative to a degree? Do you really want to measure the intra-observer reliability of the judges (i.e. how reproducible the two ratings each judge makes are), or are you interested in inter-observer reliability?

If the former, and your scale is quantitative, then you could use Shrout & Fleiss' (1979) intraclass correlation coefficient (version 1,1), but you only have 7 'subjects' for each judge, which is pretty low for assessing intra-rater reliability, and you would end up with 700 measures (one for each judge). If you want to assess how well the judges agree with each other, then you could take one of each judge's two ratings and calculate the ICC (version 2,1).

I do have a further reference which suggests you could combine the two approaches above and assess both inter- and intra-rater reliability, using all 700 judges making 2 repeated measurements on 7 items (thus you could assess inter-rater reliability without losing any data through omitting one rating or averaging the two ratings for each item): Eliasziw, M., Young, S. L., Woodbury, M. G. & Fryday-Field, K. (1994) Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Physical Therapy 74: 777-788. This would still provide 700 different measures of intra-observer reliability, though.

One word of caution: you have few repeated measures (2) on very few subjects (7) for each judge, and since the ICC compares between-subjects variation to within-subjects variation, you might not get good agreement simply because there is not enough variation between the subjects. When you say 7 different items, do you mean 7 apples, or one apple, one orange, one banana, etc.? If the latter, then this would not really permit the above analysis, because there might be reason to expect agreement to differ between items. If the former, the sample size consideration is important: in the Eliasziw paper the example given is of two machines each making 3 measurements on 30 patients; you have 700 judges making 2 measurements on 7 items. I'd be interested to see how the results come out.
If, on the other hand, your scale is ordinal, then you might need kappa analysis or weighted kappa, which is itself analogous to ICC in places. See
Shrout, P. E. & Fleiss, J. L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86: 420-428
Fleiss, J. L. (1975) Measuring agreement between two judges on the presence or absence of a trait. Biometrics 31: 651-659.
Fleiss, J. L. & Cohen, J. (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33: 613-619.
I hope this has been of some help. I would be grateful if you could forward any other advice you receive.
Regards
Liz Hensor
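Finally, a sketch of ICC(1,1) computed per judge as Liz describes, where for each judge the 7 items act as "subjects" and the two occasions as the repeated measurements; the data are simulated placeholders, and the formula is the one-way random-effects ICC of Shrout & Fleiss (1979):

import numpy as np

def icc_1_1(x):
    # One-way random-effects ICC(1,1) for an n_subjects x k_ratings array.
    n, k = x.shape
    grand = x.mean()
    ms_between = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(4)
ratings = rng.integers(1, 10, size=(700, 7, 2)).astype(float)  # judges x items x occasions

iccs = np.array([icc_1_1(judge) for judge in ratings])
print(iccs.mean(), (iccs < 0.4).mean())  # mean ICC; share below an arbitrary 0.4 cutoff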