Dear Eric,
Sounds like you have an idea already: a leave-one-out analysis over the ideas, on the single-measure ICC? Drop each idea in turn and see the impact on the ICC?
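If Python is an option, a minimal sketch could look like the following. The data frame 'ratings' and its columns 'idea', 'rater', 'rating' are just placeholders for your own data, in long format and holding one dimension at a time (e.g., originality); with ~350 ideas the refits should still be quick.
    # Leave-one-out on the single-measure ICC using pingouin
    import pandas as pd
    import pingouin as pg

    def single_icc(df):
        icc = pg.intraclass_corr(data=df, targets='idea',
                                 raters='rater', ratings='rating')
        # ICC2 = two-way random effects, absolute agreement, single rater
        return icc.set_index('Type').loc['ICC2', 'ICC']

    baseline = single_icc(ratings)
    influence = {}
    for idea in ratings['idea'].unique():
        influence[idea] = single_icc(ratings[ratings['idea'] != idea]) - baseline

    # Ideas whose removal raises the ICC the most are the likely troublemakers
    print(pd.Series(influence).sort_values(ascending=False).head(10))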
Could you also do some arithmetic on the individual ratings, e.g., take pairwise differences between raters for each idea, and then sort the 350 ideas by the size of those differences?
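With the same placeholder frame, the pairwise-difference sorting might look like this:
    # Mean absolute pairwise disagreement per idea
    import itertools

    wide = ratings.pivot(index='idea', columns='rater', values='rating')
    pairs = list(itertools.combinations(wide.columns, 2))
    disagreement = sum((wide[a] - wide[b]).abs() for a, b in pairs) / len(pairs)

    # Ideas sorted from most to least disagreement between raters
    print(disagreement.sort_values(ascending=False).head(20))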
Perhaps a mixed-effects model with items and raters as crossed random effects would be useful too; you could then look at the various random-effect estimates, and at the residuals, to see what's going on.
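A rough sketch of that model with statsmodels (same placeholder frame; crossed random effects go in as variance components within a single dummy group, which is the statsmodels pattern for crossed designs):
    # Crossed random effects for ideas and raters via variance components
    import statsmodels.formula.api as smf

    dat = ratings.assign(one=1)   # one grouping level so idea and rater cross
    md = smf.mixedlm("rating ~ 1", dat, groups="one", re_formula="0",
                     vc_formula={"idea": "0 + C(idea)",
                                 "rater": "0 + C(rater)"})
    fit = md.fit()
    print(fit.summary())          # variance due to ideas vs raters vs residual
    resid = dat["rating"] - fit.fittedvalues    # large residuals flag odd idea x rater cells
    blups = list(fit.random_effects.values())[0]  # predicted effects per idea and per rater
Fair warning: with ~350 idea levels in one group this can be slow in statsmodels; lme4 in R, e.g. lmer(rating ~ 1 + (1|idea) + (1|rater)), is the more usual tool for this kind of model.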
Best wishes,
Andy
--
Dr Andy Fugard, Lecturer, Educational Psychology Group
Research Department of Clinical, Educational and Health Psychology
University College London, 26 Bedford Way, London WC1H 0AP
Tel: +44 (0)20 7679 7554 (ext 27554) www.andyfugard.info
-----Original Message-----
From: A discussion list for methods and statistics used in psychological research. [mailto:[log in to unmask]] On Behalf Of Eric Rietzschel
Sent: 18 October 2015 19:42
To: [log in to unmask]
Subject: Interrater reliability issues
Dear colleagues,
I am a creativity researcher, and as such I often use the intraclass correlation (ICC) as a measure of interrater consistency. This usually works fine; after a couple of practice rounds, raters easily reach levels above .60. I am currently working on a dataset that poses some difficulties. Four other raters and I have rated 2 sets of about 350 ideas on 4 different dimensions (such as originality, practicality, and effectiveness). Although the initial results in a practice round were very promising, the codings for the complete sets have yielded very low ICCs - 'average measure' ICCs are between .60 and .70, but 'single measure' ICCs are somewhere between .20 and .50 - well below acceptable levels. Since we intend to use these codings to create subsets of pre-coded ideas for a future study, sufficient reliability on the single-measure ICC is quite important here.
The problem is that it is not at all clear what causes the low reliability. Discussion of the ideas did not reveal many differences of opinion, and the zero-order correlations between all pairs of raters are positive (though not very strong). Having investigated several options, I am now wondering whether there are specific strategies I could use to find out what is going wrong - are there, for example, particular ideas that are especially problematic? Deleting single raters does not make a difference, so the problem does not lie with individual raters (which could easily have been the case). Ideally, I would love to use some sort of 'ICC diagnostics' to see whether there are highly influential cases (i.e., ideas) in the sets that bring down the reliability - similar to the diagnostics one can use in a regression analysis. But perhaps there are other things I could try as well. Any suggestions are highly welcome!
Kindest regards,
Eric Rietzschel
___________________
Dr. Eric F. Rietzschel
University of Groningen
Social and Organizational Psychology
Phone: +31 (0)50 363 6357
Web: http://www.rug.nl/staff/e.f.rietzschel