
Multiple‐rater kappas for binary data: Models and interpretation

Bibliographic Details
Published in: Biometrical Journal, 2018-03, Vol. 60 (2), p. 381-394
Main Authors: Stoyan, Dietrich; Pommerening, Arne; Hummel, Manuela; Kopp‐Schneider, Annette
Format: Article
Language: English
Description
Summary: Interrater agreement on binary measurements with more than two raters is often assessed using Fleiss' κ, which is known to be difficult to interpret. In situations where the same raters rate all items, however, the far less known κ suggested by Conger, Hubert, and Schouten is more appropriate. We try to support the interpretation of these characteristics by investigating various models or scenarios of rating. Our analysis, which is restricted to binary data, shows that conclusions concerning interrater agreement by κ heavily depend on the population of items or subjects considered, even if the raters have identical behavior. The standard scale proposed by Landis and Koch, which verbally interprets numerical values of κ, appears to be rather subjective. On the basis of one of the models for rater behavior, we suggest an alternative verbal interpretation for kappa. Finally, we reconsider a classical example from pathology to illustrate the application of our methods and models. We also look for subgroups of raters with similar rating behavior using hierarchical clustering.
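For readers who want to see the statistic the abstract discusses in concrete terms, the sketch below computes Fleiss' κ for a complete binary rating matrix, i.e. the situation where the same raters rate all items. The function name `fleiss_kappa_binary` and the example data are hypothetical illustrations of the observed-versus-chance-agreement structure of the statistic; they do not reproduce the authors' analysis.

```python
# Minimal sketch: Fleiss' kappa for binary ratings, assuming every item
# is rated by the same (fixed) number of raters. Data are hypothetical.
import numpy as np

def fleiss_kappa_binary(ratings):
    """ratings: (n_items, n_raters) array of 0/1 ratings."""
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    pos = ratings.sum(axis=1)          # positive votes per item
    neg = n_raters - pos               # negative votes per item
    # Observed agreement: proportion of agreeing rater pairs per item
    p_obs = ((pos * (pos - 1) + neg * (neg - 1)) /
             (n_raters * (n_raters - 1))).mean()
    # Chance agreement from the pooled marginal proportion of positives
    p = pos.sum() / (n_items * n_raters)
    p_exp = p**2 + (1 - p)**2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical data: 6 items, 4 raters
ratings = np.array([[1, 1, 1, 0],
                    [0, 0, 0, 0],
                    [1, 1, 0, 1],
                    [0, 1, 0, 0],
                    [1, 1, 1, 1],
                    [0, 0, 1, 0]])
print(round(fleiss_kappa_binary(ratings), 3))
```

The κ attributed to Conger, Hubert, and Schouten differs in the chance-agreement term: instead of the pooled proportion used above, it is based on the rater-specific marginal proportions, which is why it is preferred when the same raters rate all items.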
ISSN: 0323-3847, 1521-4036
DOI: 10.1002/bimj.201600267