The Effect of the Raters' Marginal Distributions on Their Matched Agreement: A Rescaling Framework for Interpreting Kappa

Bibliographic Details
Published in:	Multivariate behavioral research 2013-11, Vol.48 (6), p.923-952
Main Authors:	Karelitz, Tzur M.; Budescu, David V.
Format:	Article
Language:	English
Subjects:	Classification; Comparative analysis; Simulation

Description
Summary:	Cohen's κ measures the improvement in classification above chance level and it is the most popular measure of interjudge agreement. Yet, there is considerable confusion about its interpretation. Specifically, researchers often ignore the fact that the observed level of matched agreement is bounded from above and below and the bounds are a function of the particular marginal distributions of the table. We propose that these bounds should be used to rescale the components of κ (observed and expected agreement). Rescaling κ in this manner results in κ′, a measure that was originally proposed by Cohen (1960) and was largely ignored in both research and practice. This measure provides a common scale for agreement measures of tables with different marginal distributions. It reaches the maximal value of 1 when the judges show the highest level of agreement possible, given their marginal disagreements. We conclude that κ′ should be used to measure the level of matched agreement contingent on a particular set of marginal distributions. The article provides a framework and a set of guidelines that facilitate comparisons between various types of agreement tables. We illustrate our points with simulations and real data from two studies: one involving judges' ratings of baseball players and one involving ratings of essays in high-stakes tests.
ISSN:	0027-3171, 1532-7906
DOI:	10.1080/00273171.2013.830064
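
The rescaling described in the summary can be illustrated with a minimal sketch. It is not taken from the article. It assumes the commonly cited form of Cohen's (1960) maximum-agreement correction, in which the upper bound on observed agreement given the marginals is the sum over categories of the smaller of the two raters' marginal proportions, and κ′ = (p_o − p_e) / (p_max − p_e). The function name `kappa_and_rescaled` and the example table are hypothetical.

```python
from typing import List, Sequence, Tuple


def kappa_and_rescaled(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (kappa, kappa_prime) for a square agreement table of counts.

    Assumption (not from the catalog record): kappa_prime is kappa divided
    by its maximum attainable value given the marginals, i.e.
    (p_o - p_e) / (p_max - p_e), with p_max = sum_i min(row_i, col_i).
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("table must be square")
    n = sum(sum(row) for row in table)
    if n <= 0:
        raise ValueError("table must contain at least one observation")

    # Convert counts to joint proportions.
    p: List[List[float]] = [[cell / n for cell in row] for row in table]

    # Marginal distributions of the two raters.
    row_m = [sum(p[i]) for i in range(k)]
    col_m = [sum(p[i][j] for i in range(k)) for j in range(k)]

    p_o = sum(p[i][i] for i in range(k))                     # observed agreement
    p_e = sum(row_m[i] * col_m[i] for i in range(k))         # chance agreement
    p_max = sum(min(row_m[i], col_m[i]) for i in range(k))   # upper bound on p_o

    if p_e >= 1.0 or p_max <= p_e:
        raise ValueError("kappa is undefined for degenerate marginals")

    kappa = (p_o - p_e) / (1.0 - p_e)
    kappa_prime = (p_o - p_e) / (p_max - p_e)
    return kappa, kappa_prime


# Hypothetical 2x2 table: rows are rater A's categories, columns rater B's.
example = [[40, 10],
           [0, 50]]
kappa, kappa_prime = kappa_and_rescaled(example)
print(round(kappa, 3), round(kappa_prime, 3))
```

In this made-up table the raters' marginals differ (0.5/0.5 versus 0.4/0.6), so observed agreement cannot exceed 0.9. The raters reach that ceiling, which gives κ of about 0.8 and κ′ of about 1, matching the summary's statement that κ′ reaches 1 when agreement is the highest possible given the marginal disagreements.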