Common Flaws in Running Human Evaluation Experiments in NLP
Published in: Computational Linguistics, Association for Computational Linguistics, June 2024, Vol. 50, No. 2, pp. 795-805
Main Authors:
Format: Article
Language: English
Summary: While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.
ISSN: 0891-2017, 1530-9312
DOI: 10.1162/coli_a_00508