Understanding Dataset Difficulty with \(\mathcal{V}\)-Usable Information
| Published in: | arXiv.org 2022-06 |
|---|---|
| Main Authors: | |
| Format: | Article |
| Language: | English |
| Online Access: | Get full text |
| Summary: | Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model \(\mathcal{V}\) -- as the lack of \(\mathcal{V}\)-\(\textit{usable information}\) (Xu et al., 2019), where a lower value indicates a more difficult dataset for \(\mathcal{V}\). We further introduce \(\textit{pointwise }\mathcal{V}\textit{-information}\) (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, \(\mathcal{V}\)-\(\textit{usable information}\) and PVI also permit the converse: for a given model \(\mathcal{V}\), we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks. |
| ISSN: | 2331-8422 |
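
For reference, a brief sketch of the quantities the summary names, paraphrased from Xu et al. (2019) and the framing described above; the symbols \(g\) and \(g'\) are assumed here to denote models in the family \(\mathcal{V}\) fit on the actual input \(x\) and on a null input \(\varnothing\), respectively:

\[
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X),
\qquad
\mathrm{PVI}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y).
\]

Under this sketch, \(\mathcal{V}\)-usable information is estimated as the average PVI over held-out instances, so a lower value indicates a dataset that is harder for \(\mathcal{V}\), and low-PVI instances are the individually difficult ones.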