Loading…

Minority Views Matter: Evaluating Speech Emotion Classifiers with Human Subjective Annotations by an All-Inclusive Aggregation Rule

When selecting test data for subjective tasks, most studies define ground truth labels using aggregation methods such as the majority or plurality rules. These methods discard data points without consensus, making the test set easier than practical tasks where a prediction is needed for each sample....

Full description

Saved in:
Bibliographic Details
Published in:IEEE transactions on affective computing 2024-06, p.1-15
Main Authors: Chou, Huang-Cheng, Goncalves, Lucas, Leem, Seong-Gyun, Salman, Ali N., Lee, Chi-Chun, Busso, Carlos
Format: Article
Language:English
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:When selecting test data for subjective tasks, most studies define ground truth labels using aggregation methods such as the majority or plurality rules. These methods discard data points without consensus, making the test set easier than practical tasks where a prediction is needed for each sample. However, the discarded data points often express ambiguous cues that elicit coexisting traits perceived by annotators. This paper addresses the importance of considering all the annotations and samples in the data, highlighting that only showing the model's performance on an incomplete test set selected by using the majority or plurality rules can lead to bias in the models' performances. We focus on speech-emotion recognition (SER) tasks. We observe that traditional aggregation rules have a data loss ratio ranging from 5.63% to 89.17%. From this observation, we propose a flexible method named the all-inclusive aggregation rule to evaluate SER systems on the complete test data. We contrast traditional single-label formulations with a multi-label formulation to consider the coexistence of emotions. We show that training an SER model with the data selected by the all-inclusive aggregation rule shows consistently higher macro-F1 scores when tested in the entire test set, including ambiguous samples without agreement.
ISSN:1949-3045
1949-3045
DOI:10.1109/TAFFC.2024.3411290