Uneven success: automatic speech recognition and ethnicity-related dialects

Bibliographic Details
Published in: Speech Communication 2022-05, Vol.140, p.50-70
Main Authors: Wassink, Alicia Beckford; Gansen, Cady; Bartholomew, Isabel
Format: Article
Language: English
Description
Summary:
• Comparison of accuracy of the Microsoft Speech Services conversational ASR system finds that phonetic error rates are higher for speech samples of nonwhite speakers than for those of white speakers
• Sociophonetic variables account for 20% of ASR errors
• CLOx allows automated transcription of conversational speech in one-fifth of the time required to manually produce an orthographic transcription
• Steady improvements to ASR systems greatly expedite and simplify the task of sociolinguistic data analysis

Addressing racial bias in automatic speech recognition is an area of concern in fields associated with human-computer interaction. Research to date suggests that sociolinguistic variation, namely systematic sources of sociophonetic variation, has yet to be extensively exploited in acoustic model architectures. This paper reports a study that evaluates the performance of one ASR system for a multi-ethnic sample of speakers from the American Pacific Northwest (including Native American, African American, European American and ChicanX speakers). Using a sociophonetic approach to characterizing vocalic and consonantal variation, we ask which dialect features appear to be most challenging for our ASR system. We also ask which error types are particular to the four ethnic dialects sampled. Recordings of both conversational and read speech were coded for a common set of 18 sociophonetic variables with distinct phonetic profiles. Automatic transcription was achieved using CLOx, a custom-built ASR system created for sociolinguistic analysis. Normalized error frequency rates were compared across ethnic samples to evaluate CLOx performance. Nf error rates demonstrate clear differential performance in the ASR system, pointing to racial bias in system output. Specific predictions are made regarding approaches that might be taken to leverage sociophonetic knowledge to improve social dialect-recognition accuracy in ASR systems.
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2022.03.009