Uneven success: automatic speech recognition and ethnicity-related dialects

Bibliographic Details
Published in:	Speech communication 2022-05, Vol.140, p.50-70
Main Authors:	Wassink, Alicia Beckford; Gansen, Cady; Bartholomew, Isabel
Format:	Article
Language:	English
Subjects:	Acoustic phonetics; African Americans; Automatic speech recognition; Bias; Dialects; Error analysis; Ethnicity; Human-computer interaction; Performance evaluation; Phonetic transcription; Phonetic variation; Racial bias; Racism; Regional dialects; Sociolinguistics; Sociophonetics; Speech recognition; Voice recognition

Description
Summary:
• Comparison of accuracy of the Microsoft Speech Services conversational ASR system finds that phonetic error rates are higher for speech samples of nonwhite, than for white, speakers
• Sociophonetic variables account for 20% of ASR errors
• CLOx allows automated transcription of conversational speech in one-fifth of the time required to manually produce an orthographic transcription
• Steady improvements to ASR systems greatly expedite and simplify the task of sociolinguistic data analysis

Addressing racial bias in automatic speech recognition is an area of concern in fields associated with human-computer interaction. Research to date suggests that sociolinguistic variation, namely systematic sources of sociophonetic variation, has yet to be extensively exploited in acoustic model architectures. This paper reports a study that evaluates the performance of one ASR system for a multi-ethnic sample of speakers from the American Pacific Northwest (including Native American, African American, European American and ChicanX speakers). Using a sociophonetic approach to characterizing vocalic and consonantal variation, we ask which dialect features appear to be most challenging for our ASR system. We also ask which error types are particular to the four ethnic dialects sampled. Recordings of both conversational and read speech were coded for a common set of 18 sociophonetic variables with distinct phonetic profiles. Automatic transcription was achieved using CLOx, a custom-built ASR system created for sociolinguistic analysis. Normalized error frequency rates were compared across ethnic samples to evaluate CLOx performance. Nf error rates demonstrate clear differential performance in the ASR system, pointing to racial bias in system output. Specific predictions are made regarding approaches that might be taken to leverage sociophonetic knowledge to improve social dialect-recognition accuracy in ASR systems.
ISSN:	0167-6393 1872-7182
DOI:	10.1016/j.specom.2022.03.009