Uneven success: automatic speech recognition and ethnicity-related dialects
Published in: | Speech communication 2022-05, Vol.140, p.50-70 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | • Comparison of accuracy of the Microsoft Speech Services conversational ASR system finds that phonetic error rates are higher for speech samples of nonwhite, than for white, speakers
• Sociophonetic variables account for 20% of ASR errors
• CLOx allows automated transcription of conversational speech in one-fifth of the time required to manually produce an orthographic transcription
• Steady improvements to ASR systems greatly expedite and simplify the task of sociolinguistic data analysis
Addressing racial bias in automatic speech recognition is an area of concern in fields associated with human-computer interaction. Research to date suggests that sociolinguistic variation, namely systematic sources of sociophonetic variation, has yet to be extensively exploited in acoustic model architectures. This paper reports a study that evaluates the performance of one ASR system for a multi-ethnic sample of speakers from the American Pacific Northwest (including Native American, African American, European American and ChicanX speakers). Using a sociophonetic approach to characterizing vocalic and consonantal variation, we ask which dialect features appear to be most challenging for our ASR system. We also ask which error types are particular to the four ethnic dialects sampled. Recordings of both conversational and read speech were coded for a common set of 18 sociophonetic variables with distinct phonetic profiles. Automatic transcription was achieved using CLOx, a custom-built ASR system created for sociolinguistic analysis. Normalized error frequency (Nf) rates were compared across ethnic samples to evaluate CLOx performance. The Nf error rates demonstrate clear differential performance in the ASR system, pointing to racial bias in system output. Specific predictions are made regarding approaches that might be taken to leverage sociophonetic knowledge to improve social dialect-recognition accuracy in ASR systems. |
---|---|
ISSN: | 0167-6393, 1872-7182 |
DOI: | 10.1016/j.specom.2022.03.009 |