Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, pp. 286-299
Format: Article
Language: English
Summary: Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work, we propose M2DS2, a simple and sample-efficient fine-tuning strategy for large pre-trained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments, we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem in a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2023.3328280
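
The summary describes M2DS2 as fine-tuning a large pre-trained speech model with a supervised objective on labeled source data plus self-supervision on audio from both the source and the target domain. Below is a minimal, illustrative sketch of that mixing scheme, not the authors' implementation (which builds on wav2vec 2.0): `ToyEncoder`, `masked_reconstruction_loss`, and the mixing weights `alpha` and `beta` are hypothetical stand-ins introduced here for illustration.

```python
# Minimal sketch of a mixed-self-supervision fine-tuning step in the spirit
# of M2DS2 as summarized above: supervised CTC loss on labeled source audio
# plus self-supervised losses on BOTH source and target audio.
# ToyEncoder, masked_reconstruction_loss, alpha, and beta are illustrative
# stand-ins, not the paper's implementation (which builds on wav2vec 2.0).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained speech encoder (e.g., wav2vec 2.0)."""
    def __init__(self, n_mels=80, hidden=256, vocab=32):
        super().__init__()
        self.backbone = nn.GRU(n_mels, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)      # frame-level logits
        self.recon_head = nn.Linear(hidden, n_mels)   # for the toy SSL loss

    def forward(self, feats):
        h, _ = self.backbone(feats)                   # (B, T, hidden)
        return h

def masked_reconstruction_loss(model, feats, mask_prob=0.15):
    """Toy self-supervised objective: mask random frames, predict them back.
    Stands in for the contrastive objective used by wav2vec 2.0."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = model.recon_head(model(corrupted))
    return F.mse_loss(pred[mask], feats[mask])

def m2ds2_step(model, src_feats, src_targets, src_target_lens, tgt_feats,
               alpha=1.0, beta=1.0):
    """One training step mixing supervision with two-domain self-supervision."""
    # 1) Supervised CTC loss on the labeled source batch.
    h = model(src_feats)
    log_probs = model.ctc_head(h).log_softmax(-1).transpose(0, 1)  # (T, B, V)
    in_lens = torch.full((src_feats.size(0),), log_probs.size(0),
                         dtype=torch.long)
    ctc = F.ctc_loss(log_probs, src_targets, in_lens, src_target_lens)
    # 2) Self-supervision on unlabeled TARGET audio (the adaptation signal).
    ssl_tgt = masked_reconstruction_loss(model, tgt_feats)
    # 3) Self-supervision on SOURCE audio as well; per the summary, keeping
    #    this term stabilizes training and avoids mode collapse.
    ssl_src = masked_reconstruction_loss(model, src_feats)
    return ctc + alpha * ssl_tgt + beta * ssl_src

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyEncoder()
    src = torch.randn(4, 100, 80)                     # labeled source batch
    tgt = torch.randn(4, 100, 80)                     # unlabeled target batch
    targets = torch.randint(1, 32, (4, 12))           # dummy transcripts
    target_lens = torch.full((4,), 12, dtype=torch.long)
    loss = m2ds2_step(model, src, targets, target_lens, tgt)
    loss.backward()
    print(f"mixed loss: {loss.item():.3f}")
```

The key design point the sketch captures is that the self-supervised term is computed on a mix of source and target audio rather than on target audio alone, which is what the summary credits with stabilizing adaptation when only a few hours of in-domain audio are available.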