Continual Learning for On-Device Speech Recognition Using Disentangled Conformers

Bibliographic Details
Main Authors: Diwan, Anuj, Yeh, Ching-Feng, Hsu, Wei-Ning, Tomasello, Paden, Choi, Eunsol, Harwath, David, Mohamed, Abdelrahman
Format: Conference Proceeding
Language: English
Description
Summary: Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt NetAug, a general-purpose training algorithm, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR, i.e., LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual, they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
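The frozen-core/tunable-augment split described above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration of the general idea, not the authors' DisConformer implementation: the class name CoreAugmentLinear and the low-rank augment path are invented here purely to show how a shared frozen network can be paired with a small trainable per-speaker component.

import torch
import torch.nn as nn

class CoreAugmentLinear(nn.Module):
    """Hypothetical sketch: a frozen shared 'core' projection plus a small
    trainable 'augment' path for speaker-specific adaptation."""

    def __init__(self, dim: int, augment_dim: int):
        super().__init__()
        self.core = nn.Linear(dim, dim)            # general-purpose weights, kept frozen
        self.augment = nn.Linear(dim, augment_dim) # speaker-specific low-dimensional path
        self.expand = nn.Linear(augment_dim, dim)  # project augment output back to model dim
        # Freeze the core so only the augment path is tuned per speaker.
        for p in self.core.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Core output is fixed; the augment path adds a learned per-speaker correction.
        return self.core(x) + self.expand(self.augment(x))

layer = CoreAugmentLinear(dim=256, augment_dim=16)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only augment/expand parameters remain tunable

Keeping the core frozen is what makes continual adaptation compute-efficient in this setup: each new speaker only updates the small augment parameters, and the general-purpose model cannot be degraded by speaker-specific tuning.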
ISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10095484