Multiaccent EMG-to-Speech Optimized Transduction With PerFL and MAML Adaptations
Published in: IEEE Transactions on Instrumentation and Measurement, 2024, Vol. 73, pp. 1-17
Main Authors:
Format: Article
Language: English
Summary: Silent speech voicing enables individuals with speech impairments to communicate solely through facial muscle movements, bypassing the need for vocalization. Typically, electromyography (EMG) is used in conjunction with voice signals from individuals with normal speech for training purposes. Existing studies target a single accent acquired with a single device, ignoring the multiple accents of diverse ethnic backgrounds; this poses challenges for developing generalized and adaptive solutions. To address this, we propose a comprehensive approach consisting of the following: 1) a multiaccent EMG-to-speech silent voicing dataset; 2) an optimized transduction model (EMG-to-speech features); 3) a model-agnostic meta-learning (MAML) approach to adapt across cross-accented data; and 4) a personalized federated learning (PerFL) solution that uses MAML initialization to improve global model convergence. Our novel transduction model incorporates three key elements: 1) convolution layers with a Squeeze-and-Excitation network to enhance channel-wise interdependencies (feature recalibration); 2) a gating multilayer perceptron to enhance global context awareness via linear projections along the channel dimension; and 3) transformers that learn temporal features across the EMG time series. We validated our algorithm on publicly available and proprietary (from our research laboratory) datasets. To simulate real-world conditions, the proprietary dataset was generated using three different biosignal devices, yielding heterogeneous data with 1370 utterances from eight subjects with three distinct accents. Our proposed transduction model outperformed traditional methods, with 1.3%-3.5% improvements in word error rate (WER) on the public dataset. Moreover, we studied two MAML variants and their impact on PerFL initialization. Detailed results, covering performance metrics such as confusability, accuracy, character error rate (CER), and WER, are presented for both the public and proprietary datasets.
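The paper's full transduction model is not reproduced in this record, but the Squeeze-and-Excitation "feature recalibration" it mentions can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the shapes, weight initialization, and reduction ratio are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def squeeze_excite(x, w1, w2):
    """Channel-wise recalibration of an EMG feature map x of shape (C, T)."""
    s = x.mean(axis=1)                    # squeeze: global average pool over time -> (C,)
    z = np.maximum(w1 @ s, 0.0)           # excitation bottleneck with ReLU -> (C/r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ z)))   # per-channel sigmoid gate -> (C,)
    return x * g[:, None]                 # rescale each channel by its learned gate

C, T, r = 8, 100, 2                       # channels, time steps, reduction ratio (assumed)
x = rng.standard_normal((C, T))           # stand-in for a conv-layer EMG feature map
w1 = rng.standard_normal((C // r, C))     # bottleneck weights (random for illustration)
w2 = rng.standard_normal((C, C // r))
y = squeeze_excite(x, w1, w2)
print(y.shape)
```

Because each gate lies in (0, 1), the block can only attenuate channels relative to one another; in training, the bottleneck weights learn which EMG channels to emphasize, which is the "channel-wise interdependency" modeling the abstract refers to.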
ISSN: 0018-9456, 1557-9662
DOI: 10.1109/TIM.2024.3449948