
Multiaccent EMG-to-Speech Optimized Transduction With PerFL and MAML Adaptations

Bibliographic Details
Published in: IEEE Transactions on Instrumentation and Measurement, 2024, Vol. 73, pp. 1-17
Main Authors: Ullah, Shan; Kim, Deok-Hwan
Format: Article
Language: English
Description
Summary: Silent speech voicing enables individuals with speech impairments to communicate solely through facial muscle movements, bypassing the need for vocalization. Typically, electromyography (EMG) is used in conjunction with voice signals from individuals with normal speech for training. Existing studies target a single accent using a single acquisition device, ignoring the multiple accents of diverse ethnic backgrounds, which poses challenges for developing generalized and adaptive solutions. To address this, we propose a comprehensive approach consisting of the following: 1) a multiaccent EMG-to-speech silent voicing dataset; 2) an optimized transduction model (EMG-to-speech features); 3) a model-agnostic meta-learning (MAML) approach to adapt across cross-accented data; and 4) a personalized federated learning (PerFL) solution that uses MAML initialization to improve global model convergence. Our novel transduction model incorporates three key elements: 1) convolution layers with a Squeeze-and-Excitation network to enhance channel-wise interdependencies (feature recalibration); 2) a gating multilayer perceptron that enhances global context awareness through linear projections along the channel dimension; and 3) transformers that learn temporal features across the EMG time series. We validated our algorithm on both publicly available and proprietary (from our research laboratory) datasets. To simulate real-world conditions, the proprietary dataset was collected with three different biosignal devices, yielding heterogeneous data comprising 1370 utterances from eight subjects with three distinct accents. Our proposed transduction model outperformed traditional methods, with 1.3%-3.5% improvements in word error rate (WER) on the public dataset. Moreover, we studied two MAML variants and their impact on PerFL initialization.
Detailed results, covering performance metrics such as confusability, accuracy, character error rate (CER), and WER, are presented for both the public and proprietary datasets.
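The first element of the transduction model, channel-wise feature recalibration via Squeeze-and-Excitation, can be illustrated with a minimal NumPy sketch. This is a generic SE block applied to a (channels, time) EMG feature map, not the authors' exact layer: the function name `squeeze_excite`, the reduction ratio `r`, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Generic Squeeze-and-Excitation recalibration on a (C, T) feature map."""
    # Squeeze: global average pool over time -> one descriptor per channel
    z = x.mean(axis=1)                      # shape (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid gate in (0, 1)
    s = np.maximum(w1 @ z, 0.0)             # shape (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # shape (C,)
    # Recalibrate: rescale each channel by its learned importance gate
    return x * gate[:, None]

rng = np.random.default_rng(0)
C, T, r = 8, 100, 4                         # channels, time steps, reduction ratio
x = rng.standard_normal((C, T))             # stand-in for EMG conv features
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (8, 100)
```

The bottleneck (C to C // r and back) is what lets the block model cross-channel interdependencies cheaply before rescaling each channel.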
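The MAML adaptation idea (point 3 above) can likewise be sketched as a first-order meta-update over a toy family of least-squares tasks standing in for per-accent EMG data. Everything here (the name `maml_meta_step`, the linear tasks, the learning rates) is a hypothetical illustration of generic first-order MAML, not the paper's actual training loop or either of the MAML variants it compares.

```python
import numpy as np

def maml_meta_step(theta, tasks, inner_lr=0.1, meta_lr=0.05):
    """One first-order MAML meta-update for linear least-squares tasks.

    Each task is (X, y) with mean-squared-error loss for the model X @ theta.
    Inner loop: one gradient step per task; outer update: average the
    gradients evaluated at the adapted parameters (first-order approximation).
    """
    meta_grad = np.zeros_like(theta)
    for X, y in tasks:
        g = 2 * X.T @ (X @ theta - y) / len(y)       # task gradient at theta
        theta_task = theta - inner_lr * g            # task-specific adaptation
        meta_grad += 2 * X.T @ (X @ theta_task - y) / len(y)
    return theta - meta_lr * meta_grad / len(tasks)

rng = np.random.default_rng(1)

def make_task(w_true):
    X = rng.standard_normal((20, 2))
    return X, X @ w_true

# Two tasks with different ground-truth weights, standing in for two accents
tasks = [make_task(np.array([1.0, -0.5])), make_task(np.array([0.6, 0.2]))]

def mean_loss(theta):
    return np.mean([np.mean((X @ theta - y) ** 2) for X, y in tasks])

theta = np.zeros(2)
before = mean_loss(theta)
for _ in range(50):
    theta = maml_meta_step(theta, tasks)
print(mean_loss(theta) < before)  # True: meta-training reduced average loss
```

The resulting `theta` is an initialization that adapts quickly to any one task, which is exactly the property the abstract exploits when seeding the PerFL global model with MAML-trained weights.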
ISSN: 0018-9456, 1557-9662
DOI: 10.1109/TIM.2024.3449948