Loading…
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in end-to-end frameworks with low-quality videos. Unmatching convergence rates and specialized input representations between audio-visual modalities are...
Saved in:
Main Authors: | , , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in end-to-end frameworks with low-quality videos. Unmatching convergence rates and specialized input representations between audio-visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin through a frame-level subword unit classification task with visual streams as input. The fine-grained subword labels guide the network to capture temporal relationships between lip shapes and result in an accurate alignment between video and audio streams. Next, we propose an audio-guided Cross-Modal Fusion Encoder (CMFE) to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends. The code is released at 1 . |
---|---|
ISSN: | 1945-788X |
DOI: | 10.1109/ICME55011.2023.00447 |