
A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Bibliographic Details
Published in:Sensors (Basel, Switzerland), 2023-10, Vol.23 (21), p.8770
Main Authors: Li, Guizhu, Fu, Min, Sun, Mengnan, Liu, Xuefeng, Zheng, Bing
Format: Article
Language:English
Description
Summary:The cocktail party problem can be addressed more effectively by using both the speaker’s visual and audio information. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information other than the face that correlates only weakly with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Third, the loss function incorporates audio-visual similarity to exploit the relationship between the audio and visual streams more fully. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR.
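For readers unfamiliar with the SDR figure quoted in the summary, the sketch below computes a basic signal-to-distortion ratio in decibels for a single reference/estimate pair. It is only an illustration of the general definition, 10·log10(signal energy / error energy); the function name `sdr_db` and the `eps` guard are assumptions made here, and the paper's actual evaluation may use a different SDR variant or toolkit.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB (illustrative, not the paper's code).

    reference: clean target samples (sequence of floats)
    estimate:  separated output samples, same length as reference
    eps:       small constant (an assumption) to avoid division by zero
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Energy of the clean target signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual error between target and estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: the estimate is the reference scaled by 0.9,
# so the error amplitude is 10% of the signal amplitude.
reference = [1.0] * 100
estimate = [0.9] * 100
toy_sdr = sdr_db(reference, estimate)
```

Because the scale is logarithmic, a 4 dB gain in this measure corresponds to the residual error energy dropping by a factor of 10^0.4, roughly 2.5, for the same target signal.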
ISSN:1424-8220
DOI:10.3390/s23218770