Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

Bibliographic Details
Published in:Applied intelligence (Dordrecht, Netherlands), 2024-03, Vol.54 (6), p.4507-4524
Main Authors: Ghosh, Subhayu; Sarkar, Snehashis; Ghosh, Sovan; Zalkow, Frank; Jana, Nanda Dulal
Format: Article
Language:English
Description
Summary:Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in the realm of audio-visual learning. AVSS transforms one speaker’s speech into another’s audio-visual stream while retaining linguistic content. This approach extends existing AVSS methods by first modifying vocal features from the source to the target speaker, akin to voice conversion (VC), and then synthesizing the audio-visual stream for the target speaker, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based Autoencoders (AEs), enriched with a combination of cycle consistency and reconstruction loss functions, with the aim of enhancing synthesis quality. Leveraging ViT’s attention mechanism, this method effectively captures spectral and temporal features from input speech. The combination of cycle consistency and reconstruction loss improves synthesis quality and aids in preserving essential information. The proposed framework is trained and tested on benchmark datasets, and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate the superiority of the proposed approach over existing SOTA models, in terms of quality and intelligibility for AVSS, indicating the potential for real-world applications.
ISSN:0924-669X
EISSN:1573-7497
DOI:10.1007/s10489-024-05380-7
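
Note: the summary describes combining a reconstruction loss with a cycle consistency loss in an autoencoder setting. The following is only a minimal sketch of that general idea, not the authors' implementation. The function names (`l1_loss`, `combined_loss`), the shared-encoder, per-speaker-decoder layout, the L1 distance, and the weight `lambda_cyc` are all assumptions made for illustration; the record does not specify any of them.

```python
import numpy as np


def l1_loss(a, b):
    """Mean absolute error between two arrays (assumed distance measure)."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))


def combined_loss(x, encode, decode_src, decode_tgt, lambda_cyc=1.0):
    """Hypothetical reconstruction + cycle consistency objective.

    x          : source-speaker features (e.g. a spectrogram array)
    encode     : assumed shared encoder, features -> latent
    decode_src : assumed decoder for the source speaker
    decode_tgt : assumed decoder for the target speaker
    lambda_cyc : assumed weight on the cycle term
    """
    z = encode(x)
    x_rec = decode_src(z)                # reconstruct the source input
    x_conv = decode_tgt(z)               # convert to the target speaker
    x_cyc = decode_src(encode(x_conv))   # map the conversion back to source

    rec = l1_loss(x, x_rec)              # reconstruction term
    cyc = l1_loss(x, x_cyc)              # cycle consistency term
    return rec + lambda_cyc * cyc, rec, cyc


if __name__ == "__main__":
    # Toy stand-ins for trained networks, purely for demonstration.
    x = np.zeros((4, 8))
    total, rec, cyc = combined_loss(
        x,
        encode=lambda v: v,
        decode_src=lambda z: z,
        decode_tgt=lambda z: z + 1.0,
        lambda_cyc=0.5,
    )
    print(total, rec, cyc)
```

In a real system the three callables would be trained networks (the summary mentions ViT-based autoencoders) and the losses would be computed in a differentiable framework; plain NumPy is used here only to keep the arithmetic visible.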