
Talking Face Generation With Audio-Deduced Emotional Landmarks

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-10, Vol. 35 (10), p. 14099-14111
Main Authors: Zhai, Shuyan, Liu, Meng, Li, Yongqiang, Gao, Zan, Zhu, Lei, Nie, Liqiang
Format: Article
Language:English
Description
Summary: The goal of talking face generation is to synthesize a sequence of face images of a specified identity, ensuring that the mouth movements are synchronized with the given audio. Recently, image-based talking face generation has emerged as a popular approach: it can generate talking face images synchronized with the audio from only a facial image of an arbitrary identity and an audio clip. Despite requiring such easily accessible inputs, it forgoes exploiting the emotion in the audio, causing the generated faces to suffer from unsynchronized emotion, inaccurate mouth shapes, and deficient image quality. In this article, we build a two-stage audio emotion-aware talking face generation (AMIGO) framework to generate high-quality talking face videos with cross-modally synchronized emotion. Specifically, in stage one, we propose a sequence-to-sequence (seq2seq) cross-modal emotional landmark generation network to produce vivid landmarks whose lip movements and emotion are both synchronized with the input audio. Meanwhile, we utilize a coordinated visual emotion representation to improve the extraction of its audio counterpart. In stage two, a feature-adaptive visual translation network is designed to translate the synthesized landmarks into facial images. Concretely, we propose a feature-adaptive transformation module to fuse the high-level representations of landmarks and images, yielding a significant improvement in image quality. We perform extensive experiments on the multi-view emotional audio-visual dataset (MEAD) and the crowd-sourced emotional multimodal actors dataset (CREMA-D), demonstrating that our model outperforms state-of-the-art methods.
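The abstract outlines a two-stage pipeline: a seq2seq network maps audio features (together with an emotion representation) to emotional facial landmarks, and a translation network fuses landmark and image features to render face frames. A minimal PyTorch-style sketch of this flow is given below; the module names, feature dimensions, GRU encoder-decoder, and FiLM-style feature modulation are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage AMIGO pipeline described in the abstract.
# All module names, dimensions, and the FiLM-style fusion are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn

class EmotionalLandmarkSeq2Seq(nn.Module):
    """Stage 1 (sketch): map an audio feature sequence to a landmark sequence."""
    def __init__(self, audio_dim=80, emo_dim=16, hidden=256, lm_points=68):
        super().__init__()
        self.encoder = nn.GRU(audio_dim + emo_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, lm_points * 2)  # (x, y) per landmark

    def forward(self, audio_feats, emo_embedding):
        # audio_feats: (B, T, audio_dim); emo_embedding: (B, emo_dim)
        emo = emo_embedding.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        enc_out, _ = self.encoder(torch.cat([audio_feats, emo], dim=-1))
        dec_out, _ = self.decoder(enc_out)
        return self.head(dec_out)  # (B, T, lm_points * 2)

class FeatureAdaptiveTranslation(nn.Module):
    """Stage 2 (sketch): translate landmarks plus a reference image into a frame,
    fusing landmark features into image features via a learned scale and shift."""
    def __init__(self, lm_points=68, feat_ch=64):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.lm_mlp = nn.Sequential(
            nn.Linear(lm_points * 2, 256), nn.ReLU(),
            nn.Linear(256, feat_ch * 2),  # per-channel scale and shift
        )
        self.img_dec = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, ref_image, landmarks):
        # ref_image: (B, 3, H, W); landmarks: (B, lm_points * 2)
        feats = self.img_enc(ref_image)
        scale, shift = self.lm_mlp(landmarks).chunk(2, dim=-1)
        feats = feats * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.img_dec(feats)

if __name__ == "__main__":
    stage1 = EmotionalLandmarkSeq2Seq()
    stage2 = FeatureAdaptiveTranslation()
    audio = torch.randn(2, 50, 80)        # 50 audio frames, 80-dim features
    emotion = torch.randn(2, 16)          # coordinated emotion representation
    ref = torch.randn(2, 3, 128, 128)     # reference identity image
    landmarks = stage1(audio, emotion)    # (2, 50, 136)
    frame = stage2(ref, landmarks[:, 0])  # first generated frame: (2, 3, 128, 128)
    print(frame.shape)
```

Here the feature-adaptive transformation is approximated as a per-channel modulation of the image features by the landmark embedding; the paper's actual fusion module and training losses may differ.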
ISSN:2162-237X
2162-2388
DOI:10.1109/TNNLS.2023.3274676