Stochastic Latent Talking Face Generation Toward Emotional Expressions and Head Poses
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-04, Vol. 34 (4), pp. 2734-2748
Main Authors: , , , ,
Format: Article
Language: English
Summary: Current talking face generation methods have achieved promising lip-synchronization results, but still struggle to generate talking face videos that exhibit emotional expressions and head poses. Studies in psychology have demonstrated that people may manifest diverse facial animations that follow a time-varying distribution. This presents two stochastic challenges that make generating appropriate emotional expressions and head poses difficult: (1) modelling the time-varying distribution of facial deformations to synthesize the stochastic dynamics of emotional expressions and head poses, and (2) estimating the complex motion distribution from given audio features to capture ambiguous audio-related expressions and head poses. To address these issues, we present a Stochastic Latent talkIng face Generation mOdel (SLIGO), which builds a deep state space model (SSM) for talking face generation. The SLIGO model captures diverse and stochastic facial dynamics via the latent motion distribution. Additionally, we devise a dynamic variational autoencoder (DVAE) method to optimize the deep SSM. This method decomposes the Evidence Lower BOund (ELBO) of the SSM into three components: a posterior for latent motion encoding, a prior for audio-driven motion prediction, and a likelihood for talking face decoding. Furthermore, we propose a novel mixer continuous normalizing flow (CNF) module to model the complex facial motion prior distribution. Experimental results demonstrate that SLIGO outperforms existing methods and achieves state-of-the-art performance.
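The three-component ELBO decomposition described in the summary can be sketched in the standard dynamical-VAE form for a deep SSM, with frames x_{1:T}, latent motions z_{1:T}, and audio features a_{1:T}. The exact conditioning structure used by SLIGO is not given in this record, so the factorization below is an assumption, a generic sketch rather than the paper's equation:

```latex
\log p(x_{1:T} \mid a_{1:T}) \;\ge\;
\underbrace{\mathbb{E}_{q}\!\Big[\textstyle\sum_{t=1}^{T} \log p(x_t \mid z_t)\Big]}_{\text{likelihood: talking face decoding}}
\;-\;
\sum_{t=1}^{T} \mathbb{E}_{q}\,
\mathrm{KL}\Big(
\underbrace{q(z_t \mid z_{1:t-1}, x_{1:T}, a_{1:T})}_{\text{posterior: latent motion encoding}}
\,\Big\|\,
\underbrace{p(z_t \mid z_{1:t-1}, a_{1:T})}_{\text{prior: audio-driven motion prediction}}
\Big)
```

Under this reading, the mixer CNF module would parameterize the audio-conditioned prior p(z_t | z_{1:t-1}, a_{1:T}) so that it can represent a complex, multimodal motion distribution rather than a fixed Gaussian.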
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3311039