
Stochastic Latent Talking Face Generation Toward Emotional Expressions and Head Poses


Bibliographic Details
Published in: IEEE transactions on circuits and systems for video technology 2024-04, Vol.34 (4), p.2734-2748
Main Authors: Sheng, Zhicheng, Nie, Liqiang, Zhang, Min, Chang, Xiaojun, Yan, Yan
Format: Article
Language: English
Abstract: Current talking face generation methods have achieved promising lip-synchronization results, while still struggling to generate talking face videos that exhibit emotional expressions and head poses. Studies in psychology have demonstrated that people may manifest diverse facial animations that follow a time-varying distribution. This presents two stochastic challenges that make generating appropriate emotional expressions and head poses difficult: (1) modelling the time-varying distribution of facial deformations to synthesize the stochastic dynamics of emotional expressions and head poses, and (2) estimating the complex motion distribution from given audio features to capture ambiguous audio-related expressions and head poses. To address these issues, we present a Stochastic Latent talkIng face Generation mOdel (SLIGO), which builds a deep state space model (SSM) for talking face generation. SLIGO captures diverse and stochastic facial dynamics via the latent motion distribution. Additionally, we devise a dynamic variational autoencoder (DVAE) method to optimize the deep SSM. This method decomposes the Evidence Lower BOund (ELBO) of the SSM into three components: a posterior for latent motion encoding, a prior for audio-driven motion prediction, and a likelihood for talking face decoding. Furthermore, we propose a novel mixer continuous normalizing flow (CNF) module to model the complex facial motion prior distribution. Experimental results demonstrate that SLIGO outperforms existing methods and achieves state-of-the-art performance.
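The DVAE objective named in the abstract — a posterior, a prior, and a likelihood obtained by decomposing the ELBO of a deep SSM — can be illustrated with a toy, scalar version of one per-frame term. This is a minimal sketch under simplifying assumptions, not the paper's architecture: the latent-motion posterior and the audio-driven prior are taken to be scalar Gaussians, the face decoder is a stand-in function, and the expectation is approximated by a single reparameterized sample; the names `kl_gauss`, `log_gauss`, `elbo_term`, and `decode` are all hypothetical.

```python
import math
import random

random.seed(0)  # reproducible single-sample estimate

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def log_gauss(x, mu, var):
    """Log-density log N(x; mu, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def elbo_term(x_t, posterior, prior, decode, dec_var=0.1):
    """One per-frame ELBO contribution:
    E_q[log p(x_t | z_t)] - KL( q(z_t | past, frames) || p(z_t | past, audio) ),
    with the expectation approximated by a single reparameterized sample."""
    mu_q, var_q = posterior          # encoder output (latent motion posterior)
    mu_p, var_p = prior              # audio-driven motion prediction
    z_t = mu_q + math.sqrt(var_q) * random.gauss(0.0, 1.0)  # z_t ~ q
    recon = log_gauss(x_t, decode(z_t), dec_var)            # face-decoding likelihood
    return recon - kl_gauss(mu_q, var_q, mu_p, var_p)

# Toy usage: identity decoder, posterior near the observed frame, broad audio prior.
value = elbo_term(x_t=0.5, posterior=(0.4, 0.01), prior=(0.0, 1.0), decode=lambda z: z)
```

Training would maximize the sum of such terms over all frames; in the paper's setting, the simple Gaussian prior here is replaced by the learned mixer CNF over facial motion.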
DOI: 10.1109/TCSVT.2023.3311039
Publisher: IEEE, New York
CODEN: ITCTEM
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL) Journals
Subjects: Computational modeling
continuous normalizing flow
dynamic variational autoencoders
Dynamics
Emotion recognition
emotional expressions
Face recognition
Head
Lower bounds
Mixers
State space models
Stochastic processes
Synchronism
Synchronization
Talking
Talking face generation