
Talking Face Generation for Impression Conversion Considering Speech Semantics

This study investigates a talking face generation method that converts a speaker's video to give a target impression, such as "favorable" or "considerate". Such an impression conversion method needs to consider the semantics of the input speech, because they affect the impression of a speaker's video along with the facial expression. Conventional emotional talking face generation methods use speech information to synchronize the lips of the output video with the speech, but they cannot account for speech semantics because their speech representations carry only phonetic information. To solve this problem, we propose a facial expression conversion model that uses a semantic vector obtained from BERT embeddings of the speech recognition results of the input speech. We first constructed an audio-visual dataset with an impression label assigned to each utterance. Evaluation results on this dataset showed that the proposed method improves the estimation accuracy of the facial expressions of the target video.
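
The pipeline the abstract describes (transcribe the input speech, embed the transcript with BERT, and condition facial expression conversion on the resulting semantic vector) can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the choice of bert-base-multilingual-cased, the mean-pooling step, and the conversion_model interface are all assumptions.

    # Illustrative sketch (not the paper's code): derive a semantic vector
    # from an ASR transcript with BERT, as the abstract describes.
    import torch
    from transformers import AutoTokenizer, AutoModel

    # Assumed model choice; the paper does not specify which BERT variant.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def semantic_vector(asr_transcript: str) -> torch.Tensor:
        """Mean-pool BERT token embeddings of an ASR transcript (assumed pooling)."""
        inputs = tokenizer(asr_transcript, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)           # (768,) semantic vector

    # The vector would then condition a facial expression conversion model
    # alongside the usual phonetic features (hypothetical interface):
    #   expression_params = conversion_model(video_frames, phonetic_feats,
    #                                        semantic_vector(transcript))
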

Bibliographic Details
Main Authors: Mizuno, Saki, Hojo, Nobukatsu, Shinoda, Kazutoshi, Suzuki, Keita, Ihori, Mana, Sato, Hiroshi, Tanaka, Tomohiro, Kawata, Naotaka, Kobashikawa, Satoshi, Masumura, Ryo
Format: Conference Proceeding
Language: English
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 8411-8415
DOI: 10.1109/ICASSP48485.2024.10446947
EISSN: 2379-190X
EISBN: 9798350344851
Subjects: Estimation; Face recognition; Impression Conversion; Keypoint; Rendering (computer graphics); Semantics; Signal processing; Speech recognition; Talking Face Generation; Vectors
Source: IEEE Xplore All Conference Series
Online Access: Request full text