
Talking Face Generation for Impression Conversion Considering Speech Semantics

This study investigates a talking face generation method that converts a speaker's video to give a target impression, such as "favorable" or "considerate". Such an impression conversion method needs to consider the semantics of the input speech, because they affect the impression of a speaker's video along with the facial expression. Conventional emotional talking face generation methods use speech information to synchronize the lips of the output video with the speech, but they cannot account for speech semantics because their speech representations carry only phonetic information. To solve this problem, we propose a facial expression conversion model that uses a semantic vector obtained from BERT embeddings of the speech recognition results of the input speech. We first constructed an audio-visual dataset with an impression label assigned to each utterance. Evaluation results on this dataset showed that the proposed method improves the estimation accuracy of the facial expressions of the target video.
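
The pipeline the abstract describes (transcribe the input speech, embed the transcript with BERT, and condition facial expression conversion on the resulting semantic vector) can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the choice of bert-base-multilingual-cased, the mean-pooling step, and the conversion_model interface are all assumptions.

    # Illustrative sketch (not the paper's code): derive a semantic vector
    # from an ASR transcript with BERT, as the abstract describes.
    import torch
    from transformers import AutoTokenizer, AutoModel

    # Assumed model choice; the paper does not specify which BERT variant.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def semantic_vector(asr_transcript: str) -> torch.Tensor:
        """Mean-pool BERT token embeddings of an ASR transcript (assumed pooling)."""
        inputs = tokenizer(asr_transcript, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)           # (768,) semantic vector

    # The vector would then condition a facial expression conversion model
    # alongside the usual phonetic features (hypothetical interface):
    #   expression_params = conversion_model(video_frames, phonetic_feats,
    #                                        semantic_vector(transcript))
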

Bibliographic Details
Main Authors: Mizuno, Saki, Hojo, Nobukatsu, Shinoda, Kazutoshi, Suzuki, Keita, Ihori, Mana, Sato, Hiroshi, Tanaka, Tomohiro, Kawata, Naotaka, Kobashikawa, Satoshi, Masumura, Ryo
Format: Conference Proceeding
Language: English
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 8411-8415
DOI: 10.1109/ICASSP48485.2024.10446947
EISSN: 2379-190X
EISBN: 9798350344851
Subjects: Estimation; Face recognition; Impression Conversion; Keypoint; Rendering (computer graphics); Semantics; Signal processing; Speech recognition; Talking Face Generation; Vectors
Source: IEEE Xplore All Conference Series
Online Access: Request full text