Talking Face Generation for Impression Conversion Considering Speech Semantics
This study investigates the talking face generation method to convert a speaker's video to give a target impression, such as "favorable" or "considerate". Such an impression conversion method needs to consider the input speech semantics because they affect the impression of a speaker's video along with the facial expression.
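The abstract's key technical step is deriving a semantic vector from BERT embeddings of the speech recognition results, which then conditions the facial expression conversion model. The paper's own implementation is not reproduced in this record; the sketch below only illustrates that step under stated assumptions (the Hugging Face `bert-base-uncased` checkpoint and mean pooling are illustrative choices, not necessarily the authors' model or pooling strategy).

```python
# Minimal sketch of the semantic-vector step described in the abstract:
# embed an ASR transcript with BERT to obtain one utterance-level vector.
# The checkpoint and mean pooling are assumptions for illustration only.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def semantic_vector(asr_transcript: str) -> torch.Tensor:
    """Map a speech recognition hypothesis to a single semantic vector."""
    inputs = tokenizer(asr_transcript, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one 768-dim utterance vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = semantic_vector("thank you so much, that really helps")
print(vec.shape)  # torch.Size([768])
```

Such a vector would then condition the expression conversion alongside the phonetic speech features, so the generated facial expressions can reflect what is being said, not only how it sounds.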
Main Authors: | Mizuno, Saki; Hojo, Nobukatsu; Shinoda, Kazutoshi; Suzuki, Keita; Ihori, Mana; Sato, Hiroshi; Tanaka, Tomohiro; Kawata, Naotaka; Kobashikawa, Satoshi; Masumura, Ryo |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Estimation; Face recognition; Impression Conversion; Keypoint; Rendering (computer graphics); Semantics; Signal processing; Speech recognition; Talking Face Generation; Vectors |
Online Access: | Request full text |
container_end_page | 8415 |
---|---|
container_start_page | 8411 |
creator | Mizuno, Saki; Hojo, Nobukatsu; Shinoda, Kazutoshi; Suzuki, Keita; Ihori, Mana; Sato, Hiroshi; Tanaka, Tomohiro; Kawata, Naotaka; Kobashikawa, Satoshi; Masumura, Ryo |
description | This study investigates the talking face generation method to convert a speaker's video to give a target impression, such as "favorable" or "considerate". Such an impression conversion method needs to consider the input speech semantics because they affect the impression of a speaker's video along with the facial expression. Conventional emotional talking face generation methods utilize speech information to synchronize the lip and speech of the output video. However, they cannot consider speech semantics because the speech representations contain only phonetic information. To solve this problem, we propose a facial expression conversion model that uses a semantic vector obtained from BERT embeddings of speech recognition results of input speech. We first constructed an audio-visual dataset with impression labels assigned to each utterance. The evaluation results based on the dataset showed that the proposed method could improve the estimation accuracy of the facial expressions of the target video. |
doi_str_mv | 10.1109/ICASSP48485.2024.10446947 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2379-190X |
ispartof | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.8411-8415 |
issn | 2379-190X |
language | eng |
recordid | cdi_ieee_primary_10446947 |
source | IEEE Xplore All Conference Series |
subjects | Estimation; Face recognition; Impression Conversion; Keypoint; Rendering (computer graphics); Semantics; Signal processing; Speech recognition; Talking Face Generation; Vectors |
title | Talking Face Generation for Impression Conversion Considering Speech Semantics |