
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person, and a separate auditory speech recording, the lip and jaw motions are re-synchronized without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof of concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
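For readers who want a concrete picture of what "conditioning a denoising diffusion model on audio mel spectral features" means in practice, the following is a minimal illustrative sketch of one training step in PyTorch. The module layout, tensor shapes, noise schedule, and hyperparameters are assumptions made for demonstration only; they do not reproduce the authors' architecture or data pipeline.

```python
# Illustrative sketch only: a toy audio-conditioned denoising diffusion
# training step. All names, shapes, and hyperparameters are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                          # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class AudioConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a (flattened) video frame,
    given the noisy frame, the diffusion timestep, and per-frame
    mel-spectrogram audio features as the conditioning signal."""
    def __init__(self, frame_dim=64 * 64 * 3, mel_dim=80, hidden=512):
        super().__init__()
        self.audio_proj = nn.Linear(mel_dim, hidden)   # embed audio condition
        self.time_proj = nn.Linear(1, hidden)          # embed timestep
        self.net = nn.Sequential(
            nn.Linear(frame_dim + hidden + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, noisy_frame, t, mel):
        a = self.audio_proj(mel)
        tt = self.time_proj(t.float().unsqueeze(-1) / T)
        x = torch.cat([noisy_frame.flatten(1), a, tt], dim=-1)
        return self.net(x)

model = AudioConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on random stand-in data; a real setup would pair video
# frames with time-aligned mel-spectrogram features of the driving speech.
frames = torch.rand(8, 64 * 64 * 3)      # batch of target frames, flattened
mel = torch.rand(8, 80)                  # per-frame mel-spectrogram features
t = torch.randint(0, T, (8,))            # random diffusion timesteps
noise = torch.randn_like(frames)
a_bar = alphas_cumprod[t].unsqueeze(-1)
noisy = a_bar.sqrt() * frames + (1 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)

opt.zero_grad()
pred = model(noisy, t, mel)
loss = F.mse_loss(pred, noise)           # standard epsilon-prediction objective
loss.backward()
opt.step()
```

At inference, the same conditioning signal would steer iterative denoising from pure noise toward frames whose mouth region matches the driving audio; this sketch only shows the training objective.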

Bibliographic Details
Published in: arXiv.org 2023-05
Main Authors: Bigioi, Dan, Basak, Shubhajit, Stypułkowski, Michał, Zięba, Maciej, Jordan, Hugh, McDonnell, Rachel, Corcoran, Peter
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Conditioning; Diffusion; Editing; Lip reading; Noise reduction; Speech; Three dimensional models
Online Access: Get full text