
Speech driven video editing via an audio-conditioned diffusion model

Bibliographic Details
Published in: Image and Vision Computing, 2024-02, Vol. 142, Article 104911
Main Authors: Bigioi, Dan; Basak, Shubhajit; Stypułkowski, Michał; Zięba, Maciej; Jordan, Hugh; McDonnell, Rachel; Corcoran, Peter
Format: Article
Language: English
Description
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate auditory speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing. All code, datasets, and models used as part of this work are made publicly available at: https://danbigioi.github.io/DiffusionVideoEditing/.

Highlights:
• Denoising diffusion models for speech-driven video editing.
• We present a speech-conditioned diffusion model for this task.
• We demonstrate promising results on the GRID and CREMA-D datasets.
• An unstructured diffusion-based approach can generate high-quality image frames without complex loss functions.
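
The method described in the summary amounts to a standard denoising diffusion model whose noise-prediction network takes mel-spectrogram features of the driving speech as an extra conditioning input. Below is a minimal PyTorch sketch of one such audio-conditioned DDPM training step. This is not the authors' implementation: the denoiser interface, the noise schedule, and the tensor shapes are illustrative assumptions (the actual code is available at the project page linked above).

    # Minimal sketch of audio-conditioned denoising diffusion training.
    # Illustrative only; network interface and hyperparameters are assumptions.
    import torch
    import torch.nn.functional as F

    T = 1000                                  # number of diffusion steps (assumed)
    betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule (DDPM-style)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def training_step(denoiser, frames, mel, optimizer):
        """One DDPM training step on video frames conditioned on speech audio.

        frames:   (B, C, H, W) target face frames
        mel:      (B, n_mels, T_audio) mel spectrogram of the driving speech
        denoiser: network predicting the added noise, conditioned on (t, mel)
        """
        b = frames.size(0)
        t = torch.randint(0, T, (b,), device=frames.device)  # random timestep
        noise = torch.randn_like(frames)
        a_bar = alphas_cumprod.to(frames.device)[t].view(b, 1, 1, 1)
        # Forward (noising) process: interpolate between clean frame and noise.
        noisy = a_bar.sqrt() * frames + (1.0 - a_bar).sqrt() * noise
        # Audio-conditioned epsilon prediction.
        pred = denoiser(noisy, t, mel)
        # Single simple L2 objective; no auxiliary perceptual or sync losses.
        loss = F.mse_loss(pred, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

At inference, the same mel-spectrogram conditioning would be applied at every reverse-diffusion step, so the sampled frames stay synchronised with the input audio; the highlight about avoiding complex loss functions corresponds to the single MSE objective in the sketch above.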
ISSN: 0262-8856
EISSN: 1872-8138
DOI: 10.1016/j.imavis.2024.104911