Talking-head video generation with long short-term contextual semantics

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), 2025-01, Vol. 55 (2), p. 120
Format: Article
Language: English
Description
Summary: One-shot talking-head video generation uses a face-appearance source image and a series of motions extracted from driving frames to produce a coherent video. Most existing methods rely solely on the source image to generate videos over long time intervals, which leads to detail loss and distorted images due to semantics mismatch. Short-term semantics extracted from previously generated frames, which carry temporal consistency, can compensate for the mismatches in long-term semantics. In this paper, we propose a talking-head generation method that exploits long short-term contextual semantics. First, the cross-entropy between a real frame and a frame generated with long short-term semantics is mathematically modeled. Then, a novel semi-autoregressive GAN is proposed that efficiently avoids semantics mismatch by combining complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and reinforce the fusion of long- and short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms competing methods, particularly under large motion changes.
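
The semi-autoregressive rollout described in the summary can be sketched in code. The Python fragment below is an illustrative assumption, not the paper's implementation: the class SemiAutoregressiveGenerator, its layers, and the helper generate_video are hypothetical names, and motion conditioning is reduced to a placeholder argument. It only mirrors the structure the abstract describes, where each new frame is conditioned on the fixed source image (long-term semantics) and on the previously generated frame (short-term semantics), with a single convolution standing in for the short-term semantics enhancement module.

import torch
import torch.nn as nn

class SemiAutoregressiveGenerator(nn.Module):
    # Hypothetical sketch: fuses long-term semantics from the source image
    # with short-term semantics from the previously generated frame.
    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared appearance encoder; layer choices are illustrative only.
        self.encode = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        # Stand-in for the short-term semantics enhancement module: a single
        # convolution that filters noise out of the autoregressive features.
        self.enhance = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        # Decoder maps the fused long/short-term features back to an RGB frame.
        self.decode = nn.Sequential(nn.Conv2d(2 * feat_dim, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, source, prev_frame, motion):
        long_term = self.encode(source)                      # long-term semantics
        short_term = self.enhance(self.encode(prev_frame))   # short-term semantics
        # Motion conditioning is elided here; a real model would use `motion`
        # to warp or modulate the features before fusion.
        fused = torch.cat([long_term, short_term], dim=1)
        return self.decode(fused)

def generate_video(generator, source, motions):
    # Semi-autoregressive rollout: every frame sees the fixed source image
    # (long-term) and the frame generated at the previous step (short-term).
    frames, prev = [], source
    with torch.no_grad():
        for motion in motions:
            prev = generator(source, prev, motion)
            frames.append(prev)
    return frames

For instance, generate_video(gen, source_img, motion_list) would roll out one frame per driving motion; because the source image is re-fed at every step, long-term appearance detail is preserved even when the autoregressive short-term path accumulates change.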
ISSN: 0924-669X
eISSN: 1573-7497
DOI: 10.1007/s10489-024-06010-y