Talking-head video generation with long short-term contextual semantics
Published in: | Applied Intelligence (Dordrecht, Netherlands), 2025-01, Vol. 55 (2), p. 120 |
---|---|
Format: | Article |
Language: | English |
Summary: | One-shot talking-head video generation takes a face-appearance source image and a series of motions extracted from driving frames and produces a coherent video. Most existing methods rely solely on the source image when generating frames over long time intervals, which leads to detail loss and distorted images due to semantic mismatch. Short-term semantics extracted from previously generated frames, which are temporally consistent, can compensate for the mismatches of the long-term semantics. In this paper, we propose a talking-head generation method that utilizes long short-term contextual semantics. First, the cross-entropy between real frames and frames generated with long short-term semantics is modeled mathematically. Then, a novel semi-autoregressive GAN is proposed that efficiently avoids semantic mismatch by combining complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and to reinforce the fusion of the long short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms other methods, particularly under large motion changes. |
ISSN: | 0924-669X; 1573-7497 |
DOI: | 10.1007/s10489-024-06010-y |
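
The summary describes a semi-autoregressive pipeline: long-term semantics are extracted once from the single source image, short-term semantics are extracted autoregressively from previously generated frames, and an enhancement module fuses the two while suppressing autoregressive noise. The sketch below illustrates only that control flow, under assumptions; the class and module names (`SemiAutoregressiveGenerator`, `appearance_enc`, `short_term_enc`, `enhance`, `generator`) are hypothetical placeholders, not the paper's actual components.

```python
# Hypothetical sketch of the semi-autoregressive loop described in the
# summary. All module names are placeholders, not the paper's components.
import torch
import torch.nn as nn

class SemiAutoregressiveGenerator(nn.Module):
    def __init__(self, appearance_enc: nn.Module, short_term_enc: nn.Module,
                 enhance: nn.Module, generator: nn.Module):
        super().__init__()
        self.appearance_enc = appearance_enc  # long-term semantics from the source image
        self.short_term_enc = short_term_enc  # short-term semantics from prior outputs
        self.enhance = enhance                # fuses semantics, suppresses autoregressive noise
        self.generator = generator            # renders one frame from fused semantics + motion

    def forward(self, source_img: torch.Tensor,
                motions: list[torch.Tensor]) -> torch.Tensor:
        # Long-term semantics are extracted once and reused for every frame.
        long_sem = self.appearance_enc(source_img)
        frames, prev = [], source_img
        for motion in motions:  # one motion descriptor per driving frame
            # Short-term semantics come autoregressively from the last output.
            short_sem = self.short_term_enc(prev)
            fused = self.enhance(long_sem, short_sem)
            frame = self.generator(fused, motion)
            frames.append(frame)
            prev = frame.detach()  # keep gradients from spanning the whole sequence
        return torch.stack(frames, dim=1)  # (batch, time, channels, height, width)
```

The `detach()` call is one plausible way to bound training cost in an autoregressive pipeline; whether the method trains through the full generated sequence instead is not stated in this record.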