Bidirectional Error-Aware Fusion Network for Video Inpainting

Bibliographic Details
Published in:	IEEE transactions on circuits and systems for video technology 2024-09, p.1-1
Main Authors:	Hou, Jiacheng, Ji, Zhong, Yang, Jinyu, Zheng, Feng
Format:	Article
Language:	English
Subjects:	Circuits and systems; Computer vision; conditional video content synthesis; Matrix decomposition; Optical flow; Semantics; Three-dimensional displays; Transformers; Video inpainting; vision transformer

Description
Summary:	Existing video inpainting approaches tend to adopt vision transformers with few customized designs, which poses two limitations. First, the conventional self-attention mechanism treats tokens from invalid and valid regions equally and mixes them, which may cause blurriness. Second, these approaches employ only forward frames as references and ignore past inpainted frames, which are also valuable for enhancing temporal consistency and provide additional information. In this paper, we propose a new video inpainting network, called Bidirectional Error-Aware Fusion Network (BEAF-Net). Concretely, on the one hand, we propose a tailored Error-Aware Transformer (EAT) that distinguishes between tokens by assigning dynamic weights that restrain the use of erroneous tokens. Each EAT is also equipped with a Spatial Feature Enhancement (SFE) layer to synthesize multi-scale features. On the other hand, we apply a pair of EATs to utilize forward reference frames and past inpainted frames simultaneously, and a proposed Bidirectional Fusion (BiF) layer blends the aggregation results adaptively. By coupling these designs, BEAF-Net fully leverages location priors, multi-scale perception, and past predictions to produce more faithful and consistent inpainting results. We evaluate BEAF-Net on two commonly used video inpainting datasets, DAVIS and YouTube-VOS, where the experimental results demonstrate that BEAF-Net compares favorably with state-of-the-art solutions. Video examples can be found at https://github.com/JCATCV/BEAF-Net.
ISSN:	1051-8215 1558-2205
DOI:	10.1109/TCSVT.2024.3454641
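
Illustrative sketch: The summary describes attention that down-weights erroneous tokens and a layer that blends two aggregation results. The NumPy code below is only a rough illustration of those two ideas, not the BEAF-Net implementation. The function names, the per-token `reliability` score, the log-bias formulation, and the convex `gate` blend are all assumptions made for this example; the record does not specify how EAT or BiF compute their weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def error_aware_attention(q, k, v, reliability, eps=1e-6):
    """Hypothetical reliability-weighted attention (assumption, not the paper's EAT).

    q: (n, d) queries; k, v: (m, d) keys and values;
    reliability: (m,) scores in [0, 1], low for tokens believed erroneous.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Adding log(reliability) multiplies each key's unnormalized attention
    # weight by its reliability, so unreliable tokens contribute less.
    logits = logits + np.log(reliability + eps)[None, :]
    attn = softmax(logits, axis=-1)
    return attn @ v, attn

def fuse(forward_feat, past_feat, gate):
    """Hypothetical adaptive blend (assumption, not the paper's BiF layer).

    gate: values in [0, 1], broadcastable to the feature shape.
    """
    return gate * forward_feat + (1.0 - gate) * past_feat
```

In this sketch, one attention call would aggregate from forward reference frames and another from past inpainted frames, and `fuse` would combine the two outputs. In a trained network the `reliability` and `gate` values would presumably be predicted by learned layers rather than supplied by hand.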