Self-Supervised Monocular Depth Estimation for All-Day Images Based on Dual-Axis Transformer

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-10, Vol. 34 (10), p. 9939-9953
Main Authors: Hou, Shengyu; Fu, Mengyin; Wang, Rongchuan; Yang, Yi; Song, Wenjie
Format: Article
Language: English
Description
Summary: All-day self-supervised monocular depth estimation is of strong practical significance for autonomous systems that must continuously perceive the 3D structure of the world. Night-time scenes, however, pose two challenges: weak texture caused by low illumination, and violation of the brightness-consistency assumption caused by varying lighting; as a result, most existing self-supervised models can handle only day-time scenes. To address this problem, we propose a unified self-supervised monocular depth estimation framework for all-day scenarios with three features: 1) an Illumination Compensation PoseNet (ICP), based on the classic Phong illumination theory, compensates for lighting changes between adjacent frames by estimating per-pixel transformations; 2) a Dual-Axis Transformer (DAT) block serves as the backbone of the depth encoder and infers the depth of local low-illumination areas from spatial-channel dual-dimensional global context information of night-time images; 3) a cross-layer Adaptive Fusion Module (AFM) between the DAT blocks learns attention weights for features from different layers and adaptively fuses cross-layer features with these weights, enhancing their complementarity. The method is evaluated on the RobotCar, Waymo, and KITTI datasets, achieving state-of-the-art results in both day-time and night-time scenarios.
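The abstract gives no implementation details, so the following is only a rough illustrative sketch of the "dual-axis" (spatial plus channel) attention idea it describes: self-attention is applied once over the H*W spatial positions and once over the C channels, giving each pixel access to global context along both dimensions. All class names, the head count, the feed-forward expansion factor, and the ordering of the two attention passes are assumptions for illustration, not the authors' architecture.

```python
# Illustrative sketch only (not the paper's implementation) of a dual-axis
# attention block: spatial-axis attention followed by channel-axis attention.
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Multi-head self-attention over the H*W spatial positions (tokens)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, C) with N = H*W
        q = self.norm(x)
        out, _ = self.attn(q, q, q, need_weights=False)
        return x + out                             # residual connection


class ChannelAttention(nn.Module):
    """Self-attention along the channel axis: the C channels act as tokens."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, N, C)
        y = self.norm(x).transpose(1, 2)           # (B, C, N): channels as tokens
        scale = y.shape[-1] ** -0.5                # scale by sqrt of token length
        attn = torch.softmax((y @ y.transpose(1, 2)) * scale, dim=-1)  # (B, C, C)
        out = (attn @ y).transpose(1, 2)           # back to (B, N, C)
        return x + out                             # residual connection


class DualAxisBlock(nn.Module):
    """Chains spatial and channel attention, then a small feed-forward layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = SpatialAttention(dim, heads)
        self.channel = ChannelAttention(dim)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x):                          # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)           # (B, H*W, C) token sequence
        t = self.channel(self.spatial(t))          # spatial axis, then channel axis
        t = t + self.ffn(t)
        return t.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 24, 80)             # dummy encoder feature map
    print(DualAxisBlock(dim=64)(feats).shape)      # -> torch.Size([2, 64, 24, 80])
```

In this sketch the channel pass complements the spatial pass: even where local texture is too weak for spatial attention to help, the (C, C) channel-affinity matrix still mixes information aggregated over the whole image, which is the intuition behind inferring depth in low-illumination regions from global context.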
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3406043