
Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

Bibliographic Details
Published in: IEEE Sensors Journal, 2023-01, Vol. 23 (2), pp. 1436-1446
Main Authors: Zhao, Hongru; Qiao, Xiuquan; Ma, Yi; Tafazolli, Rahim
Format: Article
Language:English
Summary: Self-supervised monocular depth and visual odometry (VO) are often cast as coupled tasks: accurate depth contributes to precise pose estimation, and vice versa. Existing architectures typically stack convolution layers and long short-term memory (LSTM) units to capture long-range dependencies. However, their intrinsic locality prevents the model from achieving the expected performance gains. In this article, we propose a Transformer-based architecture, named Transformer-based self-supervised monocular depth and VO (TSSM-VO), to tackle these problems. It comprises two main components: 1) a depth generator that leverages the powerful capability of multihead self-attention (MHSA) to model long-range spatial dependencies and 2) a pose estimator built upon a Transformer to learn long-range temporal correlations of image sequences. Moreover, a new data augmentation loss based on structural similarity (SSIM) is introduced to further constrain the structural similarity between the augmented depth and the augmented predicted depth. Rigorous ablation studies and exhaustive performance comparisons on the KITTI and Make3D datasets demonstrate the superiority of TSSM-VO over other self-supervised methods. We expect TSSM-VO to enhance the ability of intelligent agents to understand their surrounding environments.
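
The summary describes two technical ingredients that lend themselves to a brief illustration: a Transformer-based pose estimator over per-frame features and an SSIM-based data augmentation loss. The sketch below is a minimal PyTorch rendering of those ideas, not the authors' implementation: layer sizes, the pairing of augmented quantities in the loss, and all names (TemporalPoseEstimator, augmentation_loss) are assumptions for illustration. The SSIM term follows the common Monodepth-style 3x3-window formulation; one plausible reading of the augmentation loss is a consistency term between the depth predicted from an augmented image and the augmented version of the original prediction.

```python
# Hedged sketch of a Transformer pose estimator and an SSIM-based
# augmentation loss, loosely following the TSSM-VO abstract. All
# hyperparameters and names are illustrative assumptions.
import torch
import torch.nn as nn


class SSIM(nn.Module):
    """Per-pixel structural dissimilarity over 3x3 windows (Monodepth-style)."""

    def __init__(self):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1)
        self.pad = nn.ReflectionPad2d(1)
        self.C1, self.C2 = 0.01 ** 2, 0.03 ** 2

    def forward(self, x, y):
        x, y = self.pad(x), self.pad(y)
        mu_x, mu_y = self.pool(x), self.pool(y)
        sigma_x = self.pool(x * x) - mu_x ** 2
        sigma_y = self.pool(y * y) - mu_y ** 2
        sigma_xy = self.pool(x * y) - mu_x * mu_y
        num = (2 * mu_x * mu_y + self.C1) * (2 * sigma_xy + self.C2)
        den = (mu_x ** 2 + mu_y ** 2 + self.C1) * (sigma_x + sigma_y + self.C2)
        # Dissimilarity in [0, 1]; 0 means structurally identical.
        return torch.clamp((1 - num / den) / 2, 0, 1)


def augmentation_loss(depth_from_aug_image, aug_of_predicted_depth):
    """Assumed pairing: depth predicted from an augmented input vs. the
    augmentation applied to the original predicted depth."""
    return SSIM()(depth_from_aug_image, aug_of_predicted_depth).mean()


class TemporalPoseEstimator(nn.Module):
    """Transformer encoder over per-frame feature vectors, regressing a
    6-DoF relative pose (rotation + translation) for each frame pair."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pose_head = nn.Linear(feat_dim, 6)

    def forward(self, frame_feats):              # (B, T, feat_dim)
        temporal = self.encoder(frame_feats)     # long-range temporal attention
        return self.pose_head(temporal[:, 1:])   # (B, T-1, 6) relative poses
```

As a usage example, frame_feats would come from any image encoder (e.g. a CNN or MHSA-based backbone) applied to a short image sequence, and the augmentation loss would be added to the usual photometric self-supervision; the weighting between the terms is not specified here.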
ISSN: 1530-437X, 1558-1748
DOI: 10.1109/JSEN.2022.3227017