Two-stage 3D object detection guided by position encoding

Bibliographic Details
Published in: Neurocomputing (Amsterdam), 2022-08, Vol. 501, p. 811-821
Main Authors: Xu, Wanpeng, Zou, Ling, Fu, Zhipeng, Wu, Lingda, Qi, Yue
Format: Article
Language: English
Description
Summary: Voxel-based structures in 3D detection have advanced rapidly due to their superior capability for feature extraction. However, detection accuracy is often limited because the point cloud is divided into a discrete grid. To overcome this problem and improve detection accuracy, we propose a flexible two-stage 3D object detection architecture that adopts two branches to refine the generated proposals, aggregating voxel features and raw point features simultaneously. We also design a new gating mechanism to fuse features from different levels. In addition, we propose a novel feature aggregation module to reduce the semantic gap between the two types of features. First, a transformer operating on raw points is employed as an encoder to aggregate contextual information. Then, a point-based channel-wise self-attention mechanism serves as a decoder to aggregate global features. Experimental results on the KITTI 3D dataset and the Waymo Open Dataset demonstrate that our approach outperforms state-of-the-art methods and exhibits excellent scalability.
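To make the described fusion step concrete, the sketch below shows one plausible way a gating mechanism could combine voxel features and raw point features for the same set of proposals. This is a minimal illustration assuming a simple per-channel sigmoid gate and arbitrary feature dimensions; the module name, projection layers, and gating formula are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of gated fusion between voxel-branch and raw-point-branch
# features (PyTorch). Dimensions and gating formula are illustrative assumptions.
import torch
import torch.nn as nn


class GatedFeatureFusion(nn.Module):
    """Fuse two feature streams with a learned per-channel gate."""

    def __init__(self, voxel_dim: int, point_dim: int, out_dim: int):
        super().__init__()
        self.voxel_proj = nn.Linear(voxel_dim, out_dim)   # project voxel-branch features
        self.point_proj = nn.Linear(point_dim, out_dim)   # project raw-point-branch features
        self.gate = nn.Sequential(                        # gate computed from both streams
            nn.Linear(2 * out_dim, out_dim),
            nn.Sigmoid(),
        )

    def forward(self, voxel_feat: torch.Tensor, point_feat: torch.Tensor) -> torch.Tensor:
        v = self.voxel_proj(voxel_feat)            # (N, out_dim)
        p = self.point_proj(point_feat)            # (N, out_dim)
        g = self.gate(torch.cat([v, p], dim=-1))   # (N, out_dim), values in (0, 1)
        return g * v + (1.0 - g) * p               # convex per-channel combination


# Example usage with random proposal features.
fusion = GatedFeatureFusion(voxel_dim=128, point_dim=64, out_dim=128)
fused = fusion(torch.randn(100, 128), torch.randn(100, 64))
print(fused.shape)  # torch.Size([100, 128])
```

The convex combination keeps the fused feature on the same scale as its inputs; other choices (e.g., additive gating or attention over more than two levels) would fit the abstract's description equally well.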
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2022.06.030