Graph-Based CNNs With Self-Supervised Module for 3D Hand Pose Estimation From Monocular RGB

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2021-04, Vol. 31 (4), p. 1514-1525
Main Authors: Guo, Shaoxiang; Rigall, Eric; Qi, Lin; Dong, Xinghui; Li, Haiyan; Dong, Junyu
Format: Article
Language:English
Description
Summary: Hand pose estimation in 3D space from a single RGB image is a highly challenging problem due to self-geometric ambiguities, diverse textures, viewpoints, and self-occlusions. Existing work shows that a network structure with multi-scale resolution subnets fused in parallel can more effectively improve the spatial accuracy of 2D pose estimation. Nevertheless, the features extracted by traditional convolutional neural networks cannot efficiently express the unique topological structure of the hand key points, which are discrete yet correlated. Applications of hand pose estimation built on traditional convolutional neural networks have demonstrated that the structural similarity between a graph and the hand key points can improve the accuracy of 3D hand pose regression. In this paper, we design and implement an end-to-end network for predicting 3D hand pose from a single RGB image. We first extract feature maps at multiple resolutions and fuse them in parallel, then model a graph-based convolutional neural network (GCN) module to predict the initial 3D hand key points. Next, we use 2D spatial relationships and 3D geometric knowledge to build a self-supervised module that eliminates the domain gap between 2D and 3D space. Finally, the final 3D hand pose is computed by averaging the 3D hand poses output by the GCN module and by the self-supervised module. We evaluate the proposed method on two challenging benchmark datasets for 3D hand pose estimation. Experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on both datasets.
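The abstract's core idea, propagating per-joint features over the hand's skeletal graph with a GCN to regress initial 3D key points, can be sketched as follows. This is a minimal illustration under assumed shapes (a standard 21-joint hand skeleton, 64-dimensional joint features) and is not the authors' implementation; the helper names are hypothetical.

```python
import numpy as np

NUM_JOINTS = 21  # typical hand skeleton: wrist + 4 joints per finger

# Bone connectivity (parent, child) for a 21-joint hand: the wrist (0)
# connects to each finger base, and each finger is a chain of 4 joints.
BONES = [(0, b) for b in (1, 5, 9, 13, 17)] + [
    (i, i + 1) for base in (1, 5, 9, 13, 17) for i in (base, base + 1, base + 2)
]

def normalized_adjacency(num_nodes, edges):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, the standard GCN propagation matrix."""
    a = np.eye(num_nodes)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def gcn_layer(x, w, a_hat):
    """One graph-convolution step: aggregate neighbor features, project, ReLU."""
    return np.maximum(a_hat @ x @ w, 0.0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((NUM_JOINTS, 64))  # per-joint features from the CNN backbone (assumed)
w1 = rng.standard_normal((64, 3)) * 0.1        # learned projection to 3D coordinates
a_hat = normalized_adjacency(NUM_JOINTS, BONES)
pose3d = gcn_layer(feats, w1, a_hat)           # (21, 3) initial 3D key points
print(pose3d.shape)
```

Because the propagation matrix encodes the bone structure, each joint's prediction is informed by its skeletal neighbors, which is the structural prior the abstract credits for improving 3D regression accuracy.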
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2020.3004453