
Can relearning local representation help small networks for human pose estimation?

Bibliographic Details
Published in:	Neurocomputing (Amsterdam) 2023-01, Vol.518, p.418-430
Main Authors:	Xu, Dingning; Guo, Lijun; Zhang, Rong; Qian, Jiangbo; Gao, Shangce
Format:	Article
Language:	English
Subjects:	Convolutional neural network; Human pose estimation; Integrated attention; Layer-channel mixed attention; Receptive fields

Description
Summary:	Human pose estimation is a special detection task for small object localization. It requires considering not only global structure but also local, fine detail, because body poses vary and scenes are complex. However, because of its sliding-window learning mechanism, a convolutional neural network (CNN) can only see spatial information within a receptive field of a specific size at a given layer. As the network deepens and the receptive field grows, the network gradually focuses on global spatial information and loses its perception of local features. To give a deep CNN the ability to relearn local information for structure analysis in deeper layers, we propose a layer-channel mixed attention mechanism, named integrated attention, that can be flexibly embedded into a CNN. Multiple features from previous layers are aggregated to build attention that observes different ranges of spatial structure simultaneously. Through integrated attention, the network can observe the interdependence between local structures across different receptive fields and learn more clues, which enhances its expressive power for feature learning. Extensive experiments show that the integrated attention mechanism benefits human pose estimation. In particular, integrated attention helps small networks achieve more accurate predictions and even outperform larger ones with less computation and fewer parameters. Compared with other attention and keypoint refinement modules, it yields a larger and more stable improvement.
ISSN:	0925-2312 1872-8286
DOI:	10.1016/j.neucom.2022.11.025
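
The summary describes aggregating features from several earlier layers to build attention over the channels of a deeper layer. The record does not say how the aggregation or gating is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes global average pooling per layer, concatenation of the pooled descriptors, and a small two-layer gate in the style of squeeze-and-excitation. All function names, weight shapes, and feature-map sizes are hypothetical.

```python
import math
import random


def channel_means(feature_map):
    """Global average pooling: one mean per channel of a C x H x W nested list."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(weights, vector):
    """Matrix-vector product for a list-of-rows weight matrix (no bias)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def mixed_attention(earlier_features, current, w1, w2):
    """Gate the channels of `current` using descriptors pooled from earlier layers.

    Assumed design (not taken from the paper): pool each earlier layer,
    concatenate, then apply ReLU and sigmoid layers to get one gate per channel.
    """
    descriptor = [m for f in earlier_features for m in channel_means(f)]
    hidden = [max(h, 0.0) for h in dense(w1, descriptor)]            # ReLU
    gates = [1.0 / (1.0 + math.exp(-z)) for z in dense(w2, hidden)]  # sigmoid
    reweighted = [
        [[g * x for x in row] for row in channel]
        for g, channel in zip(gates, current)
    ]
    return reweighted, gates


def random_map(rng, channels, height, width):
    """Hypothetical stand-in for a CNN feature map."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]


rng = random.Random(0)
shallow = random_map(rng, 2, 8, 8)  # earlier layer, small receptive field
middle = random_map(rng, 3, 4, 4)   # intermediate layer
deep = random_map(rng, 4, 2, 2)     # deeper layer whose channels are gated

# Untrained, randomly initialised gate weights: 5 pooled inputs -> 3 hidden -> 4 gates.
w1 = [[rng.gauss(0.0, 0.1) for _ in range(5)] for _ in range(3)]
w2 = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(4)]

reweighted, gates = mixed_attention([shallow, middle], deep, w1, w2)
print(len(gates), len(reweighted), len(reweighted[0]), len(reweighted[0][0]))
```

In this sketch the deeper layer keeps its shape and each of its channels is scaled by a gate between 0 and 1 that depends on statistics from shallower layers, which is one way to let features with smaller receptive fields influence deeper ones.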