
Multi-level channel attention excitation network for human action recognition in videos

Bibliographic Details
Published in: Signal Processing: Image Communication, 2023-05, Vol. 114, p. 116940, Article 116940
Main Authors: Wu, Hanbo; Ma, Xin; Li, Yibin
Format: Article
Language:English
Description
Summary: The channel attention mechanism has continuously attracted strong interest and shown great potential in enhancing the performance of deep CNNs. However, when applied to the video-based human action recognition task, most existing methods learn channel attention at the frame level, which ignores temporal dependencies and may limit recognition performance. In this paper, we propose a novel multi-level channel attention excitation (MCAE) module to model temporal-related channel attention at both the frame and video levels. Specifically, based on video convolutional feature maps, frame-level channel attention (FCA) is generated by exploring time-channel correlations, and video-level channel attention (VCA) is generated by aggregating global motion variations. MCAE first recalibrates video feature responses with frame-wise FCA, and then activates the motion-sensitive channel features with motion-aware VCA. The MCAE module learns channel discriminability at multiple levels and can act as a guide that facilitates efficient spatiotemporal feature modeling in the activated motion-sensitive channels. It can be flexibly embedded into 2D networks at very limited extra computational cost to construct MCAE-Net, which effectively enhances the spatiotemporal representation of 2D models for the video action recognition task. Extensive experiments on five human action datasets show that our method achieves superior or very competitive performance compared with state-of-the-art methods, demonstrating its effectiveness for improving human action recognition.

Highlights:
• Learning temporal-related channel attention at both frame and video levels.
• Multi-level channel attention activates discriminative action-related channels.
• Efficient spatiotemporal feature modeling in motion-salient feature channels.
• Spatiotemporal dependency together with channel attention for action recognition.
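The two-level gating described in the summary (frame-wise FCA recalibration followed by motion-aware VCA activation) can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's actual implementation: the learned layers, pooling choices, and the `mcae_sketch` function name are all assumptions for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mcae_sketch(feats):
    """Illustrative two-level channel attention over video features.

    feats: video convolutional feature maps of shape (T, C, H, W),
           i.e. T frames, C channels, spatial size H x W.
    Returns recalibrated features of the same shape.
    (Hypothetical simplification: real MCAE uses learned transforms.)
    """
    T, C, H, W = feats.shape

    # Frame-level channel attention (FCA): squeeze each frame's spatial
    # dimensions into a (T, C) descriptor, then gate channels per frame.
    frame_desc = feats.mean(axis=(2, 3))            # (T, C)
    fca = sigmoid(frame_desc)                        # per-frame channel gates
    out = feats * fca[:, :, None, None]              # frame-wise recalibration

    # Video-level channel attention (VCA): aggregate global motion
    # variation via frame differences into one gate per channel.
    motion = np.abs(np.diff(frame_desc, axis=0))     # (T-1, C) variations
    vca = sigmoid(motion.mean(axis=0))               # (C,) motion-aware gates
    out = out * vca[None, :, None, None]             # activate motion channels
    return out
```

Because both gates are simple element-wise multiplications over the channel axis, such a module adds very little computation relative to the backbone, consistent with the summary's claim that MCAE can be embedded into 2D networks at limited extra cost.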
ISSN: 0923-5965, 1879-2677
DOI: 10.1016/j.image.2023.116940