
Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Bibliographic Details
Published in:Scientific reports 2024-10, Vol.14 (1), p.26202-17, Article 26202
Main Authors: Weng, Zhengkui, Li, Xinmin, Xiong, Shoujian
Format: Article
Language:English
Description
Summary:In the field of human action recognition, it is a long-standing challenge to characterize video-level spatio-temporal features effectively. This is attributable in part to the inability of CNNs to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with a self-attention model is developed to aggregate informative deep features across the video according to adaptively selected deep features. Moreover, an overall automatic approach to adaptive video sequences optimization (AVSO) is proposed through shot segmentation and dynamic weighted sampling; AVSO increases the proportion of action-related frames and eliminates redundant intervals. Then, based on the optimized video, a self-attention model is introduced in AST-VLAD to model the intrinsic spatio-temporal relationships of deep features, instead of aggregating frame-level features by average or max pooling. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101, for evaluation. Compared with existing frameworks, results show that the proposed approach performs as well as or better in classification accuracy on both the HMDB51 (73.1%) and UCF101 (96.0%) datasets.
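The following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the general idea described in the summary: replacing average pooling with a self-attention step when aggregating frame-level deep features into a video-level descriptor. The module name, feature dimensions, and the simple mean over attended features are illustrative assumptions only.

```python
# Hypothetical sketch of attention-based aggregation of frame-level features.
# Not the AST-VLAD implementation; dimensions and module names are assumed.
import torch
import torch.nn as nn


class AttentionAggregator(nn.Module):
    """Aggregates T frame-level feature vectors into one video-level vector
    using scaled dot-product self-attention, as a generic stand-in for the
    self-attention model described in the paper."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, dim) frame-level deep features from a CNN backbone
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        ctx = attn @ v            # (batch, T, dim) context-aware frame features
        return ctx.mean(dim=1)    # (batch, dim) video-level descriptor


if __name__ == "__main__":
    frames = torch.randn(2, 16, 512)   # 2 clips, 16 sampled frames, 512-d features
    agg = AttentionAggregator(512)
    print(agg(frames).shape)           # torch.Size([2, 512]) attention-based aggregate
    print(frames.mean(dim=1).shape)    # plain average-pooling baseline for comparison
```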
ISSN:2045-2322
DOI:10.1038/s41598-024-75640-6