
F2D-SIFPNet: a frequency 2D Slow-I-Fast-P network for faster compressed video action recognition

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), 2024-04, Vol. 54 (7), p. 5197-5215
Main Authors: Ming, Yue, Zhou, Jiangwan, Jia, Xia, Zheng, Qingfang, Xiong, Lu, Feng, Fan, Hu, Nannan
Format: Article
Language:English
Description
Summary: Recent video action recognition methods use RGB pixels obtained directly in the compressed domain, avoiding the cumbersome decoding process of traditional methods and enabling efficient recognition. However, these methods must convert discrete cosine transform (DCT) frequency coefficients into an extended RGB pixel representation, which is heavily time-consuming. To alleviate this drawback, a novel frequency 2D Slow-I-Fast-P network (F2D-SIFPNet) is proposed that significantly accelerates action recognition. First, a new Frequency-Domain Partial Decompression (FPDec) method extracts the frequency-domain DCT coefficients directly from the compressed video, eliminating the final, time-consuming decoding stage in FFmpeg. Next, a Frequency-Domain Channel Selection (FCS) strategy down-samples the frequency-domain data, increasing the saliency of the input. In addition, the Frequency Slow-I-Fast-P path (FSIFP) and the Adaptive Motion Excitation (AME) module emphasize the significant frequency components: FSIFP efficiently models slow spatial features and fast temporal changes simultaneously, while AME generates an adaptive convolution kernel that captures both long-term and short-term motion cues. Extensive experiments on four public datasets, Kinetics-700, Kinetics-400, UCF-101, and HMDB-51, yielded superior accuracies of 55.6%, 74.0%, 96.3%, and 74.6%, respectively, with preprocessing 6.31 times faster.
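The channel-selection idea in the abstract can be illustrated with a minimal sketch. The paper's actual FCS strategy is learned and is not specified here; the code below only shows the common baseline it builds on: keeping the lowest-frequency DCT coefficients of each 8×8 block in zigzag order, which concentrates most of the signal energy into a few channels. All function names and the choice of k are illustrative assumptions.

```python
import numpy as np

def zigzag_indices(n=8):
    # (row, col) coordinates of an n x n block in standard JPEG zigzag order:
    # anti-diagonals of increasing index sum, alternating traversal direction.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def select_dct_channels(blocks, k=16):
    # blocks: array of shape (num_blocks, 8, 8) of DCT coefficients.
    # Keep the k lowest-frequency coefficients per block (zigzag order),
    # returning shape (num_blocks, k) -- a simple stand-in for a
    # frequency-domain down-sampling step like FCS.
    zz = zigzag_indices(8)[:k]
    rows = [r for r, _ in zz]
    cols = [c for _, c in zz]
    return blocks[:, rows, cols]

blocks = np.random.randn(10, 8, 8)
selected = select_dct_channels(blocks, k=16)
print(selected.shape)  # (10, 16)
```

Dropping high-frequency channels this way shrinks the network input by 4× per block while retaining the coefficients that carry most of the energy in natural images.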
ISSN:0924-669X
1573-7497
DOI:10.1007/s10489-024-05408-y