
Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2018-08, Vol. 28 (8), p. 1839-1849
Main Authors: Zhao, Shichao, Liu, Yanbin, Han, Yahong, Hong, Richang, Hu, Qinghua, Tian, Qi
Format: Article
Language:English
Description
Summary: Deep ConvNets have shown good performance in image classification tasks. However, problems remain in deep video representations for action recognition. On one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capability to capture complex video action information; on the other hand, the temporal information of videos is not properly utilized to pool and encode video sequences. To address these issues, this paper utilizes two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet [1]) and the temporal net from Two-Stream ConvNets [2], for action representation. The convolutional layers and a proposed new layer, called the frame-diff layer, are extracted and pooled with two temporal pooling strategies: Trajectory pooling and Line pooling. The pooled local descriptors are then encoded with the vector of locally aggregated descriptors (VLAD) [3] to form the video representations. To verify the effectiveness of the proposed framework, experiments are conducted on the UCF101 and HMDB51 data sets. The framework achieves an accuracy of 92.08% on UCF101, which is the state of the art, and an accuracy of 65.62% on HMDB51, which is comparable to the state of the art. In addition, the proposed Line pooling strategy speeds up feature extraction while achieving performance comparable to Trajectory pooling.
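The encoding step named in the abstract (VLAD over pooled local descriptors) can be illustrated with a minimal generic sketch. This is not the authors' implementation: the function name, the codebook source, and the normalization scheme (signed square root followed by global L2, a common choice for VLAD) are assumptions for illustration only.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Encode a set of local descriptors into a single VLAD vector.

    descriptors: (N, D) array of local features (e.g. pooled conv activations)
    centers:     (K, D) codebook of visual words (e.g. k-means cluster centers)
    returns:     (K * D,) normalized VLAD vector
    """
    # Hard-assign each descriptor to its nearest codebook center
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            # Accumulate residuals between descriptors and their assigned center
            vlad[k] = np.sum(members - centers[k], axis=0)

    vlad = vlad.flatten()
    # Power normalization (signed square root), then global L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

With K centers and D-dimensional descriptors the representation is K*D-dimensional, so the pooled per-video descriptors become a single fixed-length vector suitable for a linear classifier.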
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2017.2682196