Three-stream CNNs for action recognition
Published in: Pattern Recognition Letters, 2017-06, Vol. 92, pp. 33-40
Main Authors:
Format: Article
Language: English
Summary:
• A novel three-stream CNNs architecture for action feature extraction is proposed.
• An effective encoding scheme for action representation is presented.
• Promising action recognition results are achieved on challenging datasets.

Existing Convolutional Neural Networks (CNNs) based methods for action recognition are either purely spatial or only temporally local, while actions are 3D signals. In this paper, we propose a global spatial-temporal three-stream CNNs architecture that can be used for action feature extraction. Specifically, the three-stream CNNs comprise spatial, local temporal and global temporal streams, learned respectively from single frames, optical flow, and globally accumulated motion in the form of a new formulation named Motion Stacked Difference Image (MSDI). Moreover, a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) is developed to further represent the extracted features, combining the advantages of Gaussian Mixture Models (GMMs) and VLAD by encoding data according to their overall probability distribution and their differences with respect to the clustered centers. To deal with the shortage of training samples, we introduce a data augmentation scheme that is very efficient because it is based on cropping across videos. We conduct experiments on the UCF101 and HMDB51 datasets, and the results demonstrate the effectiveness of our approach.
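Example sketch (not taken from the paper itself): the soft-VLAD encoding described above weights each descriptor's residual to a cluster centre by its GMM posterior probability, instead of the hard assignment used in standard VLAD. A minimal Python sketch of that idea follows; the function name soft_vlad_encode, the descriptor dimensionality, the number of mixture components, and the power/L2 normalisation steps are illustrative assumptions, not details confirmed by the abstract.

```python
# Minimal sketch of a soft-VLAD style encoder: GMM posteriors act as soft
# assignments that weight the residuals of local descriptors to the mixture
# means. Names and parameter values are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_vlad_encode(descriptors, gmm):
    """Aggregate local descriptors (N x D) into a single K*D vector."""
    posteriors = gmm.predict_proba(descriptors)                    # (N, K) soft assignments
    residuals = descriptors[:, None, :] - gmm.means_[None, :, :]   # (N, K, D) differences to centres
    encoding = (posteriors[:, :, None] * residuals).sum(axis=0)    # (K, D) posterior-weighted aggregation
    encoding = np.sign(encoding) * np.sqrt(np.abs(encoding))       # power normalisation (common VLAD practice)
    encoding = encoding.ravel()
    return encoding / (np.linalg.norm(encoding) + 1e-12)           # L2 normalisation

# Usage with placeholder data standing in for CNN stream features:
rng = np.random.default_rng(0)
train_descriptors = rng.standard_normal((5000, 128))               # descriptors used to fit the GMM
gmm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(train_descriptors)
video_descriptors = rng.standard_normal((300, 128))                # descriptors of one video
code = soft_vlad_encode(video_descriptors, gmm)
print(code.shape)                                                  # (8192,) = 64 components x 128 dims
```

Applying such an encoder to the features of each of the three streams before fusion would be one plausible use, but that, too, is an assumption rather than a detail stated in the abstract.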
ISSN: 0167-8655, 1872-7344
DOI: 10.1016/j.patrec.2017.04.004