
Three-stream CNNs for action recognition

Bibliographic Details
Published in: Pattern Recognition Letters, June 2017, Vol. 92, pp. 33-40
Main Authors: Wang, Liangliang; Ge, Lianzheng; Li, Ruifeng; Fang, Yajun
Format: Article
Language: English
Description
Summary:
• A novel three-stream CNN architecture for action feature extraction is proposed.
• An effective encoding scheme for action representation is presented.
• Promising action recognition results on challenging datasets.

Existing Convolutional Neural Network (CNN) based methods for action recognition are either purely spatial or only locally temporal, whereas actions are 3D signals. In this paper, we propose a global spatial-temporal three-stream CNN architecture for action feature extraction. Specifically, the three-stream CNN comprises spatial, local temporal, and global temporal streams, learned respectively from single frames, optical flow, and globally accumulated motion in the form of a new formulation named the Motion Stacked Difference Image (MSDI). Moreover, a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) is developed to further represent the extracted features, combining the advantages of Gaussian Mixture Models (GMMs) and VLAD by encoding data according to their overall probability distribution and their differences with respect to the cluster centers. To deal with the shortage of training samples, we introduce an efficient data augmentation scheme based on cropping across videos. We conduct experiments on the UCF101 and HMDB51 datasets, and the results demonstrate the effectiveness of our approach.
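
As an illustration of the two components named in the abstract, the short Python sketch below (NumPy and scikit-learn) shows one plausible reading of an accumulated frame-difference motion image and of a soft-VLAD encoding that weights VLAD residuals by GMM posteriors. This is not the authors' implementation: the exact MSDI formulation, the normalisation steps, and the function names stacked_difference_image and soft_vlad are illustrative assumptions.

# Minimal sketch (not the authors' code) of an accumulated frame-difference
# motion image and a GMM-posterior-weighted VLAD ("soft-VLAD") encoding.
# Names, normalisations, and the MSDI formula here are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture


def stacked_difference_image(frames):
    """Accumulate absolute frame-to-frame differences into one motion map.

    frames: array of shape (T, H, W), a grayscale clip.
    Returns an (H, W) map highlighting accumulated motion, one plausible
    reading of the paper's Motion Stacked Difference Image (MSDI).
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    msdi = diffs.sum(axis=0)
    return msdi / (msdi.max() + 1e-8)  # scale to [0, 1]


def soft_vlad(features, gmm):
    """Encode local features as GMM-posterior-weighted VLAD residuals.

    features: (N, D) local descriptors (e.g. CNN activations).
    gmm: a fitted GaussianMixture with K components.
    For each component k, v_k = sum_i q_ik * (x_i - mu_k), followed by
    power- and L2-normalisation as is common for VLAD/Fisher encodings.
    """
    posteriors = gmm.predict_proba(features)                      # (N, K) soft assignments
    residuals = features[:, None, :] - gmm.means_[None, :, :]     # (N, K, D)
    v = (posteriors[:, :, None] * residuals).sum(axis=0)          # (K, D)
    v = np.sign(v) * np.sqrt(np.abs(v))                           # power normalisation
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-8)                         # L2 normalisation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((16, 64, 64))                 # toy "video"
    print(stacked_difference_image(frames).shape)     # (64, 64)

    feats = rng.random((500, 32))                     # toy local features
    gmm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(feats)
    print(soft_vlad(feats, gmm).shape)                # (256,)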
ISSN: 0167-8655
EISSN: 1872-7344
DOI: 10.1016/j.patrec.2017.04.004