
Video action recognition based on visual rhythm representation



Bibliographic Details
Published in: Journal of Visual Communication and Image Representation, 2020-08, Vol. 71, p. 102771, Article 102771
Main Authors: Moreira, Thierry Pinheiro; Menotti, David; Pedrini, Helio
Format: Article
Language: English
Description
Summary:
•Development of an efficient representation for action recognition based on spatial and temporal information.
•Proposition of an encoding method that samples and reorganizes frame pixels into a compact video description.
•Evaluation on challenging public data sets.
•Results superior or comparable to the literature.

Advances in video acquisition and storage technologies have promoted a great demand for automatic recognition of actions. The use of cameras for security and surveillance purposes has applications in several scenarios, such as airports, parks, banks, stations, roads, hospitals, supermarkets, industries, stadiums, and schools. An inherent difficulty of the problem is the complexity of the scene under usual recording conditions, which may contain complex background and motion, multiple people on the scene, interactions with other actors or objects, and camera motion. Most recent databases are built primarily from videos shared on YouTube and from movie snippets, settings in which these obstacles are unconstrained. Another difficulty is the impact of the temporal dimension, since it expands the size of the data, increasing computational cost and storage space. In this work, we present a methodology for volume description using the Visual Rhythm (VR) representation. This technique reshapes the original video volume into an image, on which two-dimensional descriptors are computed. We investigated different strategies for constructing the representation by combining configurations in several image domains and traversal directions of the video frames. From this, we propose two feature extraction methods: Naïve Visual Rhythm (Naïve VR) and Visual Rhythm Trajectory Descriptor (VRTD). The first approach is the straightforward application of the technique to the original video volume, forming a holistic descriptor that treats action events as patterns and shapes in the visual rhythm image. The second variation focuses on the analysis of small neighborhoods obtained from the dense trajectory process, which allows the algorithm to capture details missed by the global description. We tested our methods on eight public databases: one of hand gestures (SKIG), two in first person (DogCentric and JPL), and five in third person (Weizmann, KTH, MuHAVi, UCF11 and HMDB51). The results show that the developed techniques are able to extract motion elements along with shape and appearance information, achieving competitive accuracy rates compared to state-of-the-art approaches.
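The core idea of the VR representation, reshaping a video volume into a single image by sampling one line of pixels per frame, can be illustrated compactly. The Python sketch below assumes OpenCV and NumPy; the fixed central-row sampling and the function name visual_rhythm are illustrative choices, not the paper's exact configuration, which explores several image domains and traversal directions:

```python
# Minimal sketch of a Visual Rhythm image: each grayscale frame contributes
# one line of pixels (here, its central horizontal row), and the lines are
# stacked over time into a single 2D image on which ordinary image
# descriptors can then be computed. The sampling direction is one of the
# design choices the paper investigates; this example fixes it for brevity.
import cv2
import numpy as np

def visual_rhythm(video_path: str) -> np.ndarray:
    """Return a 2D array of shape (num_frames, frame_width)."""
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        rows.append(gray[gray.shape[0] // 2, :])  # central horizontal line
    cap.release()
    return np.stack(rows, axis=0)  # time runs along the vertical axis
```

In the resulting image, static regions appear as straight vertical streaks while moving objects trace slanted or curved patterns, which is what allows 2D descriptors to capture motion as well as appearance.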
ISSN: 1047-3203
EISSN: 1095-9076
DOI: 10.1016/j.jvcir.2020.102771