Human Action Recognition Methods Based on CNNs for RGB Video Input
Format: Conference Proceeding
Language: English
Summary: Human action recognition is a complex problem that attracts increasing interest from the scientific community due to its applicability in domains such as security and behavior analysis. At its core, the problem entails classifying an action into a finite set of classes. Neural network based approaches, and especially convolutional neural networks, are a good starting point for human action recognition: by their nature they capture spatio-temporal features well, making them well suited to sequences of RGB images. This paper proposes three types of convolutional neural network architectures for human action recognition. The first is based on 2D kernels, the second on 3D kernels, and the third on TCN (Temporal Convolutional Network) units. Each is presented with its structure, advantages, and disadvantages, along with metrics that measure its performance. The model based on 2D convolutions is the fastest, but it also has the lowest performance. The 3D model is a good middle ground, useful in situations that require a fast classifier operating on different action classes. Finally, the TCN-based model performs close to some of the best existing models and represents a viable solution to the proposed problem: it can classify many actions in real time, using only RGB images of fairly low resolution. The three models were tested on the RGB part of the NTU RGB+D dataset. The 2D convolution-based model obtained an accuracy of 7.43% on the Cross-Subject split and 10.28% on the Cross-View split. The 3D convolution-based model obtained 58.77% on Cross-Subject and 56.11% on Cross-View. Finally, the TCN-based model obtained an accuracy of 80.45% on Cross-Subject and 82.57% on Cross-View.
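To make the TCN approach concrete, the sketch below shows the general pattern such a model follows: a small 2D CNN extracts per-frame spatial features, and stacked dilated 1D convolutions with residual connections relate those features across frames. This is a minimal illustration, not the paper's actual architecture; the backbone layers, feature width, number of temporal blocks, and 112×112 input resolution are all assumptions made for the example. Only `num_classes=60` is grounded in the abstract, matching the NTU RGB+D action classes.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated 1D convolution over the frame axis, with a residual
    connection. Causal padding is added on the left and trimmed after
    the convolution so each output frame sees only past frames."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (batch, channels, frames)
        out = self.conv(x)[:, :, :-self.pad]     # trim to restore length
        return self.relu(out + x)                # residual connection

class TCNActionClassifier(nn.Module):
    def __init__(self, num_classes=60, feat_dim=128):
        super().__init__()
        # Illustrative per-frame 2D feature extractor (the paper's
        # backbone may differ).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Stacked TCN units with exponentially growing dilation.
        self.temporal = nn.Sequential(
            *[TemporalBlock(feat_dim, dilation=2 ** i) for i in range(3)]
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                     # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.spatial(clip.flatten(0, 1))        # (b*t, feat_dim)
        feats = feats.view(b, t, -1).transpose(1, 2)    # (b, feat_dim, t)
        return self.head(self.temporal(feats).mean(dim=2))

# Example: a batch of two 16-frame low-resolution RGB clips.
logits = TCNActionClassifier()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 60])
```

The appeal of this design, and plausibly why the abstract reports the TCN variant running in real time, is that dilated convolutions grow the temporal receptive field exponentially with depth, so distant frames can interact without the sequential cost of recurrent layers.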
ISSN: 2379-0482
DOI: 10.1109/CSCS52396.2021.00026