
3D Contextual Transformer & Double Inception Network for Human Action Recognition

The 3D Contextual Transformer & Double Inception Network called CoTDIL-Net is proposed for human action recognition. The spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) is proposed for enhancing the features of adjacent frames. In addition, 3D Inception and 2D Ince...

Full description

Saved in:
Bibliographic Details
Main Authors: Liu, Enqi, Hirota, Kaoru, Liu, Chang, Dai, Yaping
Format: Conference Proceeding
Language: English
Subjects:
Description
Summary: The 3D Contextual Transformer & Double Inception Network, called CoTDIL-Net, is proposed for human action recognition. A spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) is proposed to enhance the features of adjacent frames. In addition, 3D Inception and 2D Inception are combined to form a feature extractor called DIFE for capturing short-term contextual features. Moreover, an LSTM is used to obtain long-term action change features, and a multi-stream input framework is introduced to obtain fuller contextual information. Compared with single-convolution methods, the network aims to obtain multi-scale spatio-temporal features: CoT3D incorporates contextual action information, DIFE captures short-term features, and the LSTM fuses long-term features. Experiments are carried out on the KTH dataset using a laptop with 32 GB of RAM and a GeForce RTX 3070 GPU with 8 GB of memory, and the results show a recognition accuracy of 97.2%. These results indicate that the proposed CoTDIL-Net improves the convolutional structure's understanding of changes in human actions.
ISSN: 1948-9447
DOI: 10.1109/CCDC58219.2023.10326469
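
Code sketch: as a rough illustration of the pipeline described in the summary above, the following is a minimal PyTorch sketch of a CoT3D-style enrichment block, a simplified double Inception feature extractor (DIFE), and an LSTM head for long-term fusion. All layer sizes, kernel choices, and the single-stream layout are assumptions made for illustration; this is not the authors' implementation and it omits the multi-stream input framework.

# Hypothetical sketch of the CoTDIL-Net pipeline; layer sizes and kernels are assumed.
import torch
import torch.nn as nn


class CoT3D(nn.Module):
    """Simplified 3D Contextual Transformer block: static context from a k*k*k
    convolution on the keys is fused with the input to weight the values."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.key_embed = nn.Sequential(          # static contextual keys
            nn.Conv3d(channels, channels, kernel_size, padding=pad, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        self.value_embed = nn.Conv3d(channels, channels, 1, bias=False)
        self.attn = nn.Sequential(                # dynamic attention from [keys, input]
            nn.Conv3d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 1))

    def forward(self, x):                         # x: (B, C, T, H, W)
        k = self.key_embed(x)
        v = self.value_embed(x)
        w = torch.sigmoid(self.attn(torch.cat([k, x], dim=1)))
        return k + w * v                          # static + dynamic context


class DIFE(nn.Module):
    """Double Inception feature extractor: parallel 3D branches for short-term
    spatio-temporal cues, followed by a frame-wise 2D convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        b = out_ch // 2
        self.branch3x3 = nn.Conv3d(in_ch, b, 3, padding=1)
        self.branch1x1 = nn.Conv3d(in_ch, b, 1)
        self.conv2d = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):                         # x: (B, C, T, H, W)
        y = torch.cat([self.branch3x3(x), self.branch1x1(x)], dim=1)
        B, C, T, H, W = y.shape
        y = y.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        y = torch.relu(self.conv2d(y))            # 2D conv applied per frame
        return y.reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)


class CoTDILNet(nn.Module):
    """CoT3D enrichment -> DIFE short-term features -> LSTM long-term fusion."""
    def __init__(self, in_ch=3, feat_ch=64, hidden=128, num_classes=6):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, feat_ch, 3, padding=1)
        self.cot = CoT3D(feat_ch)
        self.dife = DIFE(feat_ch, feat_ch)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.lstm = nn.LSTM(feat_ch, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        x = self.dife(self.cot(self.stem(clip)))
        x = self.pool(x).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, C)
        _, (h, _) = self.lstm(x)                  # last hidden state as long-term feature
        return self.fc(h[-1])


if __name__ == "__main__":
    logits = CoTDILNet()(torch.randn(2, 3, 16, 64, 64))   # two 16-frame clips
    print(logits.shape)                           # torch.Size([2, 6]), six KTH classes

The default of six output classes matches the KTH action categories; clip length, spatial resolution, and hidden size are arbitrary placeholders.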