3D Contextual Transformer & Double Inception Network for Human Action Recognition
Main Authors:
Format: Conference Proceeding
Language: English
Online Access: Request full text
Summary: The 3D Contextual Transformer & Double Inception Network, called CoTDIL-Net, is proposed for human action recognition. A spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) is proposed to enhance the features of adjacent frames. In addition, 3D Inception and 2D Inception are combined into a feature extractor, called DIFE, that captures short-term contextual features. Moreover, an LSTM is used to obtain long-term action-change features, and a multi-stream input framework is introduced to obtain fuller contextual information. Compared with single-convolution methods, the network aims to obtain multi-scale spatio-temporal features: CoT3D combines contextual action information, DIFE captures short-term features, and the LSTM fuses long-term features. Experiments are carried out on a laptop with 32 GB of RAM and a GeForce RTX 3070 8 GB GPU using the KTH dataset, and the results show a recognition accuracy of 97.2%. The obtained results indicate that the proposed CoTDIL-Net improves the convolutional structure's understanding of changes in human actions.
ISSN: 1948-9447
DOI: 10.1109/CCDC58219.2023.10326469
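
The abstract only outlines how CoT3D, DIFE, and the LSTM are composed. The PyTorch sketch below is a hypothetical reading of that composition for a single input stream; the layer widths, kernel sizes, fusion scheme, and the class definitions as written here are assumptions rather than details taken from the paper (num_classes=6 reflects the six KTH action classes).

```python
# Hypothetical sketch of the CoTDIL-Net composition described in the abstract.
# Layer sizes, kernel choices, and the fusion scheme are assumptions, not taken
# from the paper.
import torch
import torch.nn as nn


class CoT3D(nn.Module):
    """Simplified 3D Contextual Transformer block: static context from a 3x3x3
    convolution over keys is fused with a dynamically weighted value map."""
    def __init__(self, channels):
        super().__init__()
        self.key_embed = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1, groups=4, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        self.value_embed = nn.Sequential(
            nn.Conv3d(channels, channels, 1, bias=False), nn.BatchNorm3d(channels))
        self.attention = nn.Sequential(
            nn.Conv3d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 1))

    def forward(self, x):
        k = self.key_embed(x)                      # static contextual keys
        v = self.value_embed(x)
        attn = self.attention(torch.cat([k, x], dim=1))
        return k + attn.sigmoid() * v              # fuse static and dynamic context


class DIFE(nn.Module):
    """Double Inception feature extractor (simplified): parallel 3D convolution
    branches over short clips, followed by a 2D branch applied per frame."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        branch = out_channels // 2
        self.inception3d_a = nn.Conv3d(in_channels, branch, 1)
        self.inception3d_b = nn.Conv3d(in_channels, branch, 3, padding=1)
        self.inception2d = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x):                          # x: (B, C, T, H, W)
        x = torch.cat([self.inception3d_a(x), self.inception3d_b(x)], dim=1)
        b, c, t, h, w = x.shape
        x = self.inception2d(x.transpose(1, 2).reshape(b * t, c, h, w))
        return x.reshape(b, t, c, h, w).transpose(1, 2)


class CoTDILNet(nn.Module):
    def __init__(self, num_classes=6, channels=32):
        super().__init__()
        self.stem = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.cot3d = CoT3D(channels)               # spatio-temporal enrichment
        self.dife = DIFE(channels, 64)             # short-term contextual features
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.lstm = nn.LSTM(64, 128, batch_first=True)  # long-term action changes
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        x = self.dife(self.cot3d(self.stem(clip)))
        x = self.pool(x).squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, 64)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])                      # classify from last hidden state


if __name__ == "__main__":
    logits = CoTDILNet()(torch.randn(2, 3, 16, 64, 64))   # two 16-frame clips
    print(logits.shape)                                   # torch.Size([2, 6])
```

The multi-stream input framework mentioned in the abstract would presumably run several such branches (for example, RGB frames and frame differences) and fuse their clip-level features before the classifier; it is omitted here to keep the sketch short.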