Spatio-Temporal Transformer for Online Video Understanding
Published in: Journal of Physics: Conference Series, 2022-01, Vol. 2171(1), p. 012020
Main Authors:
Format: Article
Language: English
Summary: Leading methods in online video understanding extract useful information from the spatial and temporal dimensions of an input video, but they suffer from two problems: (1) they extract only local video information and cannot relate it to important features of the temporal context in the video; (2) although some methods can quickly process the information in each frame, their efficiency over the whole video is poor, so they cannot be applied to online video understanding tasks. This article introduces a Transformer-based network that considers both spatial and temporal content while processing each video quickly. Our approach can efficiently handle up to 170 videos, with hundreds of frames per second, for action classification, and runs 10 to 90 times faster than existing methods on action classification datasets.
ISSN: 1742-6588, 1742-6596
DOI: 10.1088/1742-6596/2171/1/012020
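The abstract describes a Transformer that attends over both the spatial and the temporal dimensions of a video. As the paper's exact architecture is not given in this record, the sketch below is only a minimal illustration of one common way to build such a block (divided temporal/spatial self-attention in PyTorch); the class name `SpatioTemporalBlock`, the token shape, and all dimensions are illustrative assumptions, not the authors' design.

```python
# Illustrative sketch only: a divided space/time attention block in PyTorch.
# This is an assumption about how a spatio-temporal Transformer block can look,
# not the architecture described in the paper.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Temporal self-attention across frames, then spatial self-attention
    within each frame, followed by a position-wise MLP."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # x: (batch, frames, patches, dim) -- patch tokens for each frame
        b, t, p, d = x.shape

        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: patches within the same frame attend to each other.
        xs = x.reshape(b * t, p, d)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, p, d)

        # Position-wise feed-forward network with residual connection.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    block = SpatioTemporalBlock()
    clip = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 patches, 256-dim tokens
    print(block(clip).shape)  # torch.Size([2, 8, 49, 256])
```

Factorizing attention into a temporal pass followed by a spatial pass keeps the cost linear in the number of frames times patches, which is one way such models reach the per-video throughput the abstract reports.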