MTSCANet: Multi temporal resolution temporal semantic context aggregation network

Bibliographic Details
Published in: IET Computer Vision, April 2023, Vol. 17 (3), pp. 366-378
Main Authors: Zhang, Haiping, Ma, Conghao, Yu, Dongjin, Guan, Liming, Wang, Dongjing, Hu, Zepeng, Liu, Xu
Format: Article
Language: English
Description
Summary: Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing approaches that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a multi-temporal resolution pyramid structure model is proposed. Firstly, a temporal-semantic context aggregation module (TSCF) is designed to assign different attention weights to temporal contexts and combine them with multi-level semantics into video features. Secondly, to address the large differences in time span between different actions in a video, a local-global attention module is designed that combines local and global temporal dependencies at each temporal point, yielding a more flexible and robust representation of contextual relations. The redundant representation of the convolution kernel is reduced by modifying the convolution, and computing power is redeployed at a fine granularity. To verify the effectiveness of the model, extensive experiments are performed on three challenging datasets. On THUMOS14, the best performance is obtained at IoU@0.3–0.6, with an average mAP of 47.02%. On ActivityNet-1.3, an average mAP of 34.94% is obtained, and on HACS, an average mAP of 28.46% is achieved. In summary, the model uses a multi-temporal resolution pyramid structure, aggregates temporal and semantic contextual information, and balances local and global information through an attention mechanism.
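The local-global attention idea in the abstract can be pictured with a short sketch. The PyTorch snippet below is an illustrative assumption, not the authors' implementation: the class and parameter names (LocalGlobalAttention, window, gate) are hypothetical, and pairing a windowed depthwise convolution (local context) with multi-head self-attention over all time steps (global context), mixed by a learned per-channel gate, is one plausible reading of "combining local and global temporal dependencies for each temporal point".

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Illustrative sketch only. Mixes local (windowed) and global
    temporal context over features of shape (batch, time, channels).
    Names and structure are assumptions, not the paper's module."""

    def __init__(self, channels: int, window: int = 9, heads: int = 4):
        super().__init__()
        # Local branch: depthwise temporal convolution captures nearby context.
        self.local = nn.Conv1d(channels, channels, kernel_size=window,
                               padding=window // 2, groups=channels)
        # Global branch: multi-head self-attention over all temporal points.
        self.global_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Learned per-channel gate balances the two branches.
        self.gate = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.global_attn(x, x, x)
        g = torch.sigmoid(self.gate)  # per-channel mixing weight in (0, 1)
        return x + g * local + (1.0 - g) * global_

# Usage: 100 temporal points, 256-d features per point.
feats = torch.randn(2, 100, 256)
out = LocalGlobalAttention(256)(feats)
print(out.shape)  # torch.Size([2, 100, 256])
```

The residual sum with a gated mixture lets each channel lean on short-range or long-range dependencies as needed, which matches the abstract's goal of a flexible, robust contextual representation across actions of very different durations.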
ISSN: 1751-9632, 1751-9640
DOI: 10.1049/cvi2.12163