
MTSCANet: Multi temporal resolution temporal semantic context aggregation network

Bibliographic Details
Published in: IET Computer Vision, 2023-04, Vol. 17 (3), pp. 366-378
Main Authors: Zhang, Haiping; Ma, Conghao; Yu, Dongjin; Guan, Liming; Wang, Dongjing; Hu, Zepeng; Liu, Xu
Format: Article
Language: English
Subjects: Algorithms; computer vision; Context; Convolution; convolutional neural nets; learning (artificial intelligence); Localization; Modules; neural net architecture; Neural networks; Representations; Semantics; Temporal resolution
Description: Temporal action localisation is a challenging task, and video context is crucial for localising actions. Most existing methods that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a multi-temporal-resolution pyramid structure model is proposed. Firstly, a temporal-semantic context aggregation module (TSCF) is designed to assign different attention weights to temporal contexts and combine them with multi-level semantics into video features. Secondly, to address the large differences in time span between actions in a video, a local-global attention module is designed that combines local and global temporal dependencies at each temporal point, yielding a more flexible and robust representation of contextual relations. The redundant representation of the convolution kernel is reduced by modifying the convolution, and the computational budget is redeployed at a fine granularity. To verify the effectiveness of the model, extensive experiments are performed on three challenging datasets. On THUMOS14, the best performance is obtained at IoU@0.3–0.6, with an average mAP of 47.02%; on ActivityNet-1.3, an average mAP of 34.94% is obtained; and on HACS, an average mAP of 28.46% is achieved. In short, the model uses a multi-temporal-resolution pyramid structure, aggregates temporal and semantic contextual information, and balances local and global information through an attention mechanism.
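
The abstract describes the architecture only at a high level, but the multi-temporal-resolution pyramid idea is concrete enough to sketch: the clip-feature sequence is repeatedly downsampled along the temporal axis so that each pyramid level observes actions at a different temporal scale. Below is a minimal, purely illustrative PyTorch sketch; the module name `TemporalPyramid` and all hyperparameters are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    """Hypothetical sketch: builds progressively coarser temporal views of a
    clip-feature sequence with strided 1D convolutions, one per level."""

    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        self.downsamplers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels - 1)
        )

    def forward(self, x: torch.Tensor) -> list:
        # x: (batch, channels, T) clip features from a video backbone
        pyramid = [x]
        for conv in self.downsamplers:
            x = torch.relu(conv(x))  # halves the temporal length each level
            pyramid.append(x)
        return pyramid

feats = torch.randn(2, 256, 128)       # 128 temporal points, 256-d features
levels = TemporalPyramid(256)(feats)
print([f.shape[-1] for f in levels])   # [128, 64, 32, 16]
```

Strided convolution is one common way to halve temporal resolution per level; the paper may well use a different downsampling operator.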
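The local-global attention module is likewise described only as combining local and global temporal dependencies at each temporal point. One plausible, hypothetical realisation uses a windowed depthwise convolution for the local branch, ordinary self-attention for the global branch, and a learned per-channel gate to balance the two:

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Hypothetical sketch: fuses a windowed (local) view and a
    sequence-wide (global) view of each temporal point via a learned gate."""

    def __init__(self, channels: int, window: int = 9):
        super().__init__()
        # Local branch: depthwise conv restricts context to a short window.
        self.local = nn.Conv1d(channels, channels, kernel_size=window,
                               padding=window // 2, groups=channels)
        # Global branch: self-attention over every temporal point.
        self.attn = nn.MultiheadAttention(channels, num_heads=1,
                                          batch_first=True)
        # Per-channel balance between local and global context.
        self.gate = nn.Parameter(torch.full((channels,), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T)
        local = self.local(x)
        seq = x.transpose(1, 2)             # (batch, T, channels) for attention
        glob, _ = self.attn(seq, seq, seq)
        glob = glob.transpose(1, 2)         # back to (batch, channels, T)
        g = torch.sigmoid(self.gate).view(1, -1, 1)
        return x + g * local + (1.0 - g) * glob  # residual fusion

out = LocalGlobalAttention(256)(torch.randn(2, 256, 128))
print(out.shape)  # torch.Size([2, 256, 128]), same shape as the input
```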
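Finally, the reported "average mAP in IoU@0.3–0.6" is simply the mean of the per-threshold mAP values. The per-threshold numbers below are invented for illustration only, chosen to average to roughly the paper's 47% figure:

```python
# Mean of per-IoU-threshold mAPs; the values here are made up, not the paper's.
def average_map(map_per_iou: dict) -> float:
    return sum(map_per_iou.values()) / len(map_per_iou)

print(average_map({0.3: 0.58, 0.4: 0.51, 0.5: 0.44, 0.6: 0.35}))  # 0.47
```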
DOI: 10.1049/cvi2.12163
ISSN: 1751-9632
EISSN: 1751-9640
Source: IET Digital Library; Wiley Open Access journals