
Continuous frame motion sensitive self-supervised collaborative network for video representation learning

Motion, as a feature of video that changes across temporal sequences, is crucial to visual understanding. Powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments and thereby complete more complex video understanding tasks. However, previous approaches discriminate mainly on the basis of similar features in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion-sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from spatial regions by frame difference. The global–local contrastive module is then introduced, with context and enhanced video snippets defined as appropriate positive samples for a broader feature-similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning in obtaining more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our work can effectively extract robust motion features and achieve competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks.

Highlights:
• Accurate extraction of continuous motion features in complex environments.
• Acquire global and local spatio-temporal features via correlation of sequences.
• Obtain supervised information for feature discrimination in continuous clips.
• Achieve competitive performance with limited data pre-training.
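The frame-difference idea behind the motion-aware module can be illustrated with a minimal NumPy sketch. This is an illustrative reconstruction only, not the paper's implementation: the function name, the (T, H, W, C) clip layout, and the channel-averaging step are assumptions made for the example.

```python
import numpy as np

def frame_difference_motion(frames):
    """Derive a coarse motion map for a clip by absolute differencing of
    consecutive frames: input (T, H, W, C) -> output (T-1, H, W).

    Illustrative sketch of the frame-difference idea from the abstract;
    not the authors' actual motion-aware module.
    """
    frames = frames.astype(np.float32)
    # The difference between each frame and its predecessor captures change.
    diffs = np.abs(frames[1:] - frames[:-1])
    # Collapse the channel axis into a single motion-intensity map per step.
    return diffs.mean(axis=-1)

# Example: a static clip yields zero motion; a clip where a patch appears does not.
static = np.zeros((4, 8, 8, 3))
moving = static.copy()
moving[2, 2:4, 2:4, :] = 1.0  # a bright patch appears in frame 2
print(frame_difference_motion(static).max())  # 0.0
print(frame_difference_motion(moving).max())  # 1.0
```

Regions with large values in the resulting map mark where consecutive frames disagree, which is the kind of motion cue the abstract describes feeding into the contrastive comparison.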


Bibliographic Details
Published in: Advanced Engineering Informatics, 2023-04, Vol. 56, p. 101941, Article 101941
Main Authors: Bi, Shuai, Hu, Zhengping, Zhao, Mengyao, Zhang, Hehao, Di, Jirui, Sun, Zhe
Format: Article
Language:English
description Motion, as a feature of video that changes in temporal sequences, is crucial to visual understanding. The powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments to complete more complex video understanding tasks. However, previous approaches discriminate mainly based on similar features in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from the spatial regions by frame difference. The global–local contrastive module is then introduced, with context and enhanced video snippets being defined as appropriate positive samples for a broader feature similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning to obtain more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our work can effectively extract robust motion features and achieve competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks. •Accurate extraction of continuous motion features in complex environments.•Acquire global and local spatio-temporal features via correlation of sequences.•Obtain supervised information for feature discrimination in continuous clips.•Achieve competitive performance with limited data pre-training.
DOI: 10.1016/j.aei.2023.101941
ISSN: 1474-0346
EISSN: 1873-5320
Source: Elsevier
Subjects: Action recognition; Global–local contrastive learning; Pretext task; Self-supervised representation learning; Video retrieval