
Continuous frame motion sensitive self-supervised collaborative network for video representation learning

Motion, as a feature of video that changes across temporal sequences, is crucial to visual understanding. Powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments and thereby complete more complex video understanding tasks. However, previous approaches discriminate mainly on the basis of similar features in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion-sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from spatial regions by frame difference. The global–local contrastive module is then introduced, with context and enhanced video snippets defined as appropriate positive samples for a broader feature-similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning in obtaining more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our work can effectively extract robust motion features and achieve competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks.

Highlights:
• Accurate extraction of continuous motion features in complex environments.
• Acquire global and local spatio-temporal features via correlation of sequences.
• Obtain supervised information for feature discrimination in continuous clips.
• Achieve competitive performance with limited data pre-training.
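The frame-difference idea behind the motion-aware module can be illustrated with a minimal NumPy sketch. This is an illustrative reconstruction only, not the paper's implementation: the function name, the (T, H, W, C) clip layout, and the channel-averaging step are assumptions made for the example.

```python
import numpy as np

def frame_difference_motion(frames):
    """Derive a coarse motion map for a clip by absolute differencing of
    consecutive frames: input (T, H, W, C) -> output (T-1, H, W).

    Illustrative sketch of the frame-difference idea from the abstract;
    not the authors' actual motion-aware module.
    """
    frames = frames.astype(np.float32)
    # The difference between each frame and its predecessor captures change.
    diffs = np.abs(frames[1:] - frames[:-1])
    # Collapse the channel axis into a single motion-intensity map per step.
    return diffs.mean(axis=-1)

# Example: a static clip yields zero motion; a clip where a patch appears does not.
static = np.zeros((4, 8, 8, 3))
moving = static.copy()
moving[2, 2:4, 2:4, :] = 1.0  # a bright patch appears in frame 2
print(frame_difference_motion(static).max())  # 0.0
print(frame_difference_motion(moving).max())  # 1.0
```

Regions with large values in the resulting map mark where consecutive frames disagree, which is the kind of motion cue the abstract describes feeding into the contrastive comparison.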


Bibliographic Details
Published in: Advanced Engineering Informatics, 2023-04, Vol. 56, p. 101941, Article 101941
Main Authors: Bi, Shuai, Hu, Zhengping, Zhao, Mengyao, Zhang, Hehao, Di, Jirui, Sun, Zhe
Format: Article
Language:English
description Motion, as a feature of video that changes in temporal sequences, is crucial to visual understanding. The powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments to complete more complex video understanding tasks. However, previous approaches discriminate mainly based on similar features in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from the spatial regions by frame difference. The global–local contrastive module is then introduced, with context and enhanced video snippets being defined as appropriate positive samples for a broader feature similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning to obtain more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our work can effectively extract robust motion features and achieve competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks. •Accurate extraction of continuous motion features in complex environments.•Acquire global and local spatio-temporal features via correlation of sequences.•Obtain supervised information for feature discrimination in continuous clips.•Achieve competitive performance with limited data pre-training.
DOI: 10.1016/j.aei.2023.101941
ISSN: 1474-0346
EISSN: 1873-5320
Source: Elsevier
Subjects: Action recognition; Global–local contrastive learning; Pretext task; Self-supervised representation learning; Video retrieval