Continuous frame motion sensitive self-supervised collaborative network for video representation learning
Published in: | Advanced engineering informatics, 2023-04, Vol.56, p.101941, Article 101941 |
---|---|
Main Authors: | Bi, Shuai; Hu, Zhengping; Zhao, Mengyao; Zhang, Hehao; Di, Jirui; Sun, Zhe |
Format: | Article |
Language: | English |
Subjects: | Action recognition; Global–local contrastive learning; Pretext task; Self-supervised representation learning; Video retrieval |
DOI: | 10.1016/j.aei.2023.101941 |
container_start_page | 101941 |
container_title | Advanced engineering informatics |
container_volume | 56 |
creator | Bi, Shuai; Hu, Zhengping; Zhao, Mengyao; Zhang, Hehao; Di, Jirui; Sun, Zhe |
description | Motion, as a video feature that changes across temporal sequences, is crucial to visual understanding. Powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments to complete more complex video understanding tasks. However, previous approaches discriminate mainly based on feature similarity in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from spatial regions by frame difference. The global–local contrastive module is then introduced, with context and enhanced video snippets defined as appropriate positive samples for a broader feature similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning to obtain more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our work effectively extracts robust motion features and achieves competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks.
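The frame-difference step of the motion-aware module can be illustrated with a minimal sketch. The tensor layout, snippet length, and the plain absolute difference are assumptions for illustration, not the authors' exact implementation:

```python
# Hedged sketch of frame-difference motion extraction: subtracting each
# frame from its successor cancels static background and keeps moving
# regions. Shapes and the abs() choice are illustrative assumptions.
import torch

def frame_difference(clip: torch.Tensor) -> torch.Tensor:
    """clip: (batch, channels, time, height, width) video snippet."""
    # Difference along the temporal axis yields time-1 motion maps.
    return (clip[:, :, 1:] - clip[:, :, :-1]).abs()

clip = torch.rand(2, 3, 16, 112, 112)   # two random 16-frame RGB snippets
motion = frame_difference(clip)          # shape: (2, 3, 15, 112, 112)
```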
• Accurate extraction of continuous motion features in complex environments.
• Acquire global and local spatio-temporal features via correlation of sequences.
• Obtain supervised information for feature discrimination in continuous clips.
• Achieve competitive performance with limited data pre-training. |
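The global–local contrastive module compares snippet embeddings against designated positives (context and enhanced snippets). A minimal InfoNCE-style sketch of that comparison follows; the loss form and temperature are standard assumptions, not necessarily the paper's exact formulation:

```python
# Hedged InfoNCE-style contrastive loss: each snippet embedding is pulled
# toward its positive (same row index) and pushed from other batch entries.
import torch
import torch.nn.functional as F

def info_nce(snippet_emb, positive_emb, temperature=0.07):
    """snippet_emb, positive_emb: (batch, dim) embeddings."""
    snippet_emb = F.normalize(snippet_emb, dim=1)
    positive_emb = F.normalize(positive_emb, dim=1)
    logits = snippet_emb @ positive_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(snippet_emb.size(0))            # diagonal pairs
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```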
doi_str_mv | 10.1016/j.aei.2023.101941 |
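The snippet operation prediction module described in the abstract senses changes in continuous frame features by predicting what was done to a snippet. A hedged sketch of such a pretext head; the operation set and head layout are illustrative assumptions rather than the authors' design:

```python
# Hedged sketch of an operation-prediction pretext head: classify which
# temporal transform was applied to a snippet. OPS is an assumed set.
import torch
import torch.nn as nn

OPS = ["identity", "reverse", "shuffle"]       # assumed operation set

def apply_op(clip: torch.Tensor, op: int) -> torch.Tensor:
    """clip: (channels, time, height, width)."""
    if OPS[op] == "reverse":
        return clip.flip(dims=[1])             # play the snippet backwards
    if OPS[op] == "shuffle":
        return clip[:, torch.randperm(clip.size(1))]
    return clip

head = nn.Linear(128, len(OPS))                # sits on a video encoder
emb = torch.randn(4, 128)                      # placeholder embeddings
labels = torch.randint(len(OPS), (4,))
loss = nn.functional.cross_entropy(head(emb), labels)
```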
format | article |
identifier | ISSN: 1474-0346 |
ispartof | Advanced engineering informatics, 2023-04, Vol.56, p.101941, Article 101941 |
issn | 1474-0346 1873-5320 |
language | eng |
source | Elsevier |
subjects | Action recognition; Global–local contrastive learning; Pretext task; Self-supervised representation learning; Video retrieval |
title | Continuous frame motion sensitive self-supervised collaborative network for video representation learning |