
Feature flow: In-network feature flow estimation for video object detection

•A shallow module is proposed to directly predict the feature flow for feature alignment within a single network.
•Self-supervised learning is introduced to further improve the quality of the predicted feature flow.
•A new state-of-the-art performance is shown in comparison with other methods, while a fast inference speed is maintained.


Bibliographic Details
Published in: Pattern recognition 2022-02, Vol.122, p.108323, Article 108323
Main Authors: Jin, Ruibing, Lin, Guosheng, Wen, Changyun, Wang, Jianliang, Liu, Fayao
Format: Article
Language: English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access: Get full text
container_start_page 108323
container_title Pattern recognition
container_volume 122
creator Jin, Ruibing
Lin, Guosheng
Wen, Changyun
Wang, Jianliang
Liu, Fayao
description •A shallow module is proposed to directly predict the feature flow for feature alignment within a single network.•Self-supervised learning is introduced to further improve the quality of the predicted feature flow.•A new state-of-the-art performance is shown in comparison with other methods, while a fast inference speed is maintained. Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to feed optical flow to a neural network and fine-tune this network on the task dataset. With this method, the fine-tuned network is expected to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce feature flow, which indicates the feature displacement. Our IFF module consists of a shallow module that shares features with the detection branches. This compact design enables our IFF-Net to detect objects accurately while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and achieves new state-of-the-art performance on ImageNet VID.
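The abstract describes aligning features across frames by warping them with a predicted feature flow (a per-pixel displacement field). The paper's actual IFF module is not reproduced in this record, so the following is only a minimal, generic sketch of the warping step that such flow-based alignment relies on, using bilinear sampling in numpy; the function name `warp_features` and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map by a per-pixel displacement field (bilinear sampling).

    feat: (C, H, W) feature map from a reference frame
    flow: (2, H, W) predicted feature flow; flow[0] = dx, flow[1] = dy
    Returns a (C, H, W) map aligned to the current frame.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source coordinates each target pixel samples from, clamped to the image
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation of the four neighbouring feature vectors
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity; a constant flow shifts the map, which is the basic mechanism any flow-guided feature-alignment scheme builds on.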
doi_str_mv 10.1016/j.patcog.2021.108323
format article
fulltext fulltext
identifier ISSN: 0031-3203
ispartof Pattern recognition, 2022-02, Vol.122, p.108323, Article 108323
issn 0031-3203
1873-5142
language eng
recordid cdi_crossref_primary_10_1016_j_patcog_2021_108323
source ScienceDirect Freedom Collection 2022-2024
subjects Deep convolutional neural network (DCNN)
Feature flow
Object detection
Video analysis
Video object detection
title Feature flow: In-network feature flow estimation for video object detection
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T11%3A23%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20flow:%20In-network%20feature%20flow%20estimation%20for%20video%20object%20detection&rft.jtitle=Pattern%20recognition&rft.au=Jin,%20Ruibing&rft.date=2022-02&rft.volume=122&rft.spage=108323&rft.pages=108323-&rft.artnum=108323&rft.issn=0031-3203&rft.eissn=1873-5142&rft_id=info:doi/10.1016/j.patcog.2021.108323&rft_dat=%3Celsevier_cross%3ES0031320321005033%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c306t-469cc3ae3b094d2834e63ee2db370a7327d0e815f595862c6c75829084e5e7903%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true