
Feature flow: In-network feature flow estimation for video object detection

•A shallow module is proposed to directly predict the feature flow for feature alignment within a single network.
•Self-supervised learning is introduced to further improve the quality of the predicted feature flow.
•A new state-of-the-art performance is shown in comparison with other methods, while a fast inference speed is maintained.


Bibliographic Details
Published in: Pattern recognition 2022-02, Vol.122, p.108323, Article 108323
Main Authors: Jin, Ruibing, Lin, Guosheng, Wen, Changyun, Wang, Jianliang, Liu, Fayao
Format: Article
Language: English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access: Get full text
container_start_page 108323
container_title Pattern recognition
container_volume 122
creator Jin, Ruibing
Lin, Guosheng
Wen, Changyun
Wang, Jianliang
Liu, Fayao
description •A shallow module is proposed to directly predict the feature flow for feature alignment within a single network.•Self-supervised learning is introduced to further improve the quality of the predicted feature flow.•A new state-of-the-art performance is shown in comparison with other methods, while a fast inference speed is maintained. Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to feed optical flow to a neural network and fine-tune this network on the task dataset. With this method, the fine-tuned network is expected to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce feature flow, which indicates the feature displacement. Our IFF module consists of a shallow module that shares features with the detection branches. This compact design enables our IFF-Net to detect objects accurately while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and achieves new state-of-the-art performance on ImageNet VID.
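The abstract describes aligning features across frames by warping them with a predicted feature flow (a per-pixel displacement field). The paper's actual IFF module is not reproduced in this record, so the following is only a minimal, generic sketch of the warping step that such flow-based alignment relies on, using bilinear sampling in numpy; the function name `warp_features` and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map by a per-pixel displacement field (bilinear sampling).

    feat: (C, H, W) feature map from a reference frame
    flow: (2, H, W) predicted feature flow; flow[0] = dx, flow[1] = dy
    Returns a (C, H, W) map aligned to the current frame.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source coordinates each target pixel samples from, clamped to the image
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation of the four neighbouring feature vectors
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity; a constant flow shifts the map, which is the basic mechanism any flow-guided feature-alignment scheme builds on.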
doi_str_mv 10.1016/j.patcog.2021.108323
format article
fulltext fulltext
identifier ISSN: 0031-3203
ispartof Pattern recognition, 2022-02, Vol.122, p.108323, Article 108323
issn 0031-3203
1873-5142
language eng
recordid cdi_crossref_primary_10_1016_j_patcog_2021_108323
source ScienceDirect Freedom Collection 2022-2024
subjects Deep convolutional neural network (DCNN)
Feature flow
Object detection
Video analysis
Video object detection
title Feature flow: In-network feature flow estimation for video object detection
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T11%3A23%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20flow:%20In-network%20feature%20flow%20estimation%20for%20video%20object%20detection&rft.jtitle=Pattern%20recognition&rft.au=Jin,%20Ruibing&rft.date=2022-02&rft.volume=122&rft.spage=108323&rft.pages=108323-&rft.artnum=108323&rft.issn=0031-3203&rft.eissn=1873-5142&rft_id=info:doi/10.1016/j.patcog.2021.108323&rft_dat=%3Celsevier_cross%3ES0031320321005033%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c306t-469cc3ae3b094d2834e63ee2db370a7327d0e815f595862c6c75829084e5e7903%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true