Temporal Feature Fusion for 3D Detection in Monocular Video

Previous monocular 3D detection works focus on single-frame input in both training and inference. In real-world applications, temporal and motion information naturally exists in monocular video; it is valuable for 3D detection but under-explored in monocular works. In this paper, we propose a straightforward and effective method for temporal feature fusion that has low computational cost and excellent transferability, making it readily applicable to various monocular models. Specifically, with the help of optical flow, we transform the backbone features produced by prior frames and fuse them into the current frame. We introduce a scene feature propagation mechanism that accumulates historical scene features without extra time cost; in this process, occluded areas are removed via forward-backward scene consistency. Our method naturally introduces valuable temporal features, facilitating 3D reasoning in monocular 3D detection, while the accumulated historical scene features mitigate the heavy computation overhead of video processing. Experiments on various baselines demonstrate that the proposed method is model-agnostic and brings significant improvements to multiple types of single-frame methods.

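The mechanism the abstract describes can be made concrete with a short sketch: warp the previous frame's backbone features into the current frame using optical flow, mask pixels that fail a forward-backward consistency check (likely occlusions), and blend the rest into the current features. This is a minimal illustration under stated assumptions, not the authors' implementation; the function names (flow_warp, fuse_temporal), the averaging fusion, and the threshold tol are all hypothetical.

# Hypothetical sketch of flow-guided temporal feature fusion (PyTorch).
import torch
import torch.nn.functional as F

def flow_warp(x: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample x (B, C, H, W) at locations displaced by flow (B, 2, H, W).

    With flow defined from the current frame to the previous frame, this
    warps previous-frame backbone features into current-frame coordinates.
    """
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # absolute coords
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(x, grid_norm, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

def fuse_temporal(feat_cur, feat_prev, flow_bw, flow_fw, tol=1.5, alpha=0.5):
    """Fuse warped previous-frame features into the current frame.

    flow_bw: flow from current to previous frame (used for warping).
    flow_fw: flow from previous to current frame (used only for the
             forward-backward consistency check that masks occlusions).
    """
    warped = flow_warp(feat_prev, flow_bw)
    # Forward-backward check: the forward flow sampled at the matched
    # previous-frame location should cancel the backward flow.
    fw_at_match = flow_warp(flow_fw, flow_bw)
    err = torch.norm(flow_bw + fw_at_match, dim=1, keepdim=True)
    valid = (err < tol).to(feat_cur.dtype)  # 0 where likely occluded
    # Blend history into the current features only where the match is valid.
    return valid * (alpha * feat_cur + (1.0 - alpha) * warped) \
        + (1.0 - valid) * feat_cur

Reading the abstract, the propagation step would apply this fusion recursively, so each frame warps the already-accumulated scene features once rather than reprocessing a window of past frames; that is what keeps the per-frame cost low.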

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2024-01, Vol. 33, pp. 2665-2675
Main Authors: Cheng, Haoran; Peng, Liang; Yang, Zheng; Lin, Binbin; He, Xiaofei; Wu, Boxi
Format: Article
Language: English
Subjects: Computation; Feature extraction; History; Image processing; Laser radar; Monocular 3D object detection; Object detection; Optical flow; Optical flow (image analysis); Point cloud compression; Temporal information; Three-dimensional displays; Video
Publisher: IEEE, United States
DOI: 10.1109/TIP.2024.3378475
ISSN: 1057-7149
EISSN: 1941-0042
PMID: 38530731