Temporal Feature Fusion for 3D Detection in Monocular Video
Previous monocular 3D detection works focus on single-frame input in both training and inference. In real-world applications, temporal and motion information naturally exists in monocular video. It is valuable for 3D detection but under-explored in monocular works. In this paper, we propose a straightforward and effective method for temporal feature fusion, which exhibits low computation cost and excellent transferability, making it conveniently applicable to various monocular models. Specifically, with the help of optical flow, we transform the backbone features produced by prior frames and fuse them into the current frame. We introduce a scene feature propagating mechanism, which accumulates history scene features without extra time cost. In this process, occluded areas are removed via forward-backward scene consistency. Our method naturally introduces valuable temporal features, facilitating 3D reasoning in monocular 3D detection. Furthermore, accumulating history scene features via scene propagation mitigates the heavy computation overhead of video processing. Experiments are conducted on various baselines and demonstrate that the proposed method is model-agnostic and brings significant improvement to multiple types of single-frame methods.
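The abstract describes two mechanisms: warping prior-frame backbone features into the current frame with optical flow, and masking out occluded areas via forward-backward flow consistency before fusion. The snippet below is a minimal sketch of that idea in PyTorch, assuming the flows are given by some flow estimator; the function names, the tolerance value, and the convolutional fusion head (`fuse_conv`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of flow-guided temporal feature fusion with
# forward-backward consistency masking. All names here
# (warp_features, occlusion_mask, fuse_temporal) are hypothetical.
import torch
import torch.nn.functional as F


def warp_features(feat_prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp previous-frame features to the current frame.

    feat_prev: (B, C, H, W) backbone features from frame t-1.
    flow:      (B, 2, H, W) flow mapping each current pixel (x, y)
               to (x + flow_x, y + flow_y) in frame t-1.
    """
    _, _, h, w = feat_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat_prev.dtype, device=feat_prev.device),
        torch.arange(w, dtype=feat_prev.dtype, device=feat_prev.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                               # (B, 2, H, W)
    # grid_sample expects coordinates normalized to [-1, 1], shape (B, H, W, 2).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(feat_prev, grid, align_corners=True)


def occlusion_mask(flow_cur_to_prev: torch.Tensor,
                   flow_prev_to_cur: torch.Tensor,
                   tol: float = 1.0) -> torch.Tensor:
    """Forward-backward consistency check: a visible pixel's round trip
    t -> t-1 -> t cancels out; a large residual marks likely occlusion."""
    back = warp_features(flow_prev_to_cur, flow_cur_to_prev)
    err = (flow_cur_to_prev + back).norm(dim=1, keepdim=True)
    return (err < tol).to(flow_cur_to_prev.dtype)      # 1 = visible, 0 = occluded


def fuse_temporal(feat_cur, scene_prev, flow_cur_to_prev, flow_prev_to_cur,
                  fuse_conv):
    """One step of scene feature propagation: warp the accumulated scene
    feature to the current frame, zero out occluded regions, and fuse with
    the current backbone features. The output can be carried forward as the
    scene feature for the next frame, so history accumulates without
    re-processing old frames."""
    mask = occlusion_mask(flow_cur_to_prev, flow_prev_to_cur)
    warped = warp_features(scene_prev, flow_cur_to_prev) * mask
    return fuse_conv(torch.cat((feat_cur, warped), dim=1))  # e.g. Conv2d(2C -> C)
```

Under this reading, each frame incurs only one warp and one fusion regardless of how long the history is, which is consistent with the abstract's claim that accumulated scene features keep the video-processing overhead low.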
| Published in: | IEEE Transactions on Image Processing, 2024-01, Vol. 33, pp. 2665-2675 |
| --- | --- |
| Main Authors: | Cheng, Haoran; Peng, Liang; Yang, Zheng; Lin, Binbin; He, Xiaofei; Wu, Boxi |
| Format: | Article |
| Language: | English |
| Subjects: | Computation; Feature extraction; History; Image processing; Laser radar; Monocular 3D object detection; Object detection; Optical flow; Optical flow (image analysis); Point cloud compression; Temporal information; Three-dimensional displays; Video |
| DOI: | 10.1109/TIP.2024.3378475 |
| ISSN: | 1057-7149 |
| EISSN: | 1941-0042 |
| PMID: | 38530731 |