A Unified Framework for Depth-Assisted Monocular Object Pose Estimation
Published in: IEEE Access, 2024, Vol. 12, pp. 111723-111740
Main Authors: Hoang, Dinh-Cuong; Xuan Tan, Phan; Nguyen, Thu-Uyen; Pham, Hai-Nam; Nguyen, Chi-Minh; Bui, Son-Anh; Duong, Quang-Tri; Vu, van-Duc; Nguyen, van-Thiep; Duong, van-Hiep; Hoang, Ngoc-Anh; Phan, Khanh-Toan; Tran, Duc-Thanh; Ho, Ngoc-Trung; Tran, Cong-Trinh
Format: Article
Language: English
Publisher: IEEE
Subjects: Accuracy; Deep learning; Feature extraction; intelligent systems; Machine vision; Multitasking; Neural networks; Pose estimation; Robot vision systems; Semantic segmentation; Supervised learning; Task analysis
DOI: 10.1109/ACCESS.2024.3443148
ISSN/EISSN: 2169-3536
Online Access: https://ieeexplore.ieee.org/document/10634508
Abstract: Monocular Depth Estimation (MDE) and Object Pose Estimation (OPE) are important tasks in visual scene understanding. Traditionally, these challenges have been addressed independently, with separate deep neural networks designed for each task. However, we contend that MDE, which provides information about object distances from the camera, and OPE, which focuses on determining precise object position and orientation, are inherently connected. Combining these tasks in a unified approach facilitates the integration of spatial context, offering a more comprehensive understanding of object distribution in three-dimensional space. Consequently, this work addresses both challenges simultaneously, treating them as a multi-task learning problem. Our proposed solution is encapsulated in a Unified Framework for Depth-Assisted Monocular Object Pose Estimation. Leveraging Red-Green-Blue (RGB) images as input, our framework estimates the poses of multiple object instances alongside an instance-level depth map. During training, we utilize both depth and color images, but during inference, the model relies exclusively on color images. To enhance the depth-aware features crucial for robust object pose estimation, we introduce a depth estimation branch supervised by depth images. These features undergo further refinement through a cross-task attention module, contributing to the innovation of our method in significantly improving feature discriminability and robustness in object pose estimation. Through extensive experiments, our approach demonstrates competitive performance compared to state-of-the-art methods in object pose estimation. Moreover, our method operates in real-time, underscoring its efficiency and practical applicability in various scenarios. This unified framework not only advances the state of the art in monocular depth estimation and object pose estimation but also underscores the potential of multi-task learning for enhancing the understanding of complex visual scenes.
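The abstract describes a multi-task network that couples monocular depth estimation with object pose estimation: a depth branch supervised by ground-truth depth during training, a cross-task attention module that refines pose features with depth-aware features, and RGB-only inference. The following is a minimal, illustrative PyTorch sketch of that general idea only; the module names (DepthAssistedPoseNet, CrossTaskAttention, training_step), tensor shapes, the 7-dimensional pose output, the loss weighting, and the simplification to one pose per image are assumptions made for this sketch and are not taken from the paper.

```python
# Illustrative sketch only: a shared RGB encoder with a depth branch supervised
# by ground-truth depth at training time, plus cross-task attention that lets
# pose features attend to depth-aware features. Names, shapes, and loss weights
# are hypothetical; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossTaskAttention(nn.Module):
    """Pose features (queries) attend to depth features (keys/values)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, pose_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Flatten spatial maps (B, C, H, W) into (B, H*W, C) token sequences.
        b, c, h, w = pose_feat.shape
        q = pose_feat.flatten(2).transpose(1, 2)
        kv = depth_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)  # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)


class DepthAssistedPoseNet(nn.Module):
    def __init__(self, channels: int = 64, pose_dim: int = 7):  # e.g. quaternion + translation
        super().__init__()
        self.encoder = nn.Sequential(  # toy shared RGB backbone
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_branch = nn.Conv2d(channels, channels, 3, padding=1)  # depth-aware features
        self.depth_head = nn.Conv2d(channels, 1, 1)                      # predicted depth map
        self.pose_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.cross_attn = CrossTaskAttention(channels)
        self.pose_head = nn.Linear(channels, pose_dim)

    def forward(self, rgb: torch.Tensor):
        feat = self.encoder(rgb)
        depth_feat = F.relu(self.depth_branch(feat))
        depth_pred = self.depth_head(depth_feat)            # supervised only during training
        pose_feat = F.relu(self.pose_branch(feat))
        fused = self.cross_attn(pose_feat, depth_feat)      # refine pose features with depth cues
        pose_pred = self.pose_head(fused.mean(dim=(2, 3)))  # global pool -> one pose per image
        return pose_pred, depth_pred


def training_step(model, rgb, gt_pose, gt_depth, depth_weight=0.5):
    """Multi-task loss: pose regression plus auxiliary depth supervision."""
    pose_pred, depth_pred = model(rgb)
    gt_depth_small = F.interpolate(gt_depth, size=depth_pred.shape[-2:], mode="nearest")
    return F.l1_loss(pose_pred, gt_pose) + depth_weight * F.l1_loss(depth_pred, gt_depth_small)


if __name__ == "__main__":
    model = DepthAssistedPoseNet()
    rgb = torch.randn(2, 3, 64, 64)
    # Training uses RGB plus a depth image; inference below uses RGB only.
    loss = training_step(model, rgb, torch.randn(2, 7), torch.rand(2, 1, 64, 64))
    loss.backward()
    with torch.no_grad():
        pose_only, _ = model(rgb)  # RGB-only inference path
    print(loss.item(), pose_only.shape)
```

In this style of design the depth loss is simply dropped at test time, so deployment needs only the RGB stream, which is consistent with the training/inference split the abstract describes.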