
A Unified Framework for Depth-Assisted Monocular Object Pose Estimation

Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, p. 111723-111740
Main Authors: Hoang, Dinh-Cuong; Xuan Tan, Phan; Nguyen, Thu-Uyen; Pham, Hai-Nam; Nguyen, Chi-Minh; Bui, Son-Anh; Duong, Quang-Tri; Vu, van-Duc; Nguyen, van-Thiep; Duong, van-Hiep; Hoang, Ngoc-Anh; Phan, Khanh-Toan; Tran, Duc-Thanh; Ho, Ngoc-Trung; Tran, Cong-Trinh
Format: Article
Language: English
Subjects: Accuracy; Deep learning; Feature extraction; intelligent systems; Machine vision; Multitasking; Neural networks; Pose estimation; Robot vision systems; Semantic segmentation; Supervised learning; Task analysis
container_end_page 111740
container_issue
container_start_page 111723
container_title IEEE access
container_volume 12
creator Hoang, Dinh-Cuong
Xuan Tan, Phan
Nguyen, Thu-Uyen
Pham, Hai-Nam
Nguyen, Chi-Minh
Bui, Son-Anh
Duong, Quang-Tri
Vu, van-Duc
Nguyen, van-Thiep
Duong, van-Hiep
Hoang, Ngoc-Anh
Phan, Khanh-Toan
Tran, Duc-Thanh
Ho, Ngoc-Trung
Tran, Cong-Trinh
description Monocular Depth Estimation (MDE) and Object Pose Estimation (OPE) are important tasks in visual scene understanding. Traditionally, these challenges have been addressed independently, with separate deep neural networks designed for each task. However, we contend that MDE, which provides information about object distances from the camera, and OPE, which focuses on determining precise object position and orientation, are inherently connected. Combining these tasks in a unified approach facilitates the integration of spatial context, offering a more comprehensive understanding of object distribution in three-dimensional space. Consequently, this work addresses both challenges simultaneously, treating them as a multi-task learning problem. Our proposed solution is encapsulated in a Unified Framework for Depth-Assisted Monocular Object Pose Estimation. Leveraging Red-Green-Blue (RGB) images as input, our framework estimates pose of multiple object instances alongside an instance-level depth map. During training, we utilize both depth and color images, but during inference, the model relies exclusively on color images. To enhance the depth-aware features crucial for robust object pose estimation, we introduce a depth estimation branch supervised by depth images. These features undergo further refinement through a cross-task attention module, contributing to the innovation of our method in significantly improving feature discriminability and robustness in object pose estimation. Through extensive experiments, our approach demonstrates competitive performance compared to state-of-the-art methods in object pose estimation. Moreover, our method operates in real-time, underscoring its efficiency and practical applicability in various scenarios. This unified framework not only advances the state of the art in monocular depth estimation and object pose estimation but also underscores the potential of multi-task learning for enhancing the understanding of complex visual scenes.
doi_str_mv 10.1109/ACCESS.2024.3443148
format article
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2024, Vol.12, p.111723-111740
issn 2169-3536
2169-3536
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_97e0579edc2f4935916bbf4c13471bb8
source IEEE Open Access Journals
subjects Accuracy
Deep learning
Feature extraction
intelligent systems
Machine vision
Multitasking
Neural networks
Pose estimation
Robot vision systems
Semantic segmentation
Supervised learning
Task analysis
title A Unified Framework for Depth-Assisted Monocular Object Pose Estimation