ERNet: An Efficient and Reliable Human-Object Interaction Detection Network

Human-Object Interaction (HOI) detection recognizes how persons interact with objects, which is advantageous in autonomous systems such as self-driving vehicles and collaborative robots. However, current HOI detectors are often plagued by model inefficiency and unreliability when making a prediction, which limits their potential in real-world scenarios. In this paper, we address these challenges by proposing ERNet, an end-to-end trainable convolutional-transformer network for HOI detection. The proposed model employs an efficient multi-scale deformable attention mechanism to effectively capture vital HOI features. We also put forward a novel detection attention module to adaptively generate semantically rich instance and interaction tokens. These tokens undergo pre-emptive detections to produce initial region and vector proposals, which also serve as queries that enhance the feature refinement process in the transformer decoders. Several impactful enhancements are also applied to improve HOI representation learning. Additionally, we utilize a predictive uncertainty estimation framework in the instance and interaction classification heads to quantify the uncertainty behind each prediction. By doing so, we can accurately and reliably predict HOIs even under challenging scenarios. Experimental results on the HICO-Det, V-COCO, and HOI-A datasets demonstrate that the proposed model achieves state-of-the-art performance in detection accuracy and training efficiency. Code is publicly available at https://github.com/Monash-CyPhi-AI-Research-Lab/ernet.
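
The abstract does not spell out how the predictive uncertainty estimation in the classification heads is realized, so the sketch below illustrates the general idea with Monte Carlo dropout, a common stand-in technique. The module name, class count, token dimensions, and sampling budget are all illustrative assumptions, not ERNet's actual implementation.

```python
# Hypothetical sketch: an interaction classification head that quantifies
# predictive uncertainty. MC dropout stands in for whatever framework ERNet
# actually uses; every name and dimension here is an assumption.
import torch
import torch.nn as nn

class UncertaintyAwareHead(nn.Module):
    """Scores interaction tokens and estimates per-class uncertainty by
    sampling the classifier with dropout kept active at inference time."""

    def __init__(self, token_dim: int, num_classes: int, p_drop: float = 0.1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(token_dim, token_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(token_dim, num_classes),
        )

    @torch.no_grad()
    def predict_with_uncertainty(self, tokens: torch.Tensor, n_samples: int = 10):
        """tokens: (batch, num_tokens, token_dim) decoder outputs.
        Returns mean class probabilities and their per-class variance."""
        self.train()  # keep dropout stochastic so each forward pass differs
        probs = torch.stack(
            [self.classifier(tokens).sigmoid() for _ in range(n_samples)]
        )  # (n_samples, batch, num_tokens, num_classes)
        return probs.mean(dim=0), probs.var(dim=0)

# Dummy usage: 100 interaction tokens of width 256, 117 verb classes (HICO-Det).
head = UncertaintyAwareHead(token_dim=256, num_classes=117)
tokens = torch.randn(2, 100, 256)
mean_probs, var_probs = head.predict_with_uncertainty(tokens)
```

High-variance predictions can then be down-weighted or flagged rather than trusted blindly, which is one way to read the abstract's claim of reliable HOI prediction under challenging scenarios.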

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023, Vol. 32, pp. 964-979
Main Authors: Lim, JunYi; Baskaran, Vishnu Monn; Lim, Joanne Mun-Yee; Wong, KokSheik; See, John; Tistarelli, Massimo
Format: Article
Language: English
Online Access: https://ieeexplore.ieee.org/document/10026602
DOI: 10.1109/TIP.2022.3231528
ISSN: 1057-7149
EISSN: 1941-0042
PMID: 37022006
CODEN: IIPRE4
Publisher: IEEE, United States
Source: IEEE Electronic Library (IEL) Journals
Subjects: Adaptation models; Attention; Autonomous cars; Decoders; Decoding; deformable attention; Deformation effects; Estimation; Feature extraction; Formability; Human-object interaction detection; Humans; Object recognition; Query processing; Training; transformer; Transformers; Uncertainty; uncertainty estimation