Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization
In this paper, we present a 4D human-object interaction (4DHOI) model for solving three vision tasks jointly: i) event segmentation from a video sequence, ii) event recognition and parsing, and iii) contextual object localization. The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions of human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. On the time axis, the interactions are represented as a sequence of atomic event transitions with coherent objects. The 4DHOI model is a hierarchical spatial-temporal graph representation which can be used for inferring scene functionality and object affordance. The graph structures and parameters are learned using an ordered expectation maximization algorithm which mines the spatial-temporal structures of events from RGB-D video samples. Given an input RGB-D video, the inference is performed by a dynamic programming beam search algorithm which simultaneously carries out event segmentation, recognition, and object localization. We collected a large multiview RGB-D event dataset which contains 3,815 video sequences and 383,036 RGB-D frames captured by three RGB-D cameras. The experimental results on three challenging datasets demonstrate the strength of the proposed method.
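The inference step described in the abstract lends itself to a compact illustration. Below is a minimal Python sketch of dynamic-programming beam search over candidate event segmentations, keeping the top-scoring partial parses at each frame; the label set, feature layout, scoring stub, and function names are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of dynamic-programming beam search for joint event
# segmentation and recognition, in the spirit of the inference described
# above. The label set, feature layout, and scoring stub are hypothetical.
import heapq
from typing import Dict, List, Tuple

EVENT_LABELS = ["drink_water", "read_book", "use_computer"]  # assumed labels

def segment_score(frames: List[dict], start: int, end: int, label: str) -> float:
    """Stub log-score of frames[start:end] as one atomic event `label`.
    A real model would jointly score 3D human pose and contextual objects."""
    return -0.1 * (end - start)  # placeholder: mild length penalty

def beam_search(frames: List[dict], beam_width: int = 10,
                max_seg_len: int = 50) -> Tuple[float, List[Tuple[int, int, str]]]:
    """Return (score, [(start, end, label), ...]) for the best segmentation."""
    n = len(frames)
    # beams[t] holds up to beam_width (score, parse) pairs covering frames[0:t]
    beams: Dict[int, List[Tuple[float, list]]] = {0: [(0.0, [])]}
    for t in range(1, n + 1):
        candidates = []
        for s in range(max(0, t - max_seg_len), t):  # last segment is [s, t)
            for score, parse in beams.get(s, []):
                for label in EVENT_LABELS:
                    candidates.append((score + segment_score(frames, s, t, label),
                                       parse + [(s, t, label)]))
        beams[t] = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams[n], key=lambda c: c[0])

frames = [{} for _ in range(120)]  # stand-in for per-frame RGB-D features
score, parse = beam_search(frames)
print(score, parse[:3])
```

With a beam width of one this reduces to greedy segmentation; widening the beam trades computation for a closer approximation of the exact dynamic program.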
Saved in:
Published in: | IEEE transactions on pattern analysis and machine intelligence 2017-06, Vol.39 (6), p.1165-1179 |
---|---|
Main Authors: | Wei, Ping; Zhao, Yibiao; Zheng, Nanning; Zhu, Song-Chun |
Format: | Article |
Language: | English |
Subjects: | Context modeling; Dynamic programming; event recognition; Hidden Markov models; Human-object interaction; Localization; object affordance; object localization; Object recognition; Robots; Search algorithms; Segmentation; Semantics; sequence segmentation; Solid modeling; Three-dimensional displays; Video sequences |
Online Access: | Get full text |
container_end_page | 1179 |
container_issue | 6 |
container_start_page | 1165 |
container_title | IEEE transactions on pattern analysis and machine intelligence |
container_volume | 39 |
creator | Wei, Ping; Zhao, Yibiao; Zheng, Nanning; Zhu, Song-Chun
description | In this paper, we present a 4D human-object interaction (4DHOI) model for solving three vision tasks jointly: i) event segmentation from a video sequence, ii) event recognition and parsing, and iii) contextual object localization. The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions of human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. On the time axis, the interactions are represented as a sequence of atomic event transitions with coherent objects. The 4DHOI model is a hierarchical spatial-temporal graph representation which can be used for inferring scene functionality and object affordance. The graph structures and parameters are learned using an ordered expectation maximization algorithm which mines the spatial-temporal structures of events from RGB-D video samples. Given an input RGB-D video, the inference is performed by a dynamic programming beam search algorithm which simultaneously carries out event segmentation, recognition, and object localization. We collected a large multiview RGB-D event dataset which contains 3,815 video sequences and 383,036 RGB-D frames captured by three RGB-D cameras. The experimental results on three challenging datasets demonstrate the strength of the proposed method. |
doi_str_mv | 10.1109/TPAMI.2016.2574712 |
format | article |
fulltext | fulltext |
identifier | ISSN: 0162-8828; EISSN: 1939-3539; EISSN: 2160-9292; DOI: 10.1109/TPAMI.2016.2574712; PMID: 27254859; CODEN: ITPIDJ
ispartof | IEEE transactions on pattern analysis and machine intelligence, 2017-06, Vol.39 (6), p.1165-1179 |
issn | 0162-8828 1939-3539 2160-9292 |
language | eng |
recordid | cdi_proquest_miscellaneous_1826692464 |
source | IEEE Electronic Library (IEL) Journals |
subjects | Context modeling; Dynamic programming; event recognition; Hidden Markov models; Human-object interaction; Localization; object affordance; object localization; Object recognition; Robots; Search algorithms; Segmentation; Semantics; sequence segmentation; Solid modeling; Three-dimensional displays; Video sequences
title | Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T15%3A04%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%204D%20Human-Object%20Interactions%20for%20Joint%20Event%20Segmentation,%20Recognition,%20and%20Object%20Localization&rft.jtitle=IEEE%20transactions%20on%20pattern%20analysis%20and%20machine%20intelligence&rft.au=Wei,%20Ping&rft.date=2017-06-01&rft.volume=39&rft.issue=6&rft.spage=1165&rft.epage=1179&rft.pages=1165-1179&rft.issn=0162-8828&rft.eissn=1939-3539&rft.coden=ITPIDJ&rft_id=info:doi/10.1109/TPAMI.2016.2574712&rft_dat=%3Cproquest_ieee_%3E1897022584%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c351t-8f77ab45e43bed85c4a71ec0b891682e6a48973ae736c92dc042dff288aaffed3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1897022584&rft_id=info:pmid/27254859&rft_ieee_id=7482729&rfr_iscdi=true |