Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization

In this paper, we present a 4D human-object interaction (4DHOI) model for solving three vision tasks jointly: i) event segmentation from a video sequence, ii) event recognition and parsing, and iii) contextual object localization. The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions of human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. On the time axis, the interactions are represented as a sequence of atomic event transitions with coherent objects. The 4DHOI model is a hierarchical spatial-temporal graph representation which can be used for inferring scene functionality and object affordance. The graph structures and parameters are learned using an ordered expectation maximization algorithm which mines the spatial-temporal structures of events from RGB-D video samples. Given an input RGB-D video, the inference is performed by a dynamic programming beam search algorithm which simultaneously carries out event segmentation, recognition, and object localization. We collected a large multiview RGB-D event dataset which contains 3,815 video sequences and 383,036 RGB-D frames captured by three RGB-D cameras. The experimental results on three challenging datasets demonstrate the strength of the proposed method.
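The abstract's central object is a hierarchical spatial-temporal graph: an event decomposes into a sequence of atomic events, each grounding a 3D human pose and the contextual objects it interacts with. As a rough Python sketch of that kind of representation (the class names, fields, and the Gaussian compatibility stub are assumptions for illustration, not the paper's definitions):

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ContextualObject:
    name: str                                  # e.g. "mug" (hypothetical label)
    position: Tuple[float, float, float]       # 3D location in the scene

@dataclass
class AtomicEvent:
    label: str                                 # e.g. "reach", "drink"
    joints: List[Tuple[float, float, float]]   # 3D human pose (skeleton joints)
    objects: List[ContextualObject] = field(default_factory=list)

@dataclass
class Event:
    label: str                                 # e.g. "drinking water"
    atomic_events: List[AtomicEvent] = field(default_factory=list)

def geometric_compatibility(ae: AtomicEvent, sigma: float = 0.3) -> float:
    """Toy stand-in for the model's geometric term: objects near the actor's
    hand (here, the last joint) score higher under a Gaussian falloff."""
    hand = ae.joints[-1]
    score = 0.0
    for obj in ae.objects:
        d2 = sum((a - b) ** 2 for a, b in zip(hand, obj.position))
        score += math.exp(-d2 / (2.0 * sigma ** 2))
    return score
```

Semantic co-occurrence would enter as a learned table over (atomic event, object) pairs, scored alongside this geometric term.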

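For inference, the abstract describes a dynamic programming beam search that carries out segmentation, recognition, and localization in one pass. The sketch below illustrates only the segmentation/recognition side of such a search; `frame_scores`, `transitions`, and the default log-probability floor are hypothetical stand-ins, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Hypothesis:
    log_prob: float = 0.0
    # (label, start_frame, end_frame) triples; the last one is the open segment
    segments: List[Tuple[str, int, int]] = field(default_factory=list)
    label: Optional[str] = None  # atomic event of the currently open segment

def beam_search(frame_scores: List[Dict[str, float]],
                transitions: Dict[Tuple[Optional[str], str], float],
                labels: List[str],
                beam_width: int = 10) -> List[Tuple[str, int, int]]:
    """frame_scores[t][label]: per-frame log-likelihood (a stub for the pose
    and object terms); transitions[(a, b)]: log P(b | a) between atomic
    events, with a = None at the start of the video."""
    beam = [Hypothesis()]
    for t, scores in enumerate(frame_scores):
        candidates = []
        for hyp in beam:
            for label in labels:
                lp = hyp.log_prob + scores[label]
                if hyp.label == label:
                    # Same atomic event: extend the open segment to frame t.
                    segs = hyp.segments[:-1] + [(label, hyp.segments[-1][1], t)]
                else:
                    # New atomic event: pay the transition cost, open a segment.
                    lp += transitions.get((hyp.label, label), math.log(1e-6))
                    segs = hyp.segments + [(label, t, t)]
                candidates.append(Hypothesis(lp, segs, label))
        # Keep only the top-scoring hypotheses -- the "beam".
        beam = sorted(candidates, key=lambda h: -h.log_prob)[:beam_width]
    return beam[0].segments
```

With, say, `labels = ["reach", "drink", "put_down"]` and per-frame scores from any classifier, the returned triples give event boundaries and labels jointly, rather than classifying a pre-segmented clip.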
Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017-06, Vol. 39 (6), p. 1165-1179
Main Authors: Wei, Ping; Zhao, Yibiao; Zheng, Nanning; Zhu, Song-Chun
Format: Article
Language: English
Subjects: Context modeling; Dynamic programming; Event recognition; Hidden Markov models; Human-object interaction; Localization; Object affordance; Object localization; Object recognition; Robots; Search algorithms; Segmentation; Semantics; Sequence segmentation; Solid modeling; Three-dimensional displays; Video sequences
ISSN: 0162-8828
EISSN: 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2016.2574712
PMID: 27254859
Publisher: IEEE, United States
Online Access: https://ieeexplore.ieee.org/document/7482729