Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization
In this paper, we present a 4D human-object interaction (4DHOI) model for solving three vision tasks jointly: i) event segmentation from a video sequence, ii) event recognition and parsing, and iii) contextual object localization. The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions of human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. On the time axis, the interactions are represented as a sequence of atomic event transitions with coherent objects. The 4DHOI model is a hierarchical spatial-temporal graph representation which can be used for inferring scene functionality and object affordance. The graph structures and parameters are learned using an ordered expectation maximization algorithm which mines the spatial-temporal structures of events from RGB-D video samples. Given an input RGB-D video, the inference is performed by a dynamic programming beam search algorithm which simultaneously carries out event segmentation, recognition, and object localization. We collected a large multiview RGB-D event dataset which contains 3,815 video sequences and 383,036 RGB-D frames captured by three RGB-D cameras. The experimental results on three challenging datasets demonstrate the strength of the proposed method.
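The inference step described in the abstract lends itself to a compact illustration. Below is a minimal Python sketch of dynamic-programming beam search over candidate event segmentations, keeping the top-scoring partial parses at each frame; the label set, feature layout, scoring stub, and function names are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of dynamic-programming beam search for joint event
# segmentation and recognition, in the spirit of the inference described
# above. The label set, feature layout, and scoring stub are hypothetical.
import heapq
from typing import Dict, List, Tuple

EVENT_LABELS = ["drink_water", "read_book", "use_computer"]  # assumed labels

def segment_score(frames: List[dict], start: int, end: int, label: str) -> float:
    """Stub log-score of frames[start:end] as one atomic event `label`.
    A real model would jointly score 3D human pose and contextual objects."""
    return -0.1 * (end - start)  # placeholder: mild length penalty

def beam_search(frames: List[dict], beam_width: int = 10,
                max_seg_len: int = 50) -> Tuple[float, List[Tuple[int, int, str]]]:
    """Return (score, [(start, end, label), ...]) for the best segmentation."""
    n = len(frames)
    # beams[t] holds up to beam_width (score, parse) pairs covering frames[0:t]
    beams: Dict[int, List[Tuple[float, list]]] = {0: [(0.0, [])]}
    for t in range(1, n + 1):
        candidates = []
        for s in range(max(0, t - max_seg_len), t):  # last segment is [s, t)
            for score, parse in beams.get(s, []):
                for label in EVENT_LABELS:
                    candidates.append((score + segment_score(frames, s, t, label),
                                       parse + [(s, t, label)]))
        beams[t] = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams[n], key=lambda c: c[0])

frames = [{} for _ in range(120)]  # stand-in for per-frame RGB-D features
score, parse = beam_search(frames)
print(score, parse[:3])
```

With a beam width of one this reduces to greedy segmentation; widening the beam trades computation for a closer approximation of the exact dynamic program.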
Saved in:
Published in: | IEEE transactions on pattern analysis and machine intelligence 2017-06, Vol.39 (6), p.1165-1179 |
---|---|
Main Authors: | Wei, Ping; Zhao, Yibiao; Zheng, Nanning; Zhu, Song-Chun |
Format: | Article |
Language: | English |
Subjects: | Context modeling; Dynamic programming; event recognition; Hidden Markov models; Human-object interaction; Localization; object affordance; object localization; Object recognition; Robots; Search algorithms; Segmentation; Semantics; sequence segmentation; Solid modeling; Three-dimensional displays; Video sequences |
Online Access: | Get full text |
container_end_page | 1179 |
container_issue | 6 |
container_start_page | 1165 |
container_title | IEEE transactions on pattern analysis and machine intelligence |
container_volume | 39 |
creator | Wei, Ping; Zhao, Yibiao; Zheng, Nanning; Zhu, Song-Chun
description | In this paper, we present a 4D human-object interaction (4DHOI) model for solving three vision tasks jointly: i) event segmentation from a video sequence, ii) event recognition and parsing, and iii) contextual object localization. The 4DHOI model represents the geometric, temporal, and semantic relations in daily events involving human-object interactions. In 3D space, the interactions of human poses and contextual objects are modeled by semantic co-occurrence and geometric compatibility. On the time axis, the interactions are represented as a sequence of atomic event transitions with coherent objects. The 4DHOI model is a hierarchical spatial-temporal graph representation which can be used for inferring scene functionality and object affordance. The graph structures and parameters are learned using an ordered expectation maximization algorithm which mines the spatial-temporal structures of events from RGB-D video samples. Given an input RGB-D video, the inference is performed by a dynamic programming beam search algorithm which simultaneously carries out event segmentation, recognition, and object localization. We collected a large multiview RGB-D event dataset which contains 3,815 video sequences and 383,036 RGB-D frames captured by three RGB-D cameras. The experimental results on three challenging datasets demonstrate the strength of the proposed method. |
doi_str_mv | 10.1109/TPAMI.2016.2574712 |
format | article |
fulltext | fulltext |
identifier | ISSN: 0162-8828; EISSN: 1939-3539; EISSN: 2160-9292; DOI: 10.1109/TPAMI.2016.2574712; PMID: 27254859; CODEN: ITPIDJ
ispartof | IEEE transactions on pattern analysis and machine intelligence, 2017-06, Vol.39 (6), p.1165-1179 |
issn | 0162-8828 1939-3539 2160-9292 |
language | eng |
recordid | cdi_proquest_miscellaneous_1826692464 |
source | IEEE Electronic Library (IEL) Journals |
subjects | Context modeling; Dynamic programming; event recognition; Hidden Markov models; Human-object interaction; Localization; object affordance; object localization; Object recognition; Robots; Search algorithms; Segmentation; Semantics; sequence segmentation; Solid modeling; Three-dimensional displays; Video sequences
title | Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T15%3A04%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%204D%20Human-Object%20Interactions%20for%20Joint%20Event%20Segmentation,%20Recognition,%20and%20Object%20Localization&rft.jtitle=IEEE%20transactions%20on%20pattern%20analysis%20and%20machine%20intelligence&rft.au=Wei,%20Ping&rft.date=2017-06-01&rft.volume=39&rft.issue=6&rft.spage=1165&rft.epage=1179&rft.pages=1165-1179&rft.issn=0162-8828&rft.eissn=1939-3539&rft.coden=ITPIDJ&rft_id=info:doi/10.1109/TPAMI.2016.2574712&rft_dat=%3Cproquest_ieee_%3E1897022584%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c351t-8f77ab45e43bed85c4a71ec0b891682e6a48973ae736c92dc042dff288aaffed3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1897022584&rft_id=info:pmid/27254859&rft_ieee_id=7482729&rfr_iscdi=true |