ARDuP: Active Region Video Diffusion for Universal Policies

Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video...

Bibliographic Details
Main Authors: Huang, Shuaiyi, Levy, Mara, Jiang, Zhenyu, Anandkumar, Anima, Zhu, Yuke, Fan, Linxi, Huang, De-An, Shrivastava, Abhinav
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 8472
container_issue
container_start_page 8465
container_title
container_volume
creator Huang, Shuaiyi
Levy, Mara
Jiang, Zhenyu
Anandkumar, Anima
Zhu, Yuke
Fan, Linxi
Huang, De-An
Shrivastava, Abhinav
description Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.
doi_str_mv 10.1109/IROS58592.2024.10802264
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10802264</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10802264</ieee_id><sourcerecordid>10802264</sourcerecordid><originalsourceid>FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3</originalsourceid><addsrcrecordid>eNo1j9FKwzAYRqMwcGx9A8G8QOufP0mT6FXZnA4GG3V6O9I0kUhdpXGCb29Fvfo4cDjwEXLFoGAMzPW63j5KLQ0WCCgKBhoQS3FGMqOM5hK4UgrkOZkikzwHXZYXJEvpFQAYjIopp-S2qpen3Q2t3Ef89LT2L7E_0ufY-p4uYwin9MOhH-jTcRSGZDu667vook9zMgm2Sz772xnZr-72i4d8s71fL6pNHhWIPLhGOwhGIjrDnOY2NLL1lgvGWx6QWWZEo1qOLHBUAFp450bSUiP4wGfk8jcbvfeH9yG-2eHr8H-XfwORJEhS</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><source>IEEE Xplore All Conference Series</source><creator>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav</creator><creatorcontrib>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav</creatorcontrib><description>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. 
This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</description><identifier>EISSN: 2153-0866</identifier><identifier>EISBN: 9798350377705</identifier><identifier>DOI: 10.1109/IROS58592.2024.10802264</identifier><language>eng</language><publisher>IEEE</publisher><subject>Annotations ; Decoding ; Diffusion models ; Dynamics ; Intelligent robots ; Inverse problems ; Manuals ; Planning ; Training ; Visualization</subject><ispartof>Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, p.8465-8472</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10802264$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,777,781,786,787,27906,54536,54913</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10802264$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Huang, Shuaiyi</creatorcontrib><creatorcontrib>Levy, Mara</creatorcontrib><creatorcontrib>Jiang, Zhenyu</creatorcontrib><creatorcontrib>Anandkumar, Anima</creatorcontrib><creatorcontrib>Zhu, Yuke</creatorcontrib><creatorcontrib>Fan, Linxi</creatorcontrib><creatorcontrib>Huang, De-An</creatorcontrib><creatorcontrib>Shrivastava, 
Abhinav</creatorcontrib><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><title>Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems</title><addtitle>IROS</addtitle><description>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. 
We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</description><subject>Annotations</subject><subject>Decoding</subject><subject>Diffusion models</subject><subject>Dynamics</subject><subject>Intelligent robots</subject><subject>Inverse problems</subject><subject>Manuals</subject><subject>Planning</subject><subject>Training</subject><subject>Visualization</subject><issn>2153-0866</issn><isbn>9798350377705</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1j9FKwzAYRqMwcGx9A8G8QOufP0mT6FXZnA4GG3V6O9I0kUhdpXGCb29Fvfo4cDjwEXLFoGAMzPW63j5KLQ0WCCgKBhoQS3FGMqOM5hK4UgrkOZkikzwHXZYXJEvpFQAYjIopp-S2qpen3Q2t3Ef89LT2L7E_0ufY-p4uYwin9MOhH-jTcRSGZDu667vook9zMgm2Sz772xnZr-72i4d8s71fL6pNHhWIPLhGOwhGIjrDnOY2NLL1lgvGWx6QWWZEo1qOLHBUAFp450bSUiP4wGfk8jcbvfeH9yG-2eHr8H-XfwORJEhS</recordid><startdate>20241014</startdate><enddate>20241014</enddate><creator>Huang, Shuaiyi</creator><creator>Levy, Mara</creator><creator>Jiang, Zhenyu</creator><creator>Anandkumar, Anima</creator><creator>Zhu, Yuke</creator><creator>Fan, Linxi</creator><creator>Huang, De-An</creator><creator>Shrivastava, Abhinav</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>20241014</creationdate><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><author>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, 
Abhinav</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Annotations</topic><topic>Decoding</topic><topic>Diffusion models</topic><topic>Dynamics</topic><topic>Intelligent robots</topic><topic>Inverse problems</topic><topic>Manuals</topic><topic>Planning</topic><topic>Training</topic><topic>Visualization</topic><toplevel>online_resources</toplevel><creatorcontrib>Huang, Shuaiyi</creatorcontrib><creatorcontrib>Levy, Mara</creatorcontrib><creatorcontrib>Jiang, Zhenyu</creatorcontrib><creatorcontrib>Anandkumar, Anima</creatorcontrib><creatorcontrib>Zhu, Yuke</creatorcontrib><creatorcontrib>Fan, Linxi</creatorcontrib><creatorcontrib>Huang, De-An</creatorcontrib><creatorcontrib>Shrivastava, Abhinav</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore (Online service)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Huang, Shuaiyi</au><au>Levy, Mara</au><au>Jiang, Zhenyu</au><au>Anandkumar, Anima</au><au>Zhu, Yuke</au><au>Fan, Linxi</au><au>Huang, De-An</au><au>Shrivastava, Abhinav</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>ARDuP: Active Region Video Diffusion for Universal Policies</atitle><btitle>Proceedings of the ... 
IEEE/RSJ International Conference on Intelligent Robots and Systems</btitle><stitle>IROS</stitle><date>2024-10-14</date><risdate>2024</risdate><spage>8465</spage><epage>8472</epage><pages>8465-8472</pages><eissn>2153-0866</eissn><eisbn>9798350377705</eisbn><abstract>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</abstract><pub>IEEE</pub><doi>10.1109/IROS58592.2024.10802264</doi><tpages>8</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier EISSN: 2153-0866
ispartof Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, p.8465-8472
issn 2153-0866
language eng
recordid cdi_ieee_primary_10802264
source IEEE Xplore All Conference Series
subjects Annotations
Decoding
Diffusion models
Dynamics
Intelligent robots
Inverse problems
Manuals
Planning
Training
Visualization
title ARDuP: Active Region Video Diffusion for Universal Policies
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T21%3A57%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=ARDuP:%20Active%20Region%20Video%20Diffusion%20for%20Universal%20Policies&rft.btitle=Proceedings%20of%20the%20...%20IEEE/RSJ%20International%20Conference%20on%20Intelligent%20Robots%20and%20Systems&rft.au=Huang,%20Shuaiyi&rft.date=2024-10-14&rft.spage=8465&rft.epage=8472&rft.pages=8465-8472&rft.eissn=2153-0866&rft_id=info:doi/10.1109/IROS58592.2024.10802264&rft.eisbn=9798350377705&rft_dat=%3Cieee_CHZPO%3E10802264%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10802264&rfr_iscdi=true