ARDuP: Active Region Video Diffusion for Universal Policies
Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video...
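Main Authors: | Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav |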
---|---|
Format: | Conference Proceeding |
Language: | English |
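Subjects: | Annotations ; Decoding ; Diffusion models ; Dynamics ; Intelligent robots ; Inverse problems ; Manuals ; Planning ; Training ; Visualization |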
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 8472 |
container_issue | |
container_start_page | 8465 |
container_title | |
container_volume | |
creator | Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav |
description | Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans. |
doi_str_mv | 10.1109/IROS58592.2024.10802264 |
format | conference_proceeding |
fullrecord | <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10802264</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10802264</ieee_id><sourcerecordid>10802264</sourcerecordid><originalsourceid>FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3</originalsourceid><addsrcrecordid>eNo1j9FKwzAYRqMwcGx9A8G8QOufP0mT6FXZnA4GG3V6O9I0kUhdpXGCb29Fvfo4cDjwEXLFoGAMzPW63j5KLQ0WCCgKBhoQS3FGMqOM5hK4UgrkOZkikzwHXZYXJEvpFQAYjIopp-S2qpen3Q2t3Ef89LT2L7E_0ufY-p4uYwin9MOhH-jTcRSGZDu667vook9zMgm2Sz772xnZr-72i4d8s71fL6pNHhWIPLhGOwhGIjrDnOY2NLL1lgvGWx6QWWZEo1qOLHBUAFp450bSUiP4wGfk8jcbvfeH9yG-2eHr8H-XfwORJEhS</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><source>IEEE Xplore All Conference Series</source><creator>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav</creator><creatorcontrib>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, Abhinav</creatorcontrib><description>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. 
This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</description><identifier>EISSN: 2153-0866</identifier><identifier>EISBN: 9798350377705</identifier><identifier>DOI: 10.1109/IROS58592.2024.10802264</identifier><language>eng</language><publisher>IEEE</publisher><subject>Annotations ; Decoding ; Diffusion models ; Dynamics ; Intelligent robots ; Inverse problems ; Manuals ; Planning ; Training ; Visualization</subject><ispartof>Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, p.8465-8472</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10802264$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,777,781,786,787,27906,54536,54913</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10802264$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Huang, Shuaiyi</creatorcontrib><creatorcontrib>Levy, Mara</creatorcontrib><creatorcontrib>Jiang, Zhenyu</creatorcontrib><creatorcontrib>Anandkumar, Anima</creatorcontrib><creatorcontrib>Zhu, Yuke</creatorcontrib><creatorcontrib>Fan, Linxi</creatorcontrib><creatorcontrib>Huang, De-An</creatorcontrib><creatorcontrib>Shrivastava, 
Abhinav</creatorcontrib><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><title>Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems</title><addtitle>IROS</addtitle><description>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. 
We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</description><subject>Annotations</subject><subject>Decoding</subject><subject>Diffusion models</subject><subject>Dynamics</subject><subject>Intelligent robots</subject><subject>Inverse problems</subject><subject>Manuals</subject><subject>Planning</subject><subject>Training</subject><subject>Visualization</subject><issn>2153-0866</issn><isbn>9798350377705</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1j9FKwzAYRqMwcGx9A8G8QOufP0mT6FXZnA4GG3V6O9I0kUhdpXGCb29Fvfo4cDjwEXLFoGAMzPW63j5KLQ0WCCgKBhoQS3FGMqOM5hK4UgrkOZkikzwHXZYXJEvpFQAYjIopp-S2qpen3Q2t3Ef89LT2L7E_0ufY-p4uYwin9MOhH-jTcRSGZDu667vook9zMgm2Sz772xnZr-72i4d8s71fL6pNHhWIPLhGOwhGIjrDnOY2NLL1lgvGWx6QWWZEo1qOLHBUAFp450bSUiP4wGfk8jcbvfeH9yG-2eHr8H-XfwORJEhS</recordid><startdate>20241014</startdate><enddate>20241014</enddate><creator>Huang, Shuaiyi</creator><creator>Levy, Mara</creator><creator>Jiang, Zhenyu</creator><creator>Anandkumar, Anima</creator><creator>Zhu, Yuke</creator><creator>Fan, Linxi</creator><creator>Huang, De-An</creator><creator>Shrivastava, Abhinav</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>20241014</creationdate><title>ARDuP: Active Region Video Diffusion for Universal Policies</title><author>Huang, Shuaiyi ; Levy, Mara ; Jiang, Zhenyu ; Anandkumar, Anima ; Zhu, Yuke ; Fan, Linxi ; Huang, De-An ; Shrivastava, 
Abhinav</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Annotations</topic><topic>Decoding</topic><topic>Diffusion models</topic><topic>Dynamics</topic><topic>Intelligent robots</topic><topic>Inverse problems</topic><topic>Manuals</topic><topic>Planning</topic><topic>Training</topic><topic>Visualization</topic><toplevel>online_resources</toplevel><creatorcontrib>Huang, Shuaiyi</creatorcontrib><creatorcontrib>Levy, Mara</creatorcontrib><creatorcontrib>Jiang, Zhenyu</creatorcontrib><creatorcontrib>Anandkumar, Anima</creatorcontrib><creatorcontrib>Zhu, Yuke</creatorcontrib><creatorcontrib>Fan, Linxi</creatorcontrib><creatorcontrib>Huang, De-An</creatorcontrib><creatorcontrib>Shrivastava, Abhinav</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore (Online service)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Huang, Shuaiyi</au><au>Levy, Mara</au><au>Jiang, Zhenyu</au><au>Anandkumar, Anima</au><au>Zhu, Yuke</au><au>Fan, Linxi</au><au>Huang, De-An</au><au>Shrivastava, Abhinav</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>ARDuP: Active Region Video Diffusion for Universal Policies</atitle><btitle>Proceedings of the ... 
IEEE/RSJ International Conference on Intelligent Robots and Systems</btitle><stitle>IROS</stitle><date>2024-10-14</date><risdate>2024</risdate><spage>8465</spage><epage>8472</epage><pages>8465-8472</pages><eissn>2153-0866</eissn><eisbn>9798350377705</eisbn><abstract>Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.</abstract><pub>IEEE</pub><doi>10.1109/IROS58592.2024.10802264</doi><tpages>8</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2153-0866 |
ispartof | Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, p.8465-8472 |
issn | 2153-0866 |
language | eng |
recordid | cdi_ieee_primary_10802264 |
source | IEEE Xplore All Conference Series |
subjects | Annotations ; Decoding ; Diffusion models ; Dynamics ; Intelligent robots ; Inverse problems ; Manuals ; Planning ; Training ; Visualization |
title | ARDuP: Active Region Video Diffusion for Universal Policies |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T21%3A57%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=ARDuP:%20Active%20Region%20Video%20Diffusion%20for%20Universal%20Policies&rft.btitle=Proceedings%20of%20the%20...%20IEEE/RSJ%20International%20Conference%20on%20Intelligent%20Robots%20and%20Systems&rft.au=Huang,%20Shuaiyi&rft.date=2024-10-14&rft.spage=8465&rft.epage=8472&rft.pages=8465-8472&rft.eissn=2153-0866&rft_id=info:doi/10.1109/IROS58592.2024.10802264&rft.eisbn=9798350377705&rft_dat=%3Cieee_CHZPO%3E10802264%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i704-fcb8c0f9522c91c83afb5dea3413d3f21a194b7d321f3270084ecc32185820ef3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10802264&rfr_iscdi=true |