Vision-Language Navigation Policy Learning and Adaptation
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalizati...
Published in: | IEEE transactions on pattern analysis and machine intelligence 2021-12, Vol.43 (12), p.4205-4216 |
---|---|
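Main Authors: | Wang, Xin ; Huang, Qiuyuan ; Celikyilmaz, Asli ; Gao, Jianfeng ; Shen, Dinghan ; Wang, Yuan-Fang ; Wang, William Yang ; Zhang, Lei |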
Format: | Article |
Language: | English |
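Subjects: | Cognition ; Grounding ; imitation learning ; Learning ; Matching ; multimodal machine learning ; Natural languages ; Navigation ; reinforcement learning ; Task analysis ; Trajectory ; Vision ; Vision-language navigation ; Visualization |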
cited_by | cdi_FETCH-LOGICAL-c351t-f98d8d4851c2df3c55e4494b611b822fa9576b515f6c4d1c71ca78ece498e4353 |
---|---|
cites | cdi_FETCH-LOGICAL-c351t-f98d8d4851c2df3c55e4494b611b822fa9576b515f6c4d1c71ca78ece498e4353 |
container_end_page | 4216 |
container_issue | 12 |
container_start_page | 4205 |
container_title | IEEE transactions on pattern analysis and machine intelligence |
container_volume | 43 |
creator | Wang, Xin ; Huang, Qiuyuan ; Celikyilmaz, Asli ; Gao, Jianfeng ; Shen, Dinghan ; Wang, Yuan-Fang ; Wang, William Yang ; Zhang, Lei |
description | Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms baseline methods by 10 percent on Success Rate weighted by Path Length (SPL) and achieves state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores and adapts to unseen environments by imitating the agent's own good past decisions. We demonstrate that SIL can approximate a better and more efficient policy, which substantially narrows the success rate gap between seen and unseen environments (from 30.7 to 11.7 percent). |
doi_str_mv | 10.1109/TPAMI.2020.2972281 |
format | article |
fullrecord | recordid: TN_cdi_proquest_miscellaneous_2355955939 ; ieee_id: 8986691 ; pqid: 2592630555 ; pmid: 32054568 ; coden: ITPIDJ ; eissn: 1939-3539, 2160-9292 ; publisher: United States: IEEE ; rights: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 ; peer_reviewed ; tpages: 12 ; orcidid: https://orcid.org/0000-0001-6926-0538, https://orcid.org/0000-0003-2605-5504 ; linktohtml: https://ieeexplore.ieee.org/document/8986691 ; backlink: https://www.ncbi.nlm.nih.gov/pubmed/32054568 |
fulltext | fulltext |
identifier | ISSN: 0162-8828 |
ispartof | IEEE transactions on pattern analysis and machine intelligence, 2021-12, Vol.43 (12), p.4205-4216 |
issn | 0162-8828 1939-3539 2160-9292 |
language | eng |
recordid | cdi_proquest_miscellaneous_2355955939 |
source | IEEE Electronic Library (IEL) Journals |
subjects | Cognition ; Grounding ; imitation learning ; Learning ; Matching ; multimodal machine learning ; Natural languages ; Navigation ; reinforcement learning ; Task analysis ; Trajectory ; Vision ; Vision-language navigation ; Visualization |
title | Vision-Language Navigation Policy Learning and Adaptation |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T17%3A29%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Vision-Language%20Navigation%20Policy%20Learning%20and%20Adaptation&rft.jtitle=IEEE%20transactions%20on%20pattern%20analysis%20and%20machine%20intelligence&rft.au=Wang,%20Xin&rft.date=2021-12-01&rft.volume=43&rft.issue=12&rft.spage=4205&rft.epage=4216&rft.pages=4205-4216&rft.issn=0162-8828&rft.eissn=1939-3539&rft.coden=ITPIDJ&rft_id=info:doi/10.1109/TPAMI.2020.2972281&rft_dat=%3Cproquest_pubme%3E2592630555%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c351t-f98d8d4851c2df3c55e4494b611b822fa9576b515f6c4d1c71ca78ece498e4353%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2592630555&rft_id=info:pmid/32054568&rft_ieee_id=8986691&rfr_iscdi=true |