
Vision-Language Navigation Policy Learning and Adaptation

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021-12, Vol. 43 (12), pp. 4205-4216
Main Authors: Wang, Xin; Huang, Qiuyuan; Celikyilmaz, Asli; Gao, Jianfeng; Shen, Dinghan; Wang, Yuan-Fang; Wang, William Yang; Zhang, Lei
Format: Article
Language:English
Subjects: Cognition; Grounding; Imitation learning; Matching; Multimodal machine learning; Natural languages; Navigation; Reinforcement learning; Task analysis; Trajectory; Vision; Vision-language navigation; Visualization
DOI: 10.1109/TPAMI.2020.2972281
ISSN: 0162-8828
EISSN: 1939-3539; 2160-9292
PMID: 32054568
Online Access: https://ieeexplore.ieee.org/document/8986691

Abstract:
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms baseline methods by 10 percent on Success Rate weighted by Path Length (SPL) and achieves state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores and adapts to unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL approximates a better and more efficient policy, substantially reducing the success-rate gap between seen and unseen environments (from 30.7 to 11.7 percent).
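
The abstract's description of the matching critic suggests a reconstruction-style intrinsic reward: a trajectory is scored by how well it can reproduce the instruction that generated it. The PyTorch sketch below illustrates that idea based only on the abstract, not the authors' released code; the encoder/decoder shapes and the mixing weight `delta` are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class MatchingCritic(nn.Module):
    """Illustrative critic: scores instruction-trajectory alignment
    as the log-likelihood of reconstructing the instruction from the
    executed trajectory (higher -> better global match)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.traj_encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.instr_decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def intrinsic_reward(self, traj_feats, instr_embeds, instr_ids):
        # Encode the trajectory, then decode the instruction from it.
        _, (h, c) = self.traj_encoder(traj_feats)        # (B, T, H) -> state
        dec_out, _ = self.instr_decoder(instr_embeds, (h, c))
        log_probs = self.out(dec_out).log_softmax(-1)    # (B, L, V)
        # Per-trajectory mean token log-likelihood as a scalar reward.
        token_ll = log_probs.gather(-1, instr_ids.unsqueeze(-1)).squeeze(-1)
        return token_ll.mean(dim=1)                      # (B,)

def mixed_reward(extrinsic, intrinsic, delta=0.5):
    # Blend the environment's extrinsic reward (e.g., progress toward the
    # goal) with the critic's intrinsic reward; `delta` is an assumed weight.
    return extrinsic + delta * intrinsic
```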
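
SIL, as summarized above, lets the agent explore an unseen environment without ground-truth routes: it samples several trajectories per instruction, keeps the one its own matching critic scores highest, and imitates that trajectory. The loop below is a hypothetical sketch of that procedure; the `navigator`, `critic`, `env`, and `replay_buffer` interfaces are illustrative stand-ins, not the paper's actual API.

```python
def sil_update(navigator, critic, env, instruction, replay_buffer,
               num_samples=4):
    # 1. Explore: sample candidate trajectories for the same instruction.
    candidates = [navigator.rollout(env, instruction, sample=True)
                  for _ in range(num_samples)]

    # 2. Self-supervise: the matching critic picks the best trajectory,
    #    acting as a stand-in for the missing human demonstration.
    scores = [critic.score(instruction, traj) for traj in candidates]
    best = candidates[scores.index(max(scores))]
    replay_buffer.append((instruction, best))

    # 3. Imitate: behavior cloning on the stored "good" trajectories.
    instr, traj = replay_buffer.sample()
    loss = -navigator.log_prob(instr, traj).mean()
    loss.backward()
    navigator.optimizer.step()
    navigator.optimizer.zero_grad()
    return loss.item()
```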
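
The reported 10 percent gain is measured in SPL (Success Rate weighted by Path Length). For reference, a minimal implementation of the standard SPL metric from the VLN literature (Anderson et al., 2018) is sketched below; variable names are illustrative.

```python
def spl(successes, shortest_lengths, path_lengths):
    """successes: 1/0 per episode; lengths in meters.
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)  # penalize detours on successful episodes
    return total / len(successes)

# Example: a success via a 12 m path on a 10 m shortest route scores 10/12.
print(spl([1, 0, 1], [10.0, 8.0, 15.0], [12.0, 20.0, 15.0]))
# -> (10/12 + 0 + 1) / 3 ≈ 0.611
```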