
Vision-Language Navigation Policy Learning and Adaptation

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021-12, Vol. 43 (12), pp. 4205-4216
Main Authors: Wang, Xin; Huang, Qiuyuan; Celikyilmaz, Asli; Gao, Jianfeng; Shen, Dinghan; Wang, Yuan-Fang; Wang, William Yang; Zhang, Lei
Format: Article
Language:English
Subjects: Cognition; Grounding; Imitation learning; Matching; Multimodal machine learning; Natural languages; Navigation; Reinforcement learning; Task analysis; Trajectory; Vision; Vision-language navigation; Visualization
DOI: 10.1109/TPAMI.2020.2972281
ISSN: 0162-8828
EISSN: 1939-3539; 2160-9292
PMID: 32054568
Online Access: https://ieeexplore.ieee.org/document/8986691

Abstract:
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms baseline methods by 10 percent on Success Rate weighted by Path Length (SPL) and achieves state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores and adapts to unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL approximates a better and more efficient policy, substantially reducing the success-rate gap between seen and unseen environments (from 30.7 to 11.7 percent).
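
The abstract's description of the matching critic suggests a reconstruction-style intrinsic reward: a trajectory is scored by how well it can reproduce the instruction that generated it. The PyTorch sketch below illustrates that idea based only on the abstract, not the authors' released code; the encoder/decoder shapes and the mixing weight `delta` are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class MatchingCritic(nn.Module):
    """Illustrative critic: scores instruction-trajectory alignment
    as the log-likelihood of reconstructing the instruction from the
    executed trajectory (higher -> better global match)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.traj_encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.instr_decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def intrinsic_reward(self, traj_feats, instr_embeds, instr_ids):
        # Encode the trajectory, then decode the instruction from it.
        _, (h, c) = self.traj_encoder(traj_feats)        # (B, T, H) -> state
        dec_out, _ = self.instr_decoder(instr_embeds, (h, c))
        log_probs = self.out(dec_out).log_softmax(-1)    # (B, L, V)
        # Per-trajectory mean token log-likelihood as a scalar reward.
        token_ll = log_probs.gather(-1, instr_ids.unsqueeze(-1)).squeeze(-1)
        return token_ll.mean(dim=1)                      # (B,)

def mixed_reward(extrinsic, intrinsic, delta=0.5):
    # Blend the environment's extrinsic reward (e.g., progress toward the
    # goal) with the critic's intrinsic reward; `delta` is an assumed weight.
    return extrinsic + delta * intrinsic
```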
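
SIL, as summarized above, lets the agent explore an unseen environment without ground-truth routes: it samples several trajectories per instruction, keeps the one its own matching critic scores highest, and imitates that trajectory. The loop below is a hypothetical sketch of that procedure; the `navigator`, `critic`, `env`, and `replay_buffer` interfaces are illustrative stand-ins, not the paper's actual API.

```python
def sil_update(navigator, critic, env, instruction, replay_buffer,
               num_samples=4):
    # 1. Explore: sample candidate trajectories for the same instruction.
    candidates = [navigator.rollout(env, instruction, sample=True)
                  for _ in range(num_samples)]

    # 2. Self-supervise: the matching critic picks the best trajectory,
    #    acting as a stand-in for the missing human demonstration.
    scores = [critic.score(instruction, traj) for traj in candidates]
    best = candidates[scores.index(max(scores))]
    replay_buffer.append((instruction, best))

    # 3. Imitate: behavior cloning on the stored "good" trajectories.
    instr, traj = replay_buffer.sample()
    loss = -navigator.log_prob(instr, traj).mean()
    loss.backward()
    navigator.optimizer.step()
    navigator.optimizer.zero_grad()
    return loss.item()
```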
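
The reported 10 percent gain is measured in SPL (Success Rate weighted by Path Length). For reference, a minimal implementation of the standard SPL metric from the VLN literature (Anderson et al., 2018) is sketched below; variable names are illustrative.

```python
def spl(successes, shortest_lengths, path_lengths):
    """successes: 1/0 per episode; lengths in meters.
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)  # penalize detours on successful episodes
    return total / len(successes)

# Example: a success via a 12 m path on a 10 m shortest route scores 10/12.
print(spl([1, 0, 1], [10.0, 8.0, 15.0], [12.0, 20.0, 15.0]))
# -> (10/12 + 0 + 1) / 3 ≈ 0.611
```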