Coarse-to-fine fusion for language grounding in 3D navigation

Bibliographic Details
Published in: Knowledge-Based Systems, 2023-10, Vol. 277, p. 110785, Article 110785
Main Authors: Nguyen, Thanh Tin; Vo, Anh H.; Choi, Soo-Mi; Kim, Yong-Guk
Format: Article
Language: English
Description
Summary: We present a new network with which an agent navigates a 3D environment to find a target object according to a language-based instruction. The task is challenging because the agent has to understand the instruction correctly and take a series of actions to locate the target among other objects without colliding with obstacles. The core of the proposed network is a coarse-to-fine fusion model that fuses language and vision, together with an autoencoder that encodes visual information effectively. An asynchronous reinforcement learning algorithm is then used to coordinate detailed actions to complete the navigation task. Extensive evaluation on three different levels of the navigation task in the 3D ViZDoom environment suggests that our model outperforms state-of-the-art methods. To see whether the proposed network can handle a real-world 3D environment for the navigation task, it is combined with Rec-BERT, which is based on REVERIE. The result suggests that it performs better, especially for unseen cases, and that it is also useful for visualizing what the agent attends to, and when, while it navigates a complex indoor environment.
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110785