Coarse-to-fine fusion for language grounding in 3D navigation
Published in: Knowledge-Based Systems, 2023-10, Vol. 277, Article 110785
Main Authors: , , ,
Format: Article
Language: English
Summary: We present a new network whereby an agent navigates a 3D environment to find a target object according to a language-based instruction. The task is challenging because the agent must understand the instruction correctly and take a series of actions to locate the target among other objects without colliding with obstacles. The essence of our proposed network is a coarse-to-fine fusion model that fuses language and vision, together with an autoencoder that encodes visual information effectively. An asynchronous reinforcement learning algorithm then coordinates detailed actions to complete the navigation task. Extensive evaluation on three difficulty levels of the navigation task in the 3D ViZDoom environment suggests that our model outperforms the state of the art. To see whether the proposed network can handle a real-world 3D environment, it is combined with Rec-BERT, which is based on REVERIE. The results suggest that it performs better, especially on unseen cases, and it is also useful to visualize what and when the agent pays attention to while navigating a complex indoor environment.
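The record does not specify the fusion module's internals; the following is a minimal NumPy sketch of one plausible two-stage (coarse-to-fine) language-vision fusion scheme, not the authors' implementation. All function names, shapes, and the gating/attention choices are assumptions for illustration: a coarse stage gates visual regions by their similarity to a sentence-level embedding, and a fine stage applies word-level attention over the gated regions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def coarse_to_fine_fusion(word_feats, sent_feat, vis_feats):
    """Fuse a language instruction with visual features in two stages.

    word_feats: (T, d) per-word embeddings (hypothetical shapes)
    sent_feat:  (d,)   sentence-level embedding
    vis_feats:  (N, d) visual region features
    Returns a (d,) fused feature vector.
    """
    # Coarse stage: sigmoid-gate each visual region by its
    # similarity to the whole-sentence embedding.
    coarse_scores = vis_feats @ sent_feat            # (N,)
    gates = 1.0 / (1.0 + np.exp(-coarse_scores))     # (N,)
    gated = vis_feats * gates[:, None]               # (N, d)

    # Fine stage: word-level attention over the gated regions,
    # averaged across words, then used to pool the regions.
    attn = softmax((word_feats @ gated.T).mean(axis=0))  # (N,)
    return attn @ gated                              # (d,)
```

In a full model the fused vector would feed the policy network trained with asynchronous reinforcement learning; here it simply illustrates how a coarse sentence-level filter can precede fine word-level attention.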
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110785