Trimodal Navigable Region Segmentation Model: Grounding Navigation Instructions in Urban Areas
Published in: IEEE Robotics and Automation Letters, 2024-05, Vol. 9 (5), p. 1-8
Format: Article
Language: English
Summary: In this study, we develop a model that enables mobility systems to interact more naturally with users. Specifically, we focus on the referring navigable regions task, in which a model grounds navigable regions of the road using the mobility system's camera image and natural language navigation instructions. This task is challenging because it requires vision-and-language comprehension in rapidly changing environments shared with other mobility systems. The performance of existing methods is insufficient, partly because they do not consider features related to scene context, such as semantic segmentation information; it is therefore important to incorporate these features into a multimodal encoder. In this study, we propose a trimodal (language, image, and mask) encoder-decoder model called the Trimodal Navigable Region Segmentation Model. We introduce the Text-Mask Encoder Block to process semantic segmentation masks and the Day-Night Classification Branch to balance the input modalities. We validated our model on the Talk2Car-RegSeg dataset. The results demonstrated that our method outperformed the baseline method on standard metrics.
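The abstract describes fusing three modalities (language, image, and segmentation mask) with a day-night branch that balances their contributions. As a rough illustration only, the sketch below shows one way such gated trimodal fusion could look; all names, dimensions, and the sigmoid gate are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative).
text_feat = rng.standard_normal(16)   # from a language encoder
image_feat = rng.standard_normal(16)  # from a visual encoder
mask_feat = rng.standard_normal(16)   # from a segmentation-mask encoder

def day_night_gate(image_feat):
    """Toy stand-in for a Day-Night Classification Branch:
    maps image features to a scalar in (0, 1) for re-weighting."""
    return 1.0 / (1.0 + np.exp(-image_feat.mean()))

def fuse_trimodal(text_feat, image_feat, mask_feat):
    g = day_night_gate(image_feat)
    # Re-weight image and mask features with the gate, then
    # concatenate all three modalities for a downstream decoder.
    return np.concatenate([text_feat, g * image_feat, (1.0 - g) * mask_feat])

fused = fuse_trimodal(text_feat, image_feat, mask_feat)
print(fused.shape)  # (48,)
```

A real model would learn the gate and fusion jointly; this sketch only conveys the shape of the idea (per-modality encoders, a scalar balancing signal, and a concatenated representation).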
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3376957