Trimodal Navigable Region Segmentation Model: Grounding Navigation Instructions in Urban Areas
Published in: IEEE Robotics and Automation Letters, 2024-05, Vol. 9 (5), p. 1-8
Format: Article
Language: English
Summary: In this study, we develop a model that enables mobility systems to interact more naturally with users. Specifically, we focus on the referring navigable regions task, in which a model grounds navigable regions of the road using the mobility system's camera image and natural language navigation instructions. This task is challenging because it requires vision-and-language comprehension in rapidly changing environments shared with other mobility systems. The performance of existing methods is insufficient, partly because they do not consider features related to scene context, such as semantic segmentation information; it is therefore important to incorporate these features into a multimodal encoder. In this study, we propose a trimodal (language, image, and mask) encoder-decoder model called the Trimodal Navigable Region Segmentation Model. We introduce the Text-Mask Encoder Block to process semantic segmentation masks and the Day-Night Classification Branch to balance the input modalities. We validated our model on the Talk2Car-RegSeg dataset. The results demonstrated that our method outperformed the baseline method on standard metrics.
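The abstract describes fusing three modalities (language, image, and segmentation mask) with a day-night branch that balances their contributions. As a rough illustration only, the sketch below shows one way such gated trimodal fusion could look; all names, dimensions, and the sigmoid gate are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative).
text_feat = rng.standard_normal(16)   # from a language encoder
image_feat = rng.standard_normal(16)  # from a visual encoder
mask_feat = rng.standard_normal(16)   # from a segmentation-mask encoder

def day_night_gate(image_feat):
    """Toy stand-in for a Day-Night Classification Branch:
    maps image features to a scalar in (0, 1) for re-weighting."""
    return 1.0 / (1.0 + np.exp(-image_feat.mean()))

def fuse_trimodal(text_feat, image_feat, mask_feat):
    g = day_night_gate(image_feat)
    # Re-weight image and mask features with the gate, then
    # concatenate all three modalities for a downstream decoder.
    return np.concatenate([text_feat, g * image_feat, (1.0 - g) * mask_feat])

fused = fuse_trimodal(text_feat, image_feat, mask_feat)
print(fused.shape)  # (48,)
```

A real model would learn the gate and fusion jointly; this sketch only conveys the shape of the idea (per-modality encoders, a scalar balancing signal, and a concatenated representation).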
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3376957