Multimodal Target Localization With Landmark-Aware Positioning for Urban Mobility

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, 2025-01, Vol. 10 (1), p. 716-723
Main Authors: Hosomi, Naoki; Iioka, Yui; Hatanaka, Shumpei; Misu, Teruhisa; Yamada, Kentaro; Tsukamoto, Nanami; Kobayashi, Shunsuke; Sugiura, Komei
Format: Article
Language: English
Description
Summary: Advancements in vehicle automation technology are expected to significantly impact how humans interact with vehicles. In this study, we propose a method to create user-friendly control interfaces for autonomous vehicles in urban environments. The proposed model predicts the vehicle's destination in images captured by the vehicle's cameras, based on high-level navigation instructions. Our data analysis found that users often specify the destination by referring to the relative positions of landmarks in a scene. The task is challenging because users can specify arbitrary destinations on roads, which lack distinct visual characteristics for prediction. Thus, the model should consider the relationships between landmarks and the ideal stopping position. Existing approaches model only the relationships between instructions and destinations and do not explicitly model the relative positional relationships between landmarks and destinations. To address this limitation, the proposed Target Regressor in Positioning (TRiP) model includes a novel loss function, the Landmark-aware Absolute-Relative Target Position Loss, and two novel modules, the Target Position Localizer and the Multi-Resolution Referring Expression Comprehension Feature Extractor. To validate TRiP, we built a new dataset by extending an existing referring expression comprehension dataset. The model was evaluated on this dataset using a standard metric, and the results showed that TRiP significantly outperformed the baseline method.
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3511404
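
The abstract describes a loss that ties the predicted stopping position to both its absolute location and its position relative to landmarks in the scene. As a rough illustration only (this is not the authors' implementation: the tensor shapes, the use of smooth L1, the direction/distance decomposition, and the weighting factors are all assumptions), a combined absolute-plus-landmark-relative position loss might be sketched in PyTorch as follows:

```python
# Hypothetical sketch of an absolute + landmark-relative position loss.
# NOT the TRiP implementation: the shapes, smooth-L1 terms, the
# direction/distance split, and the alpha/beta weights are assumptions.
import torch
import torch.nn.functional as F

def landmark_aware_position_loss(pred_xy, gt_xy, landmark_xy,
                                 alpha=1.0, beta=1.0, eps=1e-6):
    """
    pred_xy:     (B, 2)    predicted stopping position in image coordinates
    gt_xy:       (B, 2)    ground-truth stopping position
    landmark_xy: (B, K, 2) positions of K landmarks in the same coordinates
    """
    # Absolute term: direct error between predicted and ground-truth position.
    abs_term = F.smooth_l1_loss(pred_xy, gt_xy)

    # Relative term: for each landmark, the direction and distance from the
    # landmark to the predicted position should match those to the ground truth.
    pred_off = pred_xy.unsqueeze(1) - landmark_xy      # (B, K, 2)
    gt_off = gt_xy.unsqueeze(1) - landmark_xy          # (B, K, 2)
    pred_dist = pred_off.norm(dim=-1, keepdim=True)    # (B, K, 1)
    gt_dist = gt_off.norm(dim=-1, keepdim=True)        # (B, K, 1)
    dir_term = F.smooth_l1_loss(pred_off / (pred_dist + eps),
                                gt_off / (gt_dist + eps))
    dist_term = F.smooth_l1_loss(pred_dist, gt_dist)

    return alpha * abs_term + beta * (dir_term + dist_term)
```

Splitting the relative term into a direction and a distance component keeps it from collapsing into the absolute term (a plain offset difference would cancel the landmark coordinates out), so errors are penalized with respect to each landmark's position; the actual formulation of the Landmark-aware Absolute-Relative Target Position Loss should be taken from the paper itself.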