Urban Sound & Sight: Dataset And Benchmark For Audio-Visual Urban Scene Understanding

Bibliographic Details
Main Authors: Fuentes, Magdalena; Steers, Bea; Zinemanas, Pablo; Rocamora, Martin; Bondi, Luca; Wilkins, Julia; Shi, Qianyi; Hou, Yao; Das, Samarjit; Serra, Xavier; Bello, Juan Pablo
Format: Conference Proceeding
Language: English
Description
Summary: Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet the lack of well-curated resources for training and evaluating models in this area hinders their development. To address this, we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with classes and unique IDs of vehicles, and strong audio labels featuring vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.
ISSN: 2379-190X
DOI: 10.1109/ICASSP43922.2022.9747644