Dynamic interactive learning network for audio-visual event localization

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), 2023-12, Vol. 53 (24), p. 30431-30442
Main Authors: Chen, Jincai, Liang, Han, Wang, Ruili, Zeng, Jiangfeng, Lu, Ping
Format: Article
Language: English
Description
Summary: Audio-visual event (AVE) localization aims to detect whether an event exists in each video segment and to predict its category. An event is recognized as an AVE only when it is both audible and visible. However, the information from the auditory and visual modalities is sometimes asymmetrical within a video sequence, leading to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network that dynamically explores intra- and inter-modal relationships, conditioning each modality on the other for better AVE localization. Specifically, our approach involves a dynamic fusion attention module over intra- and inter-modal features, enabling the auditory and visual modalities to focus more on regions the other modality deems informative and less on regions the other modality considers noise. In addition, we introduce an audio-visual difference loss to reduce the distance between auditory and visual representations. Extensive experimental results on the AVE dataset demonstrate the superior performance of our proposed method. The source code will be available at https://github.com/hanliang/DILN .
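
The abstract's two technical ingredients, attention guided by the other modality and an audio-visual difference loss, can be sketched concretely. The PyTorch snippet below is a minimal, hypothetical illustration only: the class and function names, the queries-from-the-other-modality attention form, and the MSE-on-normalized-features loss are assumptions made for exposition, not the authors' released implementation (see the repository linked above for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Hypothetical sketch: re-weight one modality's segment features
    using queries from the other modality, so regions the other modality
    finds informative are emphasized and its noise is suppressed."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     (B, T, D) features of the modality being updated
        # other: (B, T, D) features of the guiding modality
        q = self.query(other)                  # queries from the other modality
        k, v = self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                        # other-guided re-weighting of x

def audio_visual_difference_loss(a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Assumed form: pull L2-normalized audio and visual embeddings
    together with MSE; the paper's exact formulation may differ."""
    return F.mse_loss(F.normalize(a, dim=-1), F.normalize(v, dim=-1))

# Toy usage: 10 one-second segments, 256-d features per modality.
B, T, D = 2, 10, 256
audio, visual = torch.randn(B, T, D), torch.randn(B, T, D)
a_guided_by_v = CrossModalAttention(D)   # audio updated under visual guidance
v_guided_by_a = CrossModalAttention(D)   # visual updated under audio guidance
audio_att = a_guided_by_v(audio, visual)
visual_att = v_guided_by_a(visual, audio)
loss = audio_visual_difference_loss(audio_att.mean(dim=1), visual_att.mean(dim=1))

Whether the two attention directions share parameters, and how the difference loss is weighted against the localization objective, are design choices the sketch leaves open.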
ISSN: 0924-669X, 1573-7497
DOI: 10.1007/s10489-023-05146-7