Look, Listen and Pay More Attention: Fusing Multi-Modal Information for Video Violence Detection

Bibliographic Details
Main Authors: Wei, Dong-Lai; Liu, Chen-Geng; Liu, Yang; Liu, Jing; Zhu, Xiao-Guang; Zeng, Xin-Hua
Format: Conference Proceeding
Language: English
Description
Summary: Violence detection is an essential and challenging problem in the computer vision community. Most existing works focus on single-modal data analysis, which is not effective when multi-modal data is available. Therefore, we propose a two-stage multi-modal information fusion method for violence detection: 1) the first stage adopts multiple instance learning strategies to refine video-level hard labels into clip-level soft labels, and 2) the second stage uses a multi-modal information fused attention module to fuse the modalities, and is trained with supervised learning on the soft labels generated in the first stage. Extensive empirical evidence on the XD-Violence dataset shows that our method outperforms state-of-the-art methods.
ISSN: 2379-190X
DOI: 10.1109/ICASSP43922.2022.9746422
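
The two stages named in the summary can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the record gives no architectural details, so every function name, the min-max normalisation used to produce soft labels, and the two-modality softmax attention are assumptions chosen only to show the general idea of turning one video-level label into clip-level soft labels and then training on an attention-weighted fusion of two feature streams.

```python
import math

def soft_labels(clip_scores, video_label):
    """Stage 1 (assumed form): spread a video-level hard label over clips.

    clip_scores are per-clip scores from some hypothetical MIL-trained scorer.
    Normal videos (label 0) give all-zero clip labels; violent videos get
    min-max normalised scores in [0, 1] as soft labels.
    """
    if video_label == 0:
        return [0.0] * len(clip_scores)
    lo, hi = min(clip_scores), max(clip_scores)
    if hi == lo:
        # No clip stands out, so every clip inherits the video label.
        return [1.0] * len(clip_scores)
    return [(s - lo) / (hi - lo) for s in clip_scores]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(visual, audio, w_visual, w_audio):
    """Stage 2 (assumed form): attention-weighted sum of two modalities.

    Each modality gets a scalar relevance score (dot product with a
    hypothetical learned weight vector); softmax turns the two scores into
    attention weights that mix the same-length feature vectors.
    """
    score_v = sum(f * w for f, w in zip(visual, w_visual))
    score_a = sum(f * w for f, w in zip(audio, w_audio))
    alpha_v, alpha_a = softmax([score_v, score_a])
    return [alpha_v * v + alpha_a * a for v, a in zip(visual, audio)]

def soft_bce(pred, target, eps=1e-7):
    """Binary cross-entropy that accepts a soft target in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

if __name__ == "__main__":
    labels = soft_labels([0.2, 0.8, 0.5], video_label=1)
    fused = fuse([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
    print(labels, fused, soft_bce(0.7, labels[2]))
```

In this sketch the soft labels from `soft_labels` serve as the per-clip targets for `soft_bce`, which is the sense in which the second stage is supervised by the output of the first.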