
Towards Asynchronous Multimodal Signal Interaction and Fusion via Tailored Transformers

Bibliographic Details
Published in: IEEE Signal Processing Letters, 2024, Vol. 31, pp. 1550-1554
Main Authors: Yang, Dingkang, Kuang, Haopeng, Yang, Kun, Li, Mingcheng, Zhang, Lihua
Format: Article
Language: English
Description
Summary: The signals from human expressions are usually multimodal, including natural language, facial gestures, and acoustic behaviors. A key challenge is how to fuse multimodal time-series signals with temporal asynchrony. To this end, we present a Transformer-driven Signal Interaction and Fusion (TSIF) approach to effectively model asynchronous multimodal signal sequences. TSIF consists of linear and cross-modal transformer modules with different duties. The linear transformer module efficiently performs global interaction across multimodal signals; its key idea is to replace the dot-product similarity with an exponential kernel while achieving linear complexity through a low-rank matrix decomposition. By taking the language modality as the target, the cross-modal transformer module captures reliable element correlations among distinct signals and mitigates noise interference in the audio and visual modalities. Extensive experiments on two multimodal benchmarks show that TSIF matches or outperforms previous state-of-the-art models with lower space-time complexity. A systematic analysis further confirms the effectiveness of the proposed modules.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2024.3409211
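
The abstract above describes two mechanisms: a linear transformer module that swaps dot-product similarity for an exponential kernel and reaches linear complexity via a low-rank decomposition, and a cross-modal transformer module in which the language modality queries the audio and visual streams. The following is a minimal PyTorch sketch of those two ideas only, written under assumptions of our own: the random-feature kernel map, the projection, the module names, and all dimensions are illustrative and are not taken from the paper.

    import torch
    import torch.nn as nn

    def exp_kernel_features(x, proj):
        # Random-feature map approximating an exponential (softmax-style) kernel.
        # x: (batch, seq, dim); proj: (dim, rank) fixed low-rank projection.
        x_proj = x @ proj                                   # (batch, seq, rank)
        norm = (x ** 2).sum(dim=-1, keepdim=True) / 2.0     # stabiliser
        return torch.exp(x_proj - norm)                     # non-negative kernel features

    class LinearAttention(nn.Module):
        # Global interaction in O(n): phi(Q) @ (phi(K)^T V) instead of softmax(QK^T) V.
        def __init__(self, dim, rank=64):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            self.register_buffer("proj", torch.randn(dim, rank) / dim ** 0.5)

        def forward(self, x):
            q = exp_kernel_features(self.q(x), self.proj)   # (b, n, r)
            k = exp_kernel_features(self.k(x), self.proj)   # (b, n, r)
            v = self.v(x)                                   # (b, n, d)
            kv = torch.einsum("bnr,bnd->brd", k, v)         # (b, r, d), cost linear in n
            z = (q @ k.sum(dim=1).unsqueeze(-1)).clamp(min=1e-6)  # (b, n, 1) normaliser
            return torch.einsum("bnr,brd->bnd", q, kv) / z

    class CrossModalAttention(nn.Module):
        # Language tokens query an asynchronous audio or visual sequence.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, lang, other):
            out, _ = self.attn(query=lang, key=other, value=other)
            return out                                      # aligned to the language length

    if __name__ == "__main__":
        b, d = 2, 32
        lang = torch.randn(b, 50, d)      # e.g. 50 word-level features
        audio = torch.randn(b, 400, d)    # e.g. 400 frame-level acoustic features
        fused = CrossModalAttention(d)(LinearAttention(d)(lang), audio)
        print(fused.shape)                # torch.Size([2, 50, 32])

Note that this is only a sketch of the generic techniques named in the abstract; the exact kernel construction, low-rank decomposition, and fusion strategy used by TSIF are specified in the paper itself.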