Towards Asynchronous Multimodal Signal Interaction and Fusion via Tailored Transformers
Published in: IEEE Signal Processing Letters, 2024, Vol. 31, pp. 1550-1554
Main Authors:
Format: Article
Language: English
Summary: Signals from human expressions are usually multimodal, spanning natural language, facial gestures, and acoustic behaviors. A key challenge is how to fuse multimodal time-series signals that are temporally asynchronous. To this end, we present a Transformer-driven Signal Interaction and Fusion (TSIF) approach to effectively model asynchronous multimodal signal sequences. TSIF consists of a linear transformer module and a cross-modal transformer module with distinct roles. The linear transformer module efficiently performs global interaction across the multimodal signals; its key idea is to replace dot-product similarity with an exponential kernel while achieving linear complexity through a low-rank matrix decomposition. Targeting the language modality, the cross-modal transformer module captures reliable element-level correlations among the distinct signals and mitigates noise interference in the audio and visual modalities. Extensive experiments on two multimodal benchmarks show that TSIF performs comparably to or better than previous state-of-the-art models with lower space and time complexity, and a systematic analysis confirms the effectiveness of the proposed modules. (A sketch of the two modules is given below the record.)
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2024.3409211
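The summary describes two components: a linear transformer that swaps dot-product similarity for an exponential kernel evaluated at linear cost via a low-rank approximation, and a cross-modal transformer in which the language modality attends over the audio and visual streams. The paper's exact formulation is not reproduced in this record, so the PyTorch sketch below is only illustrative: the Performer-style (FAVOR+) random-feature approximation of the exponential kernel, all function names, and all shapes and hyperparameters are assumptions, not the authors' implementation.

```python
import torch

def exp_kernel_features(x, proj, eps=1e-6):
    """Positive random features whose inner product approximates exp(q . k / sqrt(d)).

    A Performer-style (FAVOR+) stand-in for the paper's low-rank decomposition
    of the exponential kernel, not the authors' method.
    x: (batch, seq, d); proj: (m, d) with rows drawn from N(0, I).
    """
    x = x / (x.shape[-1] ** 0.25)                    # split the 1/sqrt(d) scaling between q and k
    xw = x @ proj.t()                                # (batch, seq, m)
    sq = (x ** 2).sum(dim=-1, keepdim=True) / 2.0    # ||x||^2 / 2 correction term
    return torch.exp(xw - sq) / (proj.shape[0] ** 0.5) + eps

def linear_attention(q, k, v, proj):
    """Kernelized attention in O(n*m*d) rather than O(n^2*d): contract keys with values first."""
    q_f = exp_kernel_features(q, proj)               # (b, n, m)
    k_f = exp_kernel_features(k, proj)               # (b, n, m)
    kv = torch.einsum('bnm,bnd->bmd', k_f, v)        # aggregate values against key features
    norm = torch.einsum('bnm,bm->bn', q_f, k_f.sum(dim=1)) + 1e-6
    return torch.einsum('bnm,bmd->bnd', q_f, kv) / norm.unsqueeze(-1)

def cross_modal_attention(lang, other, w_q, w_k, w_v):
    """Language supplies the queries; an audio or visual stream supplies keys and values,
    so audio-visual elements are reweighted by their relevance to the text."""
    q, k, v = lang @ w_q, other @ w_k, other @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Toy usage on random features; all dimensions are arbitrary assumptions.
b, n_text, n_audio, d, m = 2, 20, 50, 32, 64
text  = torch.randn(b, n_text,  d)
audio = torch.randn(b, n_audio, d)
proj  = torch.randn(m, d)                            # random projection defining the kernel features
fused_global = linear_attention(text, text, text, proj)             # (2, 20, 32)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
fused_cross  = cross_modal_attention(text, audio, w_q, w_k, w_v)    # (2, 20, 32)
```

The point of the sketch is the order of contraction in `linear_attention`: computing the key-value summary before touching the queries keeps cost linear in sequence length, which is what makes global interaction over long, asynchronous multimodal sequences affordable. The cross-modal step then uses ordinary softmax attention, but only with text as queries, which is one plausible reading of how targeting the language modality helps suppress audio-visual noise.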