Supervised Contrastive Learning for Robust and Efficient Multi-modal Emotion and Sentiment Analysis
Format: Conference Proceeding
Language: English
Summary: Human emotion and sentiment are often expressed multi-modally, through spoken speech, vision, and text. Combining multiple modalities allows learning-based models to exploit the complementary information present across modalities and produce more accurate predictions. One of the bigger challenges in multi-modal affective computing is performance consistency in non-ideal scenarios: most benchmark models fail to generalize when one of the modalities is missing or heavily corrupted due to occlusion, sensor errors, or a change of orientation. Various modality-fusion approaches have been proposed in response, but most of them assume that each modality is equally useful. To address the challenge of performance consistency, in this work we propose to use supervised contrastive learning (SCL). We demonstrate through various experiments and comparisons with state-of-the-art (SOTA) methods that robustness against corrupted and missing modalities improves when the model is trained with SCL. Next, we use the Perceiver architecture [1] to combine the representations of the different modalities efficiently. Its iterative attention mechanism produces a reduced latent representation at low cost, and we observe that it accommodates a wide range of modality combinations, allowing for robust information fusion. Our approach reduces model complexity and fuses the different modalities efficiently while maintaining performance consistency and model robustness. We conduct ablation experiments to study the effect of each contribution in different scenarios, and we show that the proposed methods outperform the state of the art while remaining robust to corrupted modalities. Our method also outperforms its counterparts and the SOTA at lower computational cost (inference time and compute operations).
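The record does not include an implementation; the following is a minimal sketch of a supervised contrastive objective in the spirit of the SCL training the summary describes (following Khosla et al., 2020), assuming PyTorch, a batch of fused multi-modal embeddings, and integer emotion/sentiment labels. The function name `supcon_loss` and the temperature default are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over a batch of embeddings.

    features: (B, D) fused multi-modal embeddings.
    labels:   (B,) integer class labels; samples sharing a label
              are treated as positives for each other.
    """
    # Cosine similarities between all pairs, scaled by temperature.
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                          # (B, B)

    # Exclude self-similarity from numerator and denominator.
    b = z.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))

    # Positive pairs: same label, excluding self.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye

    # Log-probability of each sample against all others, averaged over
    # that sample's positives (samples with no positive contribute 0).
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_sample = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_sample.mean()
```

Pulling same-class embeddings together regardless of which modalities produced them is the mechanism the summary credits for robustness: the representation stays discriminative even when one modality is corrupted or absent.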
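Likewise, a rough sketch of the Perceiver-style fusion step [1], assuming each modality's token sequence has already been projected to a common dimension. A small learned latent array cross-attends to the concatenated modality tokens over several iterations, so fusion cost scales with the latent size rather than the total input length. The class, its parameter names, and the default sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PerceiverFusion(nn.Module):
    """Minimal Perceiver-style fusion of modality token sequences."""

    def __init__(self, dim: int = 256, num_latents: int = 32,
                 num_iters: int = 4, heads: int = 4):
        super().__init__()
        # Learned latent array shared across the batch.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                nn.GELU(), nn.Linear(dim, dim))
        self.num_iters = num_iters

    def forward(self, text_tok, audio_tok, video_tok):
        # Each input: (batch, seq_len_m, dim). A missing modality can
        # simply be omitted from the concatenation, which is one way to
        # accommodate varying modality combinations.
        inputs = torch.cat([text_tok, audio_tok, video_tok], dim=1)
        z = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        # Iterative attention: latents repeatedly query the inputs.
        for _ in range(self.num_iters):
            attn_out, _ = self.cross_attn(z, inputs, inputs)
            z = z + attn_out
            z = z + self.ff(z)
        return z.mean(dim=1)  # pooled joint representation
```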
ISSN: 2831-7475
DOI: 10.1109/ICPR56361.2022.9956637