Multimodal Sentimental Privileged Information Embedding for Improving Facial Expression Recognition
Published in: IEEE Transactions on Affective Computing, 2024-06, p. 1-12
Main Authors:
Format: Article
Language: English
Summary: Facial expression recognition (FER) has long been one of the key tasks in affective computing. Over the years, researchers have worked to improve FER performance by designing models with more powerful feature extraction, embedding attention mechanisms, and reconstructing missing information. Departing from these paradigms, we attempt to improve FER performance by using multimodal sentiment data, such as audio and text, as privileged information (PI) for facial images. To this end, a multimodal privileged information embedded facial expression recognition network (MPI-FER) is proposed in this paper. During the training phase, the model embeds the PI of the multimodal data for FER by learning cross-modality translation between the multimodal sentiment data. During the test phase, input images alone are sufficient for model inference to accomplish the FER task. MPI-FER is a large-scale, heterogeneous deep neural network; to train it effectively with limited training samples, we design a multi-stage training strategy of module-wise pre-training followed by end-to-end fine-tuning. In addition, a strategy for filling the multimodal sentiment quaternion is proposed so that our method can be applied to facial expression databases consisting only of face images. We conducted extensive experiments on two multimodal sentiment analysis databases (CH-SIMS and CMU-MOSI) and two in-the-wild FER databases (RAF-DB and AffectNet). The results show that embedding multimodal sentiment data as privileged information into image-based FER significantly improves FER accuracy. Furthermore, using only images in the test phase, the proposed method achieves better multimodal sentiment analysis results than methods based on multimodal sentiment data fusion. (See the illustrative sketch after this record.)
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2024.3415625
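
The summary describes training an image-only FER model with audio and text as privileged information, supervised through cross-modality translation, while inference uses images alone. Below is a minimal, hypothetical PyTorch sketch of that general paradigm, not the authors' MPI-FER implementation: the encoder, translation heads, feature dimensions, and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PrivilegedFER(nn.Module):
    """Sketch of FER trained with multimodal privileged information (PI).

    Only the image branch is used at test time; during training, auxiliary
    heads translate image features into audio/text embedding spaces so the
    privileged modalities supervise the image encoder. All sizes and module
    choices are illustrative, not taken from the paper.
    """

    def __init__(self, feat_dim=128, audio_dim=64, text_dim=64, num_classes=7):
        super().__init__()
        # Image encoder (stand-in for a CNN/transformer backbone).
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU()
        )
        # Expression classifier, used at both train and test time.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Cross-modality translation heads, used only during training.
        self.to_audio = nn.Linear(feat_dim, audio_dim)
        self.to_text = nn.Linear(feat_dim, text_dim)

    def forward(self, images):
        feats = self.image_encoder(images)
        return feats, self.classifier(feats)


def training_step(model, images, labels, audio_emb, text_emb, ce, mse, alpha=0.5):
    """Classification loss plus translation losses aligning image features
    with the privileged audio/text embeddings (hypothetical weighting)."""
    feats, logits = model(images)
    loss = ce(logits, labels)
    loss = loss + alpha * mse(model.to_audio(feats), audio_emb)
    loss = loss + alpha * mse(model.to_text(feats), text_emb)
    return loss


if __name__ == "__main__":
    model = PrivilegedFER()
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    images = torch.randn(8, 3, 64, 64)      # face images
    labels = torch.randint(0, 7, (8,))      # expression labels
    audio_emb = torch.randn(8, 64)          # privileged audio embeddings
    text_emb = torch.randn(8, 64)           # privileged text embeddings
    loss = training_step(model, images, labels, audio_emb, text_emb, ce, mse)
    loss.backward()
    # Test phase: images alone suffice, mirroring the image-only inference
    # described in the summary.
    with torch.no_grad():
        _, logits = model(images)
        preds = logits.argmax(dim=1)
    print(loss.item(), preds.shape)
```

At test time the translation heads are simply unused, so the deployed model carries no dependence on audio or text inputs; the paper's module-wise pre-training and end-to-end fine-tuning stages are not reproduced in this sketch.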