A Two-channel Attention Mechanism-based MobileNetV2 And Bidirectional Long Short Memory Network For Multi-modal Dimension Dance Emotion Recognition
Published in: Journal of Applied Science and Engineering, 2023-04, Vol. 26 (4), pp. 455–464
Format: Article
Language: English
Summary: In recent years, dance emotion recognition has become an active research topic in human-computer interaction and art design. Multi-modal dimensional emotion recognition can detect subtle emotional changes in motion, but it must address how to effectively fuse emotion information from different modalities. Aiming at the problems of effective feature extraction and modal synchronization in feature-level fusion, and the correlation of feature information across modalities in decision-level fusion, we propose a new method for multi-modal dimensional dance emotion recognition. It takes the lightweight MobileNetV2 as the backbone network and introduces an independently designed attention module that increases the weights of the salient feature maps learned by the convolutional layers, improving classification accuracy. Spatial information and channel information are fed into the model in parallel, and small 1×1 and 3×3 convolutions reduce the computational cost and complexity while keeping the structure lightweight. Audio, video, and multi-modal fusion models are constructed separately to perform deep feature learning on the information streams, whose outputs are finally fed into a two-channel bidirectional long short-term memory network to obtain the motion emotion prediction. Compared with other state-of-the-art methods, the proposed method captures emotion information in the high-level dimensions more effectively, achieves better recognition performance, and thus fuses audio and video information more successfully.
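The parallel channel/spatial attention design mentioned in the summary can be sketched as follows. This is a minimal NumPy illustration of the general idea (a channel gate and a spatial gate computed in parallel from the same feature map, then combined), not the authors' actual module; the pooling choices and the averaging of the two branches are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W); gate each channel by its global average response
    pooled = feat.mean(axis=(1, 2))            # (C,)
    weights = sigmoid(pooled)                  # (C,) per-channel gate
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # feat: (C, H, W); gate each location by its mean across channels
    pooled = feat.mean(axis=0)                 # (H, W)
    weights = sigmoid(pooled)                  # (H, W) per-location gate
    return feat * weights[None, :, :]

def dual_attention(feat):
    # Channel and spatial branches run in parallel on the same input,
    # then their outputs are averaged (combination rule assumed here).
    return 0.5 * (channel_attention(feat) + spatial_attention(feat))

feat = np.random.randn(32, 7, 7)   # toy feature map: 32 channels, 7x7 spatial
out = dual_attention(feat)
print(out.shape)                   # (32, 7, 7) — attention preserves shape
```

In the described architecture, a block like this would sit after MobileNetV2 convolutional layers to re-weight salient feature maps before the features flow into the two-channel BiLSTM.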
ISSN: 2708-9967; 2708-9975
DOI: 10.6180/jase.202304_26(4).0001