
Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval


Bibliographic Details
Published in:IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol.32, p.1248-1260
Main Authors: Zhou, Dong, Lei, Fang, Li, Lin, Zhou, Yongmei, Yang, Aimin
Format: Article
Language:English
Description
Summary:The task of retrieving audio content relevant to lyric queries, and vice versa, plays a critical role in music-oriented applications. In this process, robust feature representations must be learned for both modalities, and the interactions between them should be captured at a fine-grained level. Existing approaches can effectively extract modal representations and retrieve across modalities through alignment. However, they model the interactions between audio and lyrics in a coarse-grained manner: the input features, and the interactions among the enhanced representations produced by the alignment module, are largely ignored, resulting in low-quality modality representations for final retrieval. This paper presents a novel method named CMRF that performs cross-modal interaction via a reinforcement feedback procedure to learn high-quality multi-modal embeddings. First, representations from the two modalities are implicitly fused via directional pairwise cross-modal attention. Then, the approach recurrently identifies pivotal constituents within these high-level features and lets them interact with the primary input features via reinforcement learning, improving the quality of the multi-modal embeddings. In addition, we introduce a novel audio-lyrics dataset, AL-song, which consists of paired audio and corresponding lyrics for the audio-lyrics retrieval task. Experimental results on AL-song and the benchmark SoundDescs dataset demonstrate the effectiveness and efficiency of CMRF compared with state-of-the-art methods.
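To make the "directional pairwise cross-modal attention" step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each modality attends to the other in its own direction, and a residual connection keeps the primary input features available for the later feedback stage. The module name, dimensions, and residual/normalization wiring are illustrative assumptions.

# Minimal sketch of directional pairwise cross-modal attention
# between audio and lyrics features (assumed already projected
# to a shared dimension d_model). Not CMRF's actual code.
import torch
import torch.nn as nn

class DirectionalCrossModalAttention(nn.Module):
    """Audio attends to lyrics, and lyrics attend to audio (two directions)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One attention module per direction: the query modality
        # attends over the keys/values of the other modality.
        self.audio_to_lyrics = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lyrics_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_l = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, lyrics: torch.Tensor):
        # audio:  (batch, T_audio, d_model)   frame-level audio features
        # lyrics: (batch, T_lyrics, d_model)  token-level lyric features
        a_enh, _ = self.audio_to_lyrics(query=audio, key=lyrics, value=lyrics)
        l_enh, _ = self.lyrics_to_audio(query=lyrics, key=audio, value=audio)
        # Residual connections preserve the primary input features,
        # which a feedback stage could then re-weight against the
        # enhanced representations.
        return self.norm_a(audio + a_enh), self.norm_l(lyrics + l_enh)

if __name__ == "__main__":
    block = DirectionalCrossModalAttention()
    audio = torch.randn(2, 100, 256)   # e.g., 100 audio frames
    lyrics = torch.randn(2, 40, 256)   # e.g., 40 lyric tokens
    a, l = block(audio, lyrics)
    print(a.shape, l.shape)  # (2, 100, 256) (2, 40, 256)

The reinforcement feedback stage described in the abstract would sit on top of such a block, selecting salient parts of the enhanced outputs to interact again with the raw inputs; its reward design is specific to the paper and is not reproduced here.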
ISSN:2329-9290
2329-9304
DOI:10.1109/TASLP.2024.3358048