Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, pp. 1248-1260
Main Authors:
Format: Article
Language: English
Summary: The task of retrieving audio content relevant to lyric queries, and vice versa, plays a critical role in music-oriented applications. This requires learning robust feature representations for both modalities, and interactions between the modalities must be captured at a fine-grained level. Existing approaches can extract effective modal representations and retrieve across modalities through alignment, but they model the interactions between audio and lyrics in a coarse-grained manner: in particular, the interactions between the input features and the enhanced representations produced by the alignment module are largely ignored, resulting in low-quality modality representations for final retrieval. This paper presents a novel method named CMRF that accomplishes cross-modal interaction via a reinforcement feedback procedure to learn high-quality multi-modal embeddings. First, representations from the two modalities are implicitly aligned via directional pairwise cross-modal attention. Then, the method recurrently identifies the pivotal components of these high-level features and makes them interact with the original input features via reinforcement learning, improving the quality of the multi-modal embeddings. In addition, we introduce AL-song, a new dataset of paired audio and corresponding lyrics for the audio-lyrics retrieval task. Empirical results on AL-song and the benchmark dataset SoundDescs demonstrate the effectiveness and efficiency of CMRF compared with state-of-the-art methods.
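
The summary names directional pairwise cross-modal attention as the first stage of CMRF, but the record contains no implementation detail. The following is therefore only a minimal sketch of that general technique in PyTorch, assuming standard multi-head attention with a residual connection; every class, dimension, and variable name is a hypothetical illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of directional pairwise cross-modal attention.
# Nothing here is taken from the CMRF paper; it only illustrates the
# general technique named in the summary above.
import torch
import torch.nn as nn

class DirectionalCrossModalAttention(nn.Module):
    """One attention direction: modality A queries attend to modality B."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor,
                context_feats: torch.Tensor) -> torch.Tensor:
        # Tokens of modality A (queries) attend over tokens of modality B
        # (keys/values), yielding A-representations enriched with B context.
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual + layer norm

# Toy usage: enhance each modality with the other, one module per direction.
dim = 256
audio = torch.randn(4, 100, dim)   # (batch, audio frames, feature dim)
lyrics = torch.randn(4, 50, dim)   # (batch, lyric tokens, feature dim)

audio_with_lyrics = DirectionalCrossModalAttention(dim)(audio, lyrics)
lyrics_with_audio = DirectionalCrossModalAttention(dim)(lyrics, audio)
```

The two directions together form the pairwise interaction: audio is enhanced with lyric context and lyrics with audio context. The reinforcement-feedback stage that the summary describes next would operate on these enhanced features, but the record provides too little detail to sketch it faithfully.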
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2024.3358048