
Cross-Modal Interaction via Reinforcement Feedback for Audio-Lyrics Retrieval


Bibliographic Details
Published in:IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol.32, p.1248-1260
Main Authors: Zhou, Dong, Lei, Fang, Li, Lin, Zhou, Yongmei, Yang, Aimin
Format: Article
Language:English
Description
Summary:The task of retrieving audio content relevant to lyric queries, and vice versa, plays a critical role in music-oriented applications. In this process, robust feature representations must be learned for both modalities, and the interactions between them should be captured at a fine-grained level. Existing approaches can effectively extract modal representations and retrieve across modalities through alignment. However, they model the interactions between audio and lyrics in a coarse-grained manner: the input features, and the interactions among the enhanced representations produced by the alignment module, are largely ignored, resulting in low-quality modality representations for final retrieval. This paper presents a novel method named CMRF that performs cross-modal interaction via a reinforcement feedback procedure to learn high-quality multi-modal embeddings. First, representations from the two modalities are implicitly fused via directional pairwise cross-modal attention. Then, the approach recurrently identifies pivotal constituents within these high-level features and lets them interact with the primary input features via reinforcement learning, improving the quality of the multi-modal embeddings. In addition, we introduce a novel audio-lyrics dataset, AL-song, which consists of paired audio and corresponding lyrics for the audio-lyrics retrieval task. Experimental results on AL-song and the benchmark SoundDescs dataset demonstrate the effectiveness and efficiency of CMRF compared with state-of-the-art methods.
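To make the "directional pairwise cross-modal attention" step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each modality attends to the other in its own direction, and a residual connection keeps the primary input features available for the later feedback stage. The module name, dimensions, and residual/normalization wiring are illustrative assumptions.

# Minimal sketch of directional pairwise cross-modal attention
# between audio and lyrics features (assumed already projected
# to a shared dimension d_model). Not CMRF's actual code.
import torch
import torch.nn as nn

class DirectionalCrossModalAttention(nn.Module):
    """Audio attends to lyrics, and lyrics attend to audio (two directions)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One attention module per direction: the query modality
        # attends over the keys/values of the other modality.
        self.audio_to_lyrics = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lyrics_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_l = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, lyrics: torch.Tensor):
        # audio:  (batch, T_audio, d_model)   frame-level audio features
        # lyrics: (batch, T_lyrics, d_model)  token-level lyric features
        a_enh, _ = self.audio_to_lyrics(query=audio, key=lyrics, value=lyrics)
        l_enh, _ = self.lyrics_to_audio(query=lyrics, key=audio, value=audio)
        # Residual connections preserve the primary input features,
        # which a feedback stage could then re-weight against the
        # enhanced representations.
        return self.norm_a(audio + a_enh), self.norm_l(lyrics + l_enh)

if __name__ == "__main__":
    block = DirectionalCrossModalAttention()
    audio = torch.randn(2, 100, 256)   # e.g., 100 audio frames
    lyrics = torch.randn(2, 40, 256)   # e.g., 40 lyric tokens
    a, l = block(audio, lyrics)
    print(a.shape, l.shape)  # (2, 100, 256) (2, 40, 256)

The reinforcement feedback stage described in the abstract would sit on top of such a block, selecting salient parts of the enhanced outputs to interact again with the raw inputs; its reward design is specific to the paper and is not reproduced here.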
ISSN:2329-9290
2329-9304
DOI:10.1109/TASLP.2024.3358048