Speech/music classification using phase-based and magnitude-based features
Published in: Speech Communication, 2022-07, Vol. 142, pp. 34-48
Main Authors:
Format: Article
Language: English
Summary: Detection of speech and music is an essential preprocessing step for many high-level audio-based applications such as speaker diarization and music information retrieval. Researchers have previously used various magnitude-based features for this task; in comparison, the phase spectrum has received less attention. The phase of a signal is believed to carry non-trivial information that can help determine its audio class. This work explores three existing phase-based features for speech vs. music classification. The potential of phase information is highlighted through statistical significance tests and canonical correlation analyses. The proposed approach is benchmarked against four baseline magnitude-based feature sets. This work also contributes an annotated audio dataset named Movie-MUSNOMIX, 8 h and 20 min in duration and comprising seven audio classes, including speech and music. The Movie-MUSNOMIX dataset and widely used public datasets such as MUSAN, GTZAN, Scheirer–Slaney, and Muspeak are used for performance evaluation. In combination with magnitude-based features, phase-based features consistently improve upon the baseline performance for the datasets used. Moreover, various combinations of phase- and magnitude-based features show satisfactory generalization capability over the two datasets. The performance of phase-based features in identifying speech and music signals corrupted by different environmental noises at various SNR levels is also reported. Finally, a preliminary study on the efficacy of phase-based features in segmenting continuous sequences of speech and music signals is provided. The code used in this work and the contributed dataset have been made freely available.
Highlights:
• This work explores phase information in the task of speech vs. music classification.
• Statistical significance and generalization ability of phase features are reported.
• Performance with signals corrupted with noises at various SNR levels is reported.
• An 8 h and 20 min Hindi movie audio dataset with seven audio classes is contributed.
• Effectiveness in segmenting sequences of speech and music signals is also studied.
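As a rough illustration of the kind of features the abstract and highlights describe, the sketch below extracts one magnitude-based and one phase-based descriptor per frame from an audio signal using a short-time Fourier transform. This is not the authors' implementation: the function name, frame parameters, and the specific descriptors (frame log-energy and a frame-to-frame unwrapped-phase difference) are assumptions chosen only to make the magnitude/phase distinction concrete.

```python
# Illustrative sketch (not the paper's code): magnitude- and phase-based
# frame descriptors from an STFT, as a stand-in for the features discussed.
import numpy as np
from scipy.signal import stft

def magnitude_phase_features(x, fs, frame_len=512, hop=256):
    """Return simple per-frame magnitude and phase descriptors (illustrative only)."""
    # STFT: Zxx has shape (n_freq_bins, n_frames)
    _, _, Zxx = stft(x, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)

    magnitude = np.abs(Zxx)        # magnitude spectrum
    phase = np.angle(Zxx)          # wrapped phase spectrum

    # Magnitude-based descriptor: per-frame log energy
    log_energy = np.log(np.sum(magnitude ** 2, axis=0) + 1e-12)

    # Phase-based descriptor: mean absolute frame-to-frame difference of the
    # unwrapped phase (loosely related to instantaneous-frequency deviation)
    unwrapped = np.unwrap(phase, axis=1)
    phase_delta = np.mean(np.abs(np.diff(unwrapped, axis=1)), axis=0)

    # diff() drops one frame, so align the two descriptors before stacking
    return np.vstack([log_energy[1:], phase_delta]).T   # shape (n_frames - 1, 2)

if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(2 * fs)   # two seconds of noise as a stand-in signal
    print(magnitude_phase_features(x, fs).shape)
```

In a speech vs. music classifier, frame-level descriptors of this kind would typically be aggregated over a segment and passed to a statistical classifier; the exact phase-based features and classifier used in the paper are not specified in this record.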
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2022.06.005