Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Published in: arXiv.org, 2024-06
Main Authors:
Format: Article
Language: English
Summary: In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, we propose a novel two-stage automatic annotation pipeline in this paper. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance the prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator comprising pretrained encoders, a text-speech fusion scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while showing remarkable robustness to data scarcity.
ISSN: 2331-8422
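The summary outlines a two-stage design: contrastive pretraining on Speech-Silence and Word-Punctuation (SSWP) pairs, followed by a multi-modal annotator that fuses the pretrained encoders and classifies prosodic boundaries. The sketch below is only an illustration of that idea, not the authors' implementation: the encoder architectures, feature dimensions, fusion scheme, class inventory, and all module names are assumptions.

```python
# Illustrative sketch only: encoders, dimensions, and the 3-class boundary
# inventory (none / Prosodic Word / Prosodic Phrase) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SSWPContrastiveModel(nn.Module):
    """Stage 1: contrastive pretraining of paired speech-silence and
    word-punctuation segments in a shared latent space."""

    def __init__(self, speech_dim=80, text_vocab=1000, hidden=256, proj=128):
        super().__init__()
        # Speech branch: frame-level features (e.g. mel filterbanks) -> segment vector
        self.speech_enc = nn.GRU(speech_dim, hidden, batch_first=True)
        # Text branch: word/punctuation token ids -> segment vector
        self.text_emb = nn.Embedding(text_vocab, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.speech_proj = nn.Linear(hidden, proj)
        self.text_proj = nn.Linear(hidden, proj)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, speech, tokens):
        _, hs = self.speech_enc(speech)                    # (1, B, H)
        _, ht = self.text_enc(self.text_emb(tokens))       # (1, B, H)
        zs = F.normalize(self.speech_proj(hs[-1]), dim=-1)
        zt = F.normalize(self.text_proj(ht[-1]), dim=-1)
        return zs, zt

    def contrastive_loss(self, zs, zt):
        # Symmetric InfoNCE over in-batch negatives (CLIP-style)
        logits = zs @ zt.t() / self.temperature
        targets = torch.arange(zs.size(0), device=zs.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


class ProsodyAnnotator(nn.Module):
    """Stage 2: fuse the pretrained speech/text representations and classify
    each word position into a prosodic-boundary class."""

    def __init__(self, pretrained: SSWPContrastiveModel, n_classes=3, hidden=256):
        super().__init__()
        self.backbone = pretrained
        self.fuse = nn.GRU(2 * 128, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, speech_segs, token_segs):
        # speech_segs: (B, W, T_frames, speech_dim); token_segs: (B, W, T_tok)
        B, W = token_segs.shape[:2]
        zs, zt = self.backbone(speech_segs.flatten(0, 1), token_segs.flatten(0, 1))
        fused = torch.cat([zs, zt], dim=-1).view(B, W, -1)   # per-word fusion
        out, _ = self.fuse(fused)                            # sequence context
        return self.classifier(out)                          # (B, W, n_classes)


if __name__ == "__main__":
    # Toy shapes: 4 word-level SSWP pairs for stage 1
    model = SSWPContrastiveModel()
    zs, zt = model(torch.randn(4, 50, 80), torch.randint(0, 1000, (4, 6)))
    print("stage-1 loss:", model.contrastive_loss(zs, zt).item())

    # Stage 2: 2 utterances of 8 words each, boundary logits per word
    annotator = ProsodyAnnotator(model)
    logits = annotator(torch.randn(2, 8, 50, 80), torch.randint(0, 1000, (2, 8, 6)))
    print("stage-2 logits:", logits.shape)  # torch.Size([2, 8, 3])
```

The design choice sketched here is early fusion of the two modality embeddings followed by a bidirectional sequence model; the paper's actual fusion scheme and classifier may differ.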