
Data Augmentation based Cross-Lingual Multi-Speaker TTS using DL with Sentiment Analysis

Bibliographic Details
Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, 2023-10
Main Authors: Lalitha B., Madhurima V., Nandakrishna Ch., Satish Babu Jampani, Chandra Sekhar J. N., Venkat Reddy P.
Format: Article
Language: English
Description
Summary: Text-to-Speech (TTS) algorithms have made tremendous strides in recent years in their ability to generate natural-sounding speech in a single language. However, because of a lack of available training data, synthesizing speech from the same person in multiple languages remains difficult: it is challenging to find people who are proficient in numerous languages at a level equivalent to native speakers. Voice conversion is one method that can be used to create a polyglot corpus and thereby address this problem. It relies on a voice representation model, trained on 53 different languages using hybrid deep learning, to capture speaker-invariant qualities. In this study, we present a novel approach to cross-lingual voice conversion that employs Generative Adversarial Networks (GANs) to train a multilingual TTS system. We introduce the concept of an individual likeness loss to address the particular difficulty of preserving a speaker's identity during training. The aim of this work is to make voice data drawn from a variety of languages and speakers sound as though it was produced by the same individual, and this loss is one way to do so. To determine the extent to which our model is useful, two experiments were carried out comparing it against benchmarks that use varying degrees of parameter sharing between languages. These experiments evaluated pronunciation accuracy as well as the quality of the synthetic voice during transitions between different languages.
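The record itself contains no implementation details, but the individual likeness (speaker identity) term described in the summary can be illustrated with a minimal PyTorch-style sketch. Everything below, including the speaker_encoder, discriminator, and the weighting factor lambda_id, is an illustrative assumption rather than the authors' actual model or code.

# Minimal sketch (assumed, not the paper's code) of combining a GAN adversarial
# loss with a speaker "likeness" loss that keeps synthesized speech close to the
# target speaker's embedding.
import torch
import torch.nn.functional as F

def likeness_loss(speaker_encoder, generated_audio, reference_audio):
    """Penalize drift from the target speaker: 1 - cosine similarity between
    speaker embeddings of the generated and the reference utterances."""
    emb_gen = speaker_encoder(generated_audio)   # (batch, embed_dim)
    emb_ref = speaker_encoder(reference_audio)   # (batch, embed_dim)
    return (1.0 - F.cosine_similarity(emb_gen, emb_ref, dim=-1)).mean()

def generator_objective(discriminator, speaker_encoder,
                        generated_audio, reference_audio, lambda_id=10.0):
    """Non-saturating GAN generator loss plus the weighted likeness term.
    lambda_id is a hypothetical hyperparameter balancing the two terms."""
    logits = discriminator(generated_audio)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return adv + lambda_id * likeness_loss(speaker_encoder,
                                           generated_audio, reference_audio)

In a full training loop, a term of this kind would typically be added only to the generator update, while the discriminator is trained with the usual real/fake objective.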
ISSN: 2375-4699, 2375-4702
DOI: 10.1145/3628428