
A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality

Bibliographic Details
Main Authors: Ragano, Alessandro; Benetos, Emmanouil; Chinen, Michael; Martinez, Helard Becerra; Reddy, Chandan K A; Skoglund, Jan; Hines, Andrew
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 6
container_issue
container_start_page 1
container_title
container_volume
creator Ragano, Alessandro
Benetos, Emmanouil
Chinen, Michael
Martinez, Helard Becerra
Reddy, Chandan K A
Skoglund, Jan
Hines, Andrew
description Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors, but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models achieve the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models, since potential issues hidden in the data could bias the evaluated performance.
doi_str_mv 10.1109/ISSC59246.2023.10162088
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10162088</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10162088</ieee_id><sourcerecordid>10162088</sourcerecordid><originalsourceid>FETCH-LOGICAL-i253t-e0137dafa96da55a8ef50c61f15c2de5e8ac371f2055412548ff192a7eb1d3b23</originalsourceid><addsrcrecordid>eNo1z9FKwzAUgOEoCI65NxDMC3Sek_S0yeWoTguTKdXrkbUnLrK1JakXe3sv1Kv_7oNfiDuEJSLY-7ppKrIqL5YKlF4iYKHAmAuxsKU1mkDnQCVcipkqjMkwp_xaLFL6AgBFqJUtZqJeyWo4jS6GNPRy6-UD8yg37GIf-k_5sm3ka-QutNMQk1wPUTYjc3uQzbmfDpxCkm_f7him84248u6YePHXufhYP75Xz9lm-1RXq00WFOkpY0Bdds47W3SOyBn2BG2BHqlVHRMb1-oSvQKiHBXlxnu0ypW8x07vlZ6L2183MPNujOHk4nn3f69_AKMmThU</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality</title><source>IEEE Xplore All Conference Series</source><creator>Ragano, Alessandro ; Benetos, Emmanouil ; Chinen, Michael ; Martinez, Helard Becerra ; Reddy, Chandan K A ; Skoglund, Jan ; Hines, Andrew</creator><creatorcontrib>Ragano, Alessandro ; Benetos, Emmanouil ; Chinen, Michael ; Martinez, Helard Becerra ; Reddy, Chandan K A ; Skoglund, Jan ; Hines, Andrew</creatorcontrib><description>Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. 
Results show that SSL-based models show the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models since potential issues hidden in the data could bias the evaluated performances.</description><identifier>EISSN: 2688-1454</identifier><identifier>EISBN: 9798350340570</identifier><identifier>DOI: 10.1109/ISSC59246.2023.10162088</identifier><language>eng</language><publisher>IEEE</publisher><subject>Correlation ; Deep learning ; Natural languages ; Predictive models ; Self-supervised learning ; speech quality prediction ; speech synthesis ; Training ; Training data</subject><ispartof>2023 34th Irish Signals and Systems Conference (ISSC), 2023, p.1-6</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10162088$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,777,781,786,787,27906,54536,54913</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10162088$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Ragano, Alessandro</creatorcontrib><creatorcontrib>Benetos, Emmanouil</creatorcontrib><creatorcontrib>Chinen, Michael</creatorcontrib><creatorcontrib>Martinez, Helard Becerra</creatorcontrib><creatorcontrib>Reddy, Chandan K A</creatorcontrib><creatorcontrib>Skoglund, Jan</creatorcontrib><creatorcontrib>Hines, Andrew</creatorcontrib><title>A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality</title><title>2023 34th Irish Signals and Systems Conference (ISSC)</title><addtitle>ISSC</addtitle><description>Speech synthesis quality prediction has made remarkable progress with the 
development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models show the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models since potential issues hidden in the data could bias the evaluated performances.</description><subject>Correlation</subject><subject>Deep learning</subject><subject>Natural languages</subject><subject>Predictive models</subject><subject>Self-supervised learning</subject><subject>speech quality prediction</subject><subject>speech synthesis</subject><subject>Training</subject><subject>Training data</subject><issn>2688-1454</issn><isbn>9798350340570</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2023</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1z9FKwzAUgOEoCI65NxDMC3Sek_S0yeWoTguTKdXrkbUnLrK1JakXe3sv1Kv_7oNfiDuEJSLY-7ppKrIqL5YKlF4iYKHAmAuxsKU1mkDnQCVcipkqjMkwp_xaLFL6AgBFqJUtZqJeyWo4jS6GNPRy6-UD8yg37GIf-k_5sm3ka-QutNMQk1wPUTYjc3uQzbmfDpxCkm_f7him84248u6YePHXufhYP75Xz9lm-1RXq00WFOkpY0Bdds47W3SOyBn2BG2BHqlVHRMb1-oSvQKiHBXlxnu0ypW8x07vlZ6L2183MPNujOHk4nn3f69_AKMmThU</recordid><startdate>20230101</startdate><enddate>20230101</enddate><creator>Ragano, Alessandro</creator><creator>Benetos, Emmanouil</creator><creator>Chinen, Michael</creator><creator>Martinez, Helard Becerra</creator><creator>Reddy, Chandan K A</creator><creator>Skoglund, Jan</creator><creator>Hines, 
Andrew</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>20230101</creationdate><title>A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality</title><author>Ragano, Alessandro ; Benetos, Emmanouil ; Chinen, Michael ; Martinez, Helard Becerra ; Reddy, Chandan K A ; Skoglund, Jan ; Hines, Andrew</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i253t-e0137dafa96da55a8ef50c61f15c2de5e8ac371f2055412548ff192a7eb1d3b23</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Correlation</topic><topic>Deep learning</topic><topic>Natural languages</topic><topic>Predictive models</topic><topic>Self-supervised learning</topic><topic>speech quality prediction</topic><topic>speech synthesis</topic><topic>Training</topic><topic>Training data</topic><toplevel>online_resources</toplevel><creatorcontrib>Ragano, Alessandro</creatorcontrib><creatorcontrib>Benetos, Emmanouil</creatorcontrib><creatorcontrib>Chinen, Michael</creatorcontrib><creatorcontrib>Martinez, Helard Becerra</creatorcontrib><creatorcontrib>Reddy, Chandan K A</creatorcontrib><creatorcontrib>Skoglund, Jan</creatorcontrib><creatorcontrib>Hines, Andrew</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ragano, Alessandro</au><au>Benetos, Emmanouil</au><au>Chinen, Michael</au><au>Martinez, Helard 
Becerra</au><au>Reddy, Chandan K A</au><au>Skoglund, Jan</au><au>Hines, Andrew</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality</atitle><btitle>2023 34th Irish Signals and Systems Conference (ISSC)</btitle><stitle>ISSC</stitle><date>2023-01-01</date><risdate>2023</risdate><spage>1</spage><epage>6</epage><pages>1-6</pages><eissn>2688-1454</eissn><eisbn>9798350340570</eisbn><abstract>Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models show the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models since potential issues hidden in the data could bias the evaluated performances.</abstract><pub>IEEE</pub><doi>10.1109/ISSC59246.2023.10162088</doi><tpages>6</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier EISSN: 2688-1454
ispartof 2023 34th Irish Signals and Systems Conference (ISSC), 2023, p.1-6
issn 2688-1454
language eng
recordid cdi_ieee_primary_10162088
source IEEE Xplore All Conference Series
subjects Correlation
Deep learning
Natural languages
Predictive models
Self-supervised learning
speech quality prediction
speech synthesis
Training
Training data
title A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T04%3A50%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Comparison%20Of%20Deep%20Learning%20MOS%20Predictors%20For%20Speech%20Synthesis%20Quality&rft.btitle=2023%2034th%20Irish%20Signals%20and%20Systems%20Conference%20(ISSC)&rft.au=Ragano,%20Alessandro&rft.date=2023-01-01&rft.spage=1&rft.epage=6&rft.pages=1-6&rft.eissn=2688-1454&rft_id=info:doi/10.1109/ISSC59246.2023.10162088&rft.eisbn=9798350340570&rft_dat=%3Cieee_CHZPO%3E10162088%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i253t-e0137dafa96da55a8ef50c61f15c2de5e8ac371f2055412548ff192a7eb1d3b23%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10162088&rfr_iscdi=true