A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality
Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2...
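Main Authors: | Ragano, Alessandro; Benetos, Emmanouil; Chinen, Michael; Martinez, Helard Becerra; Reddy, Chandan K A; Skoglund, Jan; Hines, Andrew |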
---|---|
Format: | Conference Proceeding |
Language: | English |
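Subjects: | Correlation; Deep learning; Natural languages; Predictive models; Self-supervised learning; speech quality prediction; speech synthesis; Training; Training data |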
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 6 |
container_issue | |
container_start_page | 1 |
container_title | |
container_volume | |
creator | Ragano, Alessandro; Benetos, Emmanouil; Chinen, Michael; Martinez, Helard Becerra; Reddy, Chandan K A; Skoglund, Jan; Hines, Andrew |
description | Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models achieve higher correlation and lower mean squared error than supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models, since potential issues hidden in the data could bias the evaluated performance. |
doi_str_mv | 10.1109/ISSC59246.2023.10162088 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2688-1454 |
ispartof | 2023 34th Irish Signals and Systems Conference (ISSC), 2023, p.1-6 |
issn | 2688-1454 |
language | eng |
recordid | cdi_ieee_primary_10162088 |
source | IEEE Xplore All Conference Series |
subjects | Correlation; Deep learning; Natural languages; Predictive models; Self-supervised learning; speech quality prediction; speech synthesis; Training; Training data |
title | A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T04%3A50%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Comparison%20Of%20Deep%20Learning%20MOS%20Predictors%20For%20Speech%20Synthesis%20Quality&rft.btitle=2023%2034th%20Irish%20Signals%20and%20Systems%20Conference%20(ISSC)&rft.au=Ragano,%20Alessandro&rft.date=2023-01-01&rft.spage=1&rft.epage=6&rft.pages=1-6&rft.eissn=2688-1454&rft_id=info:doi/10.1109/ISSC59246.2023.10162088&rft.eisbn=9798350340570&rft_dat=%3Cieee_CHZPO%3E10162088%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i253t-e0137dafa96da55a8ef50c61f15c2de5e8ac371f2055412548ff192a7eb1d3b23%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10162088&rfr_iscdi=true |