Speech quality estimation with deep lattice networks
Published in: | The Journal of the Acoustical Society of America 2021-06, Vol.149 (6), p.3851-3861 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Intrusive subjective speech quality estimation of mean opinion score (MOS) often involves mapping a raw similarity score extracted from differences between the clean and degraded utterance onto MOS with a fitted mapping function. More recent models such as support vector regression (SVR) or deep neural networks use multidimensional input, which allows for a more accurate prediction than one-dimensional (1-D) mappings but does not provide the monotonic property that is expected between similarity and quality. We investigate a multidimensional mapping function using deep lattice networks (DLNs) to provide monotonic constraints with input features provided by ViSQOL. The DLN improved the speech mapping to 0.24 mean-square error on a mixture of datasets that include voice over IP and codec degradations, outperforming the 1-D fitted functions and SVR as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well-calibrated and a useful measure of uncertainty. The quantile function provides an improved mapping of data-driven similarity representations to human-interpretable scales, such as quantile intervals for predictions instead of point estimates. |
---|---|
ISSN: | 0001-4966, 1520-8524 |
DOI: | 10.1121/10.0005130 |
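
The summary mentions learning a quantile function so that predictions can be reported as quantile intervals rather than point estimates. As a rough illustration of that idea only, the sketch below uses the pinball (quantile) loss, which is the standard objective for quantile regression; the record does not say which loss the authors used, so this is an assumption. The MOS values, the candidate grid, and the helper names `pinball_loss` and `best_constant_quantile` are hypothetical, and a constant predictor stands in for the deep lattice network.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant_quantile(y_true, tau, candidates):
    """Return the candidate constant that minimizes the pinball loss."""
    n = len(y_true)
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * n, tau))


# Hypothetical MOS ratings on the 1-5 scale, made up for illustration.
mos = [1.8, 2.4, 2.9, 3.1, 3.3, 3.6, 3.8, 4.1, 4.4, 4.7]
# Candidate predictions from 1.0 to 5.0 in steps of 0.1.
grid = [1.0 + 0.1 * i for i in range(41)]

# Fitting at several quantile levels gives an interval around the median.
lower = best_constant_quantile(mos, 0.1, grid)
median = best_constant_quantile(mos, 0.5, grid)
upper = best_constant_quantile(mos, 0.9, grid)
print(f"10%-90% interval: [{lower:.1f}, {upper:.1f}], median: {median:.1f}")
```

In a learned model the constant would be replaced by a function of the input features and the quantile level, but the asymmetric weighting shown here is what pushes each fit toward the requested quantile.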