A low latency sequential model and its user-focused evaluation for automatic punctuation of ASR closed captions
Published in: | Computer speech & language 2020-09, Vol.63, p.101076, Article 101076 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | • Low latency, real-time automatic punctuation model.
• RNN-based punctuation outperforms the MaxEnt baseline.
• Subjective tests confirm that humans prefer punctuated captions.
• Deaf or hard of hearing users prefer automatic punctuation even more.
In Automatic Speech Recognition (ASR), inserting punctuation marks into the hypothesized word sequence has long been given low priority, as efforts were concentrated on minimizing word error rates. Punctuation, however, also has a high impact on the transcription quality perceived by users. Prosody, textual context and their combination have since been used successfully for automatic punctuation of ASR output, and the recently proposed RNN-based solutions show encouraging performance. We believe that punctuation technology currently has two bottlenecks: on the one hand, complex punctuation models have high latency and are therefore unsuitable for use cases with real-time requirements; on the other hand, punctuation efforts have not been validated against human perception and user impression. This paper proposes a lightweight yet powerful RNN punctuation model for on-line (real-time, low-latency) environments, and assesses user opinion, both in general and for target users living with hearing loss or impairment. The proposed on-line RNN punctuation model is evaluated against a Maximum Entropy (MaxEnt) baseline for Hungarian and for English, whereas subjective assessment tests are carried out on real broadcast data subtitled with ASR (closed captioning). As can be expected, the RNN outperforms the MaxEnt baseline system, but not the off-line systems: limiting the future context to minimize latency results in only a slight performance drop, whereas ASR errors influence punctuation performance considerably. A genre analysis of punctuation performance is also carried out, showing that both recognition and punctuation of more spontaneous speech styles are challenging. Overall, the subjective tests confirmed that users perceive a significant quality improvement when punctuation is added, even in the presence of word errors and even though the automatic punctuation may itself contain further errors.
For users living with hearing loss or deafness, an even higher, clear preference for the punctuated captions could be confirmed. |
---|---|
ISSN: | 0885-2308 1095-8363 |
DOI: | 10.1016/j.csl.2020.101076 |