
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
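
For readers of this record, the following is a minimal PyTorch sketch contrasting the two training objectives the abstract describes. It is an illustration under stated assumptions, not the authors' released implementation (see the linked repository for that): the function names, the `is_speech` mask, and the smoothing factor `epsilon` are hypothetical, and the paper's exact smoothing value may differ.

```python
import torch
import torch.nn.functional as F

def loss_masking_ce(logits: torch.Tensor, targets: torch.Tensor,
                    is_speech: torch.Tensor) -> torch.Tensor:
    """Loss Masking baseline: cross-entropy over text positions only.

    logits: (N, V) unnormalized scores; targets: (N,) token ids;
    is_speech: (N,) bool mask marking speech-token positions.
    Speech positions are excluded, so their dependencies go unmodeled.
    """
    masked_targets = targets.masked_fill(is_speech, -100)
    return F.cross_entropy(logits, masked_targets, ignore_index=-100)

def sld_loss(logits: torch.Tensor, targets: torch.Tensor,
             epsilon: float = 0.1) -> torch.Tensor:
    """Smoothed Label Distillation (sketch): KL divergence against a
    smoothed one-hot target, applied on speech-token positions instead
    of masking them out. epsilon is an assumed smoothing factor.
    """
    vocab_size = logits.size(-1)
    # Smoothed target: 1 - epsilon on the gold token, epsilon spread
    # uniformly over the remaining vocab_size - 1 entries.
    smoothed = torch.full_like(logits, epsilon / (vocab_size - 1))
    smoothed.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)
    return F.kl_div(F.log_softmax(logits, dim=-1), smoothed,
                    reduction="batchmean")
```

The contrast the paper draws is visible here: loss masking sends no gradient through speech-token predictions, while SLD supervises them with a softened distribution, so the decoder still learns dependencies among speech tokens without the inconsistency the authors observed for plain cross-entropy.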

Bibliographic Details
Published in: arXiv.org, 2024-02
Main Authors: Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Zhang, Shiliang, Deng, Chong, Ma, Yukun, Yu, Hai, Liu, Jiaqing, Zhang, Chong
Format: Article
Language: English
Subjects: Discretization; Distillation; Entropy; Labels; Masking; Speech; Transformers
EISSN: 2331-8422