
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
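
For readers of this record, the following is a minimal PyTorch sketch contrasting the two training objectives the abstract describes. It is an illustration under stated assumptions, not the authors' released implementation (see the linked repository for that): the function names, the `is_speech` mask, and the smoothing factor `epsilon` are hypothetical, and the paper's exact smoothing value may differ.

```python
import torch
import torch.nn.functional as F

def loss_masking_ce(logits: torch.Tensor, targets: torch.Tensor,
                    is_speech: torch.Tensor) -> torch.Tensor:
    """Loss Masking baseline: cross-entropy over text positions only.

    logits: (N, V) unnormalized scores; targets: (N,) token ids;
    is_speech: (N,) bool mask marking speech-token positions.
    Speech positions are excluded, so their dependencies go unmodeled.
    """
    masked_targets = targets.masked_fill(is_speech, -100)
    return F.cross_entropy(logits, masked_targets, ignore_index=-100)

def sld_loss(logits: torch.Tensor, targets: torch.Tensor,
             epsilon: float = 0.1) -> torch.Tensor:
    """Smoothed Label Distillation (sketch): KL divergence against a
    smoothed one-hot target, applied on speech-token positions instead
    of masking them out. epsilon is an assumed smoothing factor.
    """
    vocab_size = logits.size(-1)
    # Smoothed target: 1 - epsilon on the gold token, epsilon spread
    # uniformly over the remaining vocab_size - 1 entries.
    smoothed = torch.full_like(logits, epsilon / (vocab_size - 1))
    smoothed.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)
    return F.kl_div(F.log_softmax(logits, dim=-1), smoothed,
                    reduction="batchmean")
```

The contrast the paper draws is visible here: loss masking sends no gradient through speech-token predictions, while SLD supervises them with a softened distribution, so the decoder still learns dependencies among speech tokens without the inconsistency the authors observed for plain cross-entropy.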

Bibliographic Details
Published in: arXiv.org, 2024-02
Main Authors: Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Zhang, Shiliang, Deng, Chong, Ma, Yukun, Yu, Hai, Liu, Jiaqing, Zhang, Chong
Format: Article
Language: English
Subjects: Discretization; Distillation; Entropy; Labels; Masking; Speech; Transformers
EISSN: 2331-8422