Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
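The abstract contrasts three per-token training objectives for the speech-token positions of a decoder-only model: ignoring them (Loss Masking), predicting them with ordinary one-hot cross-entropy, and the proposed Smoothed Label Distillation (SLD), a KL divergence against smoothed labels. The sketch below illustrates what those three objectives can look like in PyTorch. It is a minimal illustration, not the authors' released implementation (see the repository linked in the abstract); the tensor layout, the helper names, and the smoothing value `eps` are assumptions made for this example.

```python
# Minimal sketch (not the authors' code) of three per-token objectives for a
# decoder-only Transformer trained on mixed speech/text token sequences.
# Assumed shapes: logits (B, T, V), targets (B, T) of token ids,
# is_speech (B, T) boolean mask marking speech-token positions.
import torch
import torch.nn.functional as F


def loss_masking(logits, targets, is_speech):
    """Baseline: exclude speech-token positions from the loss entirely."""
    text_targets = targets.masked_fill(is_speech, -100)  # -100 = ignore_index
    return F.cross_entropy(logits.transpose(1, 2), text_targets, ignore_index=-100)


def full_cross_entropy(logits, targets):
    """Autoregressive modeling of speech and text alike with one-hot cross-entropy."""
    return F.cross_entropy(logits.transpose(1, 2), targets)


def smoothed_label_distillation(logits, targets, is_speech, eps=0.1):
    """SLD-style objective as described in the abstract: cross-entropy on text
    positions plus a KL divergence to smoothed labels on speech positions.
    The value of `eps` and the way the two terms are combined are assumptions."""
    B, T, V = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)

    # Text positions: ordinary cross-entropy.
    text_targets = targets.masked_fill(is_speech, -100)
    text_loss = F.cross_entropy(logits.transpose(1, 2), text_targets,
                                ignore_index=-100)

    # Speech positions: KL(smoothed labels || model). The target distribution
    # puts 1 - eps on the correct speech token and spreads eps over the rest.
    smoothed = torch.full((B, T, V), eps / (V - 1), device=logits.device)
    smoothed.scatter_(-1, targets.unsqueeze(-1), 1.0 - eps)
    kl = F.kl_div(log_probs, smoothed, reduction="none").sum(-1)  # (B, T)
    speech_mask = is_speech.float()
    speech_loss = (kl * speech_mask).sum() / speech_mask.sum().clamp(min=1.0)

    return text_loss + speech_loss
```

As the abstract reports, modeling speech tokens autoregressively with plain one-hot cross-entropy does not consistently beat Loss Masking, whereas the smoothed-label KL objective does, across different speech discretization methods.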
Published in: arXiv.org, 2024-02
Main Authors: Chen, Qian; Wang, Wen; Zhang, Qinglin; Zheng, Siqi; Zhang, Shiliang; Deng, Chong; Ma, Yukun; Yu, Hai; Liu, Jiaqing; Zhang, Chong
Format: Article
Language: English
Subjects: Discretization; Distillation; Entropy; Labels; Masking; Speech; Transformers
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org