
Reducing the GAP Between Streaming and Non-Streaming Transducer-Based ASR by Adaptive Two-Stage Knowledge Distillation

The transducer is one of the mainstream frameworks for streaming speech recognition, but there is a performance gap between streaming and non-streaming transducer models because the streaming model sees only limited context. An effective way to reduce this gap is to make the two models' hidden and output distributions consistent, which can be achieved by hierarchical knowledge distillation. It is difficult, however, to enforce both consistencies at once, because learning the output distribution depends on the hidden one. This paper proposes an adaptive two-stage knowledge distillation method consisting of hidden-layer learning followed by output-layer learning. In the first stage, the streaming model learns hidden representations from a full-context teacher with a mean squared error loss; in the second, a power-transformation-based adaptive smoothness method is used to learn a stable output distribution. On the LibriSpeech corpus, the method achieves a 19% relative reduction in word error rate and a faster first-token response than the original streaming model.
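
The two stages described in the abstract map naturally onto two distillation losses. The PyTorch sketch below is illustrative only: it pairs a mean-squared-error loss on hidden states (stage one) with a KL loss against power-transformed, i.e. smoothed, teacher output probabilities (stage two). The function names and the fixed exponent gamma are assumptions for illustration; the paper's adaptive schedule for the smoothing exponent is not given here, so a constant stands in for it.

    import torch
    import torch.nn.functional as F

    def hidden_distill_loss(student_hidden: torch.Tensor,
                            teacher_hidden: torch.Tensor) -> torch.Tensor:
        # Stage 1: match the streaming encoder's hidden states to the
        # full-context (non-streaming) teacher's with mean squared error.
        return F.mse_loss(student_hidden, teacher_hidden)

    def power_smoothed_targets(teacher_logits: torch.Tensor,
                               gamma: float = 0.5,
                               eps: float = 1e-8) -> torch.Tensor:
        # Power transformation: raise the teacher's output probabilities to
        # an exponent gamma in (0, 1] and renormalize. Smaller gamma flattens
        # (smooths) the distribution, similar in effect to temperature
        # scaling. A fixed gamma is a placeholder for an adaptive schedule.
        p = F.softmax(teacher_logits, dim=-1)
        p_pow = p.clamp_min(eps).pow(gamma)
        return p_pow / p_pow.sum(dim=-1, keepdim=True)

    def output_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            gamma: float = 0.5) -> torch.Tensor:
        # Stage 2: KL divergence between the student's output distribution
        # and the power-smoothed teacher targets.
        log_q = F.log_softmax(student_logits, dim=-1)
        targets = power_smoothed_targets(teacher_logits, gamma)
        return F.kl_div(log_q, targets, reduction="batchmean")

In a two-stage recipe of this kind, stage one would combine hidden_distill_loss with the usual transducer loss, after which training switches to output_distill_loss.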

Bibliographic Details
Main Authors: Tang, Haitao; Fu, Yu; Sun, Lei; Xue, Jiabin; Liu, Dan; Li, Yongchao; Ma, Zhiqiang; Wu, Minghui; Pan, Jia; Wan, Genshun; Zhao, Ming'En
Format: Conference Proceeding
Language: English
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p. 1-5
EISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10095040
Subjects: Adaptation models; Conformer Transducer; Error analysis; Knowledge Distillation; Mean square error methods; Power Transformation; Signal processing; Speech recognition; Temperature distribution; Transducers
Source: IEEE Xplore All Conference Series
Online Access: Request full text