Performance vs. hardware requirements in state-of-the-art automatic speech recognition
Published in: | EURASIP journal on audio, speech, and music processing, 2021-07, Vol.2021 (1), p.1-30, Article 28 |
---|---|
Main Authors: | Georgescu, Alexandru-Lucian; Pappalardo, Alessandro; Cucu, Horia; Blott, Michaela |
Format: | Article |
Language: | English |
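Subjects: | Acoustics; Artificial neural networks; Automatic speech recognition; Deep learning; End-to-end ASR systems; Engineering; Engineering Acoustics; Evolution; Hardware; Machine learning; Mathematics in Music; Performance analysis; Review; Signal, Image and Speech Processing; Speech recognition; Survey; Tradeoffs; Transcription; Voice recognition; Waveforms |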
cited_by | cdi_FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013 |
---|---|
cites | cdi_FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013 |
container_end_page | 30 |
container_issue | 1 |
container_start_page | 1 |
container_title | EURASIP journal on audio, speech, and music processing |
container_volume | 2021 |
creator | Georgescu, Alexandru-Lucian; Pappalardo, Alessandro; Cucu, Horia; Blott, Michaela |
description | The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems. |
doi_str_mv | 10.1186/s13636-021-00217-4 |
format | article |
fullrecord | <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_5791e9a5e49f4289b92701df43133aab</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_5791e9a5e49f4289b92701df43133aab</doaj_id><sourcerecordid>2553619453</sourcerecordid><originalsourceid>FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013</originalsourceid><addsrcrecordid>eNp9UcFO3DAQjVCRoMAPcIrE2dTjsZ34iBAtSEjtoe3VmnUmu1mx8WJ7qfr3NaRqe-pl5mn03psnvaa5BHkN0NsPGdCiFVKBkHV0Qh81p2D7Cjql3v2DT5r3OW-lNGi0Om2-f-E0xrSjOXD7kq_bDaXhByVuEz8fpsQ7nktup7nNhQqLOIqyYUGptHQocUdlCm3eM4dNVYS4nqcyxfm8OR7pKfPF733WfPt49_X2Xjx-_vRwe_MoglauCIskg7LDEHqnHRGaQVIggM6uHGIwKoCxbLTRWiH3EDAQ6X4IoDBIwLPmYfEdIm39Pk07Sj99pMm_HWJa-xp1Ck_sTeeAHRnWbtSqdyunOgnDqBEQiVbV62rx2qf4fOBc_DYe0lzje2UMWnDaYGWphRVSzDnx-OcrSP_ahV-68LUG_9aF11WEiyhX8rzm9Nf6P6pf1EGLyQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2553619453</pqid></control><display><type>article</type><title>Performance vs. hardware requirements in state-of-the-art automatic speech recognition</title><source>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</source><source>Springer Nature - SpringerLink Journals - Fully Open Access</source><source>Linguistics and Language Behavior Abstracts (LLBA)</source><creator>Georgescu, Alexandru-Lucian ; Pappalardo, Alessandro ; Cucu, Horia ; Blott, Michaela</creator><creatorcontrib>Georgescu, Alexandru-Lucian ; Pappalardo, Alessandro ; Cucu, Horia ; Blott, Michaela</creatorcontrib><description>The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. 
ASR systems evolved from pipeline-based systems that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.</description><identifier>ISSN: 1687-4722</identifier><identifier>ISSN: 1687-4714</identifier><identifier>EISSN: 1687-4722</identifier><identifier>DOI: 10.1186/s13636-021-00217-4</identifier><language>eng</language><publisher>Cham: Springer International Publishing</publisher><subject>Acoustics ; Artificial neural networks ; Automatic speech recognition ; Deep learning ; End-to-end ASR systems ; Engineering ; Engineering Acoustics ; Evolution ; Hardware ; Machine learning ; Mathematics in Music ; Performance analysis ; Review ; Signal,Image and Speech Processing ; Speech recognition ; Survey ; Tradeoffs ; Transcription ; Voice recognition ; Waveforms</subject><ispartof>EURASIP journal on audio, speech, and music processing, 2021-07, Vol.2021 (1), p.1-30, Article 28</ispartof><rights>The Author(s) 2021</rights><rights>The Author(s) 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013</citedby><cites>FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013</cites><orcidid>0000-0003-2122-4997</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2553619453/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2553619453?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,12849,25751,27922,27923,31267,37010,44588,74896</link.rule.ids></links><search><creatorcontrib>Georgescu, Alexandru-Lucian</creatorcontrib><creatorcontrib>Pappalardo, Alessandro</creatorcontrib><creatorcontrib>Cucu, Horia</creatorcontrib><creatorcontrib>Blott, Michaela</creatorcontrib><title>Performance vs. hardware requirements in state-of-the-art automatic speech recognition</title><title>EURASIP journal on audio, speech, and music processing</title><addtitle>J AUDIO SPEECH MUSIC PROC</addtitle><description>The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. 
However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.</description><subject>Acoustics</subject><subject>Artificial neural networks</subject><subject>Automatic speech recognition</subject><subject>Deep learning</subject><subject>End-to-end ASR systems</subject><subject>Engineering</subject><subject>Engineering Acoustics</subject><subject>Evolution</subject><subject>Hardware</subject><subject>Machine learning</subject><subject>Mathematics in Music</subject><subject>Performance analysis</subject><subject>Review</subject><subject>Signal,Image and Speech Processing</subject><subject>Speech recognition</subject><subject>Survey</subject><subject>Tradeoffs</subject><subject>Transcription</subject><subject>Voice 
recognition</subject><subject>Waveforms</subject><issn>1687-4722</issn><issn>1687-4714</issn><issn>1687-4722</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>7T9</sourceid><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNp9UcFO3DAQjVCRoMAPcIrE2dTjsZ34iBAtSEjtoe3VmnUmu1mx8WJ7qfr3NaRqe-pl5mn03psnvaa5BHkN0NsPGdCiFVKBkHV0Qh81p2D7Cjql3v2DT5r3OW-lNGi0Om2-f-E0xrSjOXD7kq_bDaXhByVuEz8fpsQ7nktup7nNhQqLOIqyYUGptHQocUdlCm3eM4dNVYS4nqcyxfm8OR7pKfPF733WfPt49_X2Xjx-_vRwe_MoglauCIskg7LDEHqnHRGaQVIggM6uHGIwKoCxbLTRWiH3EDAQ6X4IoDBIwLPmYfEdIm39Pk07Sj99pMm_HWJa-xp1Ck_sTeeAHRnWbtSqdyunOgnDqBEQiVbV62rx2qf4fOBc_DYe0lzje2UMWnDaYGWphRVSzDnx-OcrSP_ahV-68LUG_9aF11WEiyhX8rzm9Nf6P6pf1EGLyQ</recordid><startdate>20210721</startdate><enddate>20210721</enddate><creator>Georgescu, Alexandru-Lucian</creator><creator>Pappalardo, Alessandro</creator><creator>Cucu, Horia</creator><creator>Blott, Michaela</creator><general>Springer International Publishing</general><general>Springer Nature B.V</general><general>SpringerOpen</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7T9</scope><scope>8FE</scope><scope>8FG</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>P5Z</scope><scope>P62</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-2122-4997</orcidid></search><sort><creationdate>20210721</creationdate><title>Performance vs. 
hardware requirements in state-of-the-art automatic speech recognition</title><author>Georgescu, Alexandru-Lucian ; Pappalardo, Alessandro ; Cucu, Horia ; Blott, Michaela</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Acoustics</topic><topic>Artificial neural networks</topic><topic>Automatic speech recognition</topic><topic>Deep learning</topic><topic>End-to-end ASR systems</topic><topic>Engineering</topic><topic>Engineering Acoustics</topic><topic>Evolution</topic><topic>Hardware</topic><topic>Machine learning</topic><topic>Mathematics in Music</topic><topic>Performance analysis</topic><topic>Review</topic><topic>Signal,Image and Speech Processing</topic><topic>Speech recognition</topic><topic>Survey</topic><topic>Tradeoffs</topic><topic>Transcription</topic><topic>Voice recognition</topic><topic>Waveforms</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Georgescu, Alexandru-Lucian</creatorcontrib><creatorcontrib>Pappalardo, Alessandro</creatorcontrib><creatorcontrib>Cucu, Horia</creatorcontrib><creatorcontrib>Blott, Michaela</creatorcontrib><collection>SpringerOpen (Open Access)</collection><collection>CrossRef</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Advanced Technologies &amp; Aerospace Database (1962 - current)</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest 
Central</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Directory of Open Access Journals</collection><jtitle>EURASIP journal on audio, speech, and music processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Georgescu, Alexandru-Lucian</au><au>Pappalardo, Alessandro</au><au>Cucu, Horia</au><au>Blott, Michaela</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Performance vs. hardware requirements in state-of-the-art automatic speech recognition</atitle><jtitle>EURASIP journal on audio, speech, and music processing</jtitle><stitle>J AUDIO SPEECH MUSIC PROC</stitle><date>2021-07-21</date><risdate>2021</risdate><volume>2021</volume><issue>1</issue><spage>1</spage><epage>30</epage><pages>1-30</pages><artnum>28</artnum><issn>1687-4722</issn><issn>1687-4714</issn><eissn>1687-4722</eissn><abstract>The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. 
However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.</abstract><cop>Cham</cop><pub>Springer International Publishing</pub><doi>10.1186/s13636-021-00217-4</doi><tpages>30</tpages><orcidid>https://orcid.org/0000-0003-2122-4997</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1687-4722 |
ispartof | EURASIP journal on audio, speech, and music processing, 2021-07, Vol.2021 (1), p.1-30, Article 28 |
issn | 1687-4722; 1687-4714 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_5791e9a5e49f4289b92701df43133aab |
source | Publicly Available Content Database (Proquest) (PQ_SDU_P3); Springer Nature - SpringerLink Journals - Fully Open Access; Linguistics and Language Behavior Abstracts (LLBA) |
subjects | Acoustics; Artificial neural networks; Automatic speech recognition; Deep learning; End-to-end ASR systems; Engineering; Engineering Acoustics; Evolution; Hardware; Machine learning; Mathematics in Music; Performance analysis; Review; Signal, Image and Speech Processing; Speech recognition; Survey; Tradeoffs; Transcription; Voice recognition; Waveforms |
title | Performance vs. hardware requirements in state-of-the-art automatic speech recognition |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T11%3A35%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Performance%20vs.%20hardware%20requirements%20in%20state-of-the-art%20automatic%20speech%20recognition&rft.jtitle=EURASIP%20journal%20on%20audio,%20speech,%20and%20music%20processing&rft.au=Georgescu,%20Alexandru-Lucian&rft.date=2021-07-21&rft.volume=2021&rft.issue=1&rft.spage=1&rft.epage=30&rft.pages=1-30&rft.artnum=28&rft.issn=1687-4722&rft.eissn=1687-4722&rft_id=info:doi/10.1186/s13636-021-00217-4&rft_dat=%3Cproquest_doaj_%3E2553619453%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2553619453&rft_id=info:pmid/&rfr_iscdi=true |