
Performance vs. hardware requirements in state-of-the-art automatic speech recognition


Bibliographic Details
Published in: EURASIP journal on audio, speech, and music processing, 2021-07, Vol.2021 (1), p.1-30, Article 28
Main Authors: Georgescu, Alexandru-Lucian, Pappalardo, Alessandro, Cucu, Horia, Blott, Michaela
Format: Article
Language: English
container_end_page 30
container_issue 1
container_start_page 1
container_title EURASIP journal on audio, speech, and music processing
container_volume 2021
creator Georgescu, Alexandru-Lucian
Pappalardo, Alessandro
Cucu, Horia
Blott, Michaela
description The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to help decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.
doi_str_mv 10.1186/s13636-021-00217-4
format article
publisher Cham: Springer International Publishing
date 2021-07-21
artnum 28
tpages 30
stitle J AUDIO SPEECH MUSIC PROC
eissn 1687-4722
orcidid https://orcid.org/0000-0003-2122-4997
rights The Author(s) 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”).
oa free_for_read
toplevel peer_reviewed
sourcetype Open Website
fulltext fulltext
identifier ISSN: 1687-4722
ispartof EURASIP journal on audio, speech, and music processing, 2021-07, Vol.2021 (1), p.1-30, Article 28
issn 1687-4722
1687-4714
1687-4722
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_5791e9a5e49f4289b92701df43133aab
source Publicly Available Content Database (Proquest) (PQ_SDU_P3); Springer Nature - SpringerLink Journals - Fully Open Access; Linguistics and Language Behavior Abstracts (LLBA)
subjects Acoustics
Artificial neural networks
Automatic speech recognition
Deep learning
End-to-end ASR systems
Engineering
Engineering Acoustics
Evolution
Hardware
Machine learning
Mathematics in Music
Performance analysis
Review
Signal,Image and Speech Processing
Speech recognition
Survey
Tradeoffs
Transcription
Voice recognition
Waveforms
title Performance vs. hardware requirements in state-of-the-art automatic speech recognition
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T11%3A35%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Performance%20vs.%20hardware%20requirements%20in%20state-of-the-art%20automatic%20speech%20recognition&rft.jtitle=EURASIP%20journal%20on%20audio,%20speech,%20and%20music%20processing&rft.au=Georgescu,%20Alexandru-Lucian&rft.date=2021-07-21&rft.volume=2021&rft.issue=1&rft.spage=1&rft.epage=30&rft.pages=1-30&rft.artnum=28&rft.issn=1687-4722&rft.eissn=1687-4722&rft_id=info:doi/10.1186/s13636-021-00217-4&rft_dat=%3Cproquest_doaj_%3E2553619453%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c429t-63a0c26ddc8949aa35d0aca1176b933c52c156e5454423e81c3caa48dc123c013%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2553619453&rft_id=info:pmid/&rfr_iscdi=true