Brazilian Portuguese Speech Recognition Using Wav2vec 2.0

Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence into a sequence of written words. Despite the progress in the area, speech recognition can still be consider...

Bibliographic Details
Published in:	arXiv.org 2021-12
Main Authors:	Lucas Rafael Stefanel Gris; Casanova, Edresson; Santos de Oliveira, Frederico; Anderson da Silva Soares; Arnaldo Candido Junior
Format:	Article
Language:	English
Subjects:	Audio data; Automatic speech recognition; Deep learning; Languages; Machine learning; Voice recognition

cited_by
cites
container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Lucas Rafael Stefanel Gris ; Casanova, Edresson ; Santos de Oliveira, Frederico ; Anderson da Silva Soares ; Arnaldo Candido Junior
description	Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence into a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of a public Automatic Speech Recognition (ASR) system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, pre-trained on many languages, on BP data. The final model achieves an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). To our knowledge, this is the lowest error among open end-to-end (E2E) ASR models for BP.
format	article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2555489151</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2555489151</sourcerecordid><originalsourceid>FETCH-proquest_journals_25554891513</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwdCpKrMrMyUzMUwjILyopTS9NLU5VCC5ITU3OUAhKTc5Pz8ssyczPUwgtzsxLVwhPLDMqS01WMNIz4GFgTUvMKU7lhdLcDMpuriHOHroFRfmFQFNK4rPyS4vygFLxQKtMTSwsDU0NjYlTBQBDhjW0</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2555489151</pqid></control><display><type>article</type><title>Brazilian Portuguese Speech Recognition Using Wav2vec 2.0</title><source>Publicly Available Content (ProQuest)</source><creator>Lucas Rafael Stefanel Gris ; Casanova, Edresson ; Santos de Oliveira, Frederico ; Anderson da Silva Soares ; Arnaldo Candido Junior</creator><creatorcontrib>Lucas Rafael Stefanel Gris ; Casanova, Edresson ; Santos de Oliveira, Frederico ; Anderson da Silva Soares ; Arnaldo Candido Junior</creatorcontrib><description>Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence in a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of an public Automatic Speech Recognition (ASR) system using only open available audio data, from the fine-tuning of the Wav2vec 2.0 XLSR-53 model pre-trained in many languages, over BP data. The final model presents an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). 
According to our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Audio data ; Automatic speech recognition ; Deep learning ; Languages ; Machine learning ; Voice recognition</subject><ispartof>arXiv.org, 2021-12</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2555489151?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Lucas Rafael Stefanel Gris</creatorcontrib><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><title>Brazilian Portuguese Speech Recognition Using Wav2vec 2.0</title><title>arXiv.org</title><description>Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence in a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). 
In this sense, this work presents the development of an public Automatic Speech Recognition (ASR) system using only open available audio data, from the fine-tuning of the Wav2vec 2.0 XLSR-53 model pre-trained in many languages, over BP data. The final model presents an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). According to our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.</description><subject>Audio data</subject><subject>Automatic speech recognition</subject><subject>Deep learning</subject><subject>Languages</subject><subject>Machine learning</subject><subject>Voice recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwdCpKrMrMyUzMUwjILyopTS9NLU5VCC5ITU3OUAhKTc5Pz8ssyczPUwgtzsxLVwhPLDMqS01WMNIz4GFgTUvMKU7lhdLcDMpuriHOHroFRfmFQFNK4rPyS4vygFLxQKtMTSwsDU0NjYlTBQBDhjW0</recordid><startdate>20211222</startdate><enddate>20211222</enddate><creator>Lucas Rafael Stefanel Gris</creator><creator>Casanova, Edresson</creator><creator>Santos de Oliveira, Frederico</creator><creator>Anderson da Silva Soares</creator><creator>Arnaldo Candido Junior</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20211222</creationdate><title>Brazilian Portuguese Speech Recognition Using Wav2vec 2.0</title><author>Lucas Rafael Stefanel Gris ; Casanova, Edresson ; Santos de Oliveira, Frederico ; Anderson 
da Silva Soares ; Arnaldo Candido Junior</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_25554891513</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Audio data</topic><topic>Automatic speech recognition</topic><topic>Deep learning</topic><topic>Languages</topic><topic>Machine learning</topic><topic>Voice recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Lucas Rafael Stefanel Gris</creatorcontrib><creatorcontrib>Casanova, Edresson</creatorcontrib><creatorcontrib>Santos de Oliveira, Frederico</creatorcontrib><creatorcontrib>Anderson da Silva Soares</creatorcontrib><creatorcontrib>Arnaldo Candido Junior</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lucas Rafael Stefanel Gris</au><au>Casanova, Edresson</au><au>Santos de Oliveira, Frederico</au><au>Anderson da Silva 
Soares</au><au>Arnaldo Candido Junior</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Brazilian Portuguese Speech Recognition Using Wav2vec 2.0</atitle><jtitle>arXiv.org</jtitle><date>2021-12-22</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence in a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of an public Automatic Speech Recognition (ASR) system using only open available audio data, from the fine-tuning of the Wav2vec 2.0 XLSR-53 model pre-trained in many languages, over BP data. The final model presents an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). According to our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2021-12
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2555489151
source	Publicly Available Content (ProQuest)
subjects	Audio data ; Automatic speech recognition ; Deep learning ; Languages ; Machine learning ; Voice recognition
title	Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T16%3A36%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Brazilian%20Portuguese%20Speech%20Recognition%20Using%20Wav2vec%202.0&rft.jtitle=arXiv.org&rft.au=Lucas%20Rafael%20Stefanel%20Gris&rft.date=2021-12-22&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2555489151%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_25554891513%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2555489151&rft_id=info:pmid/&rfr_iscdi=true