CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

Bibliographic Details
Published in: Language resources and evaluation, 2023-09, Vol. 57 (3), p. 1139-1171
Main Authors: Candido Junior, Arnaldo, Casanova, Edresson, Soares, Anderson, de Oliveira, Frederico Santos, Oliveira, Lucas, Junior, Ricardo Corso Fernandes, da Silva, Daniel Peixoto Pinto, Fayet, Fernando Gorgulho, Carlotto, Bruno Baldissera, Gris, Lucas Rafael Stefanel, Aluísio, Sandra Maria
Format: Article
Language: English
cites cdi_FETCH-LOGICAL-c314t-8a904658232cb9eb9d37626d121fab647b1981d51faea7809742adf8b4cbb4603
container_end_page 1171
container_issue 3
container_start_page 1139
container_title Language resources and evaluation
container_volume 57
creator Candido Junior, Arnaldo
Casanova, Edresson
Soares, Anderson
de Oliveira, Frederico Santos
Oliveira, Lucas
Junior, Ricardo Corso Fernandes
da Silva, Daniel Peixoto Pinto
Fayet, Fernando Gorgulho
Carlotto, Bruno Baldissera
Gris, Lucas Rafael Stefanel
Aluísio, Sandra Maria
description Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, around 376 h of audio were publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 h. The existing resources, however, are composed of audio containing only read and prepared speech. There is a lack of datasets that include spontaneous speech, which is essential in several ASR applications. This paper presents CORAA (Corpus of Annotated Audios) ASR with 290 h, a publicly available dataset for ASR in BP containing validated audio-transcription pairs. CORAA ASR also contains European Portuguese audio (4.6 h). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53, fine-tuned over CORAA ASR. Our model achieved a Word Error Rate (WER) of 24.18% on the CORAA ASR test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate (CER), we obtained 11.02% and 6.34% for CORAA ASR and Common Voice, respectively. The CORAA ASR corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
doi_str_mv 10.1007/s10579-022-09621-4
format article
addtitle Lang Resources & Evaluation
creationdate 2023
date 2023-09-01
eissn 1574-0218
genre article
oa free_for_read
orcid https://orcid.org/0000-0002-2967-6077
https://orcid.org/0000-0002-5885-6747
https://orcid.org/0000-0001-5108-2630
https://orcid.org/0000-0002-5647-0891
https://orcid.org/0000-0003-0160-7173
pqid 2853123860
publisher Dordrecht: Springer Netherlands
rights The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
toplevel peer_reviewed
online_resources
tpages 33
fulltext fulltext
identifier ISSN: 1574-020X
ispartof Language resources and evaluation, 2023-09, Vol.57 (3), p.1139-1171
issn 1574-020X
1574-0218
language eng
recordid cdi_proquest_journals_2853123860
source Art, Design and Architecture Collection; Social Science Premium Collection; Springer Nature; Linguistics Collection; ProQuest One Literature; Linguistics and Language Behavior Abstracts (LLBA)
subjects Automatic speech recognition
Brazilian Portuguese
Computational Linguistics
Computer Science
Corpus linguistics
Datasets
Error analysis
Language and Literature
Linguistics
Original Paper
Portuguese language
Social Sciences
Speech
Speech recognition
Spontaneous speech
Test sets
Transcription
Voice recognition
title CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T20%3A32%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CORAA%20ASR:%20a%20large%20corpus%20of%20spontaneous%20and%20prepared%20speech%20manually%20validated%20for%20speech%20recognition%20in%20Brazilian%20Portuguese&rft.jtitle=Language%20resources%20and%20evaluation&rft.au=Candido%20Junior,%20Arnaldo&rft.date=2023-09-01&rft.volume=57&rft.issue=3&rft.spage=1139&rft.epage=1171&rft.pages=1139-1171&rft.issn=1574-020X&rft.eissn=1574-0218&rft_id=info:doi/10.1007/s10579-022-09621-4&rft_dat=%3Cproquest_cross%3E2853123860%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c314t-8a904658232cb9eb9d37626d121fab647b1981d51faea7809742adf8b4cbb4603%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2853123860&rft_id=info:pmid/&rfr_iscdi=true
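
The description above reports Word Error Rate (WER) and Character Error Rate (CER) figures. As a rough illustration of how these two metrics are commonly defined (edit distance between hypothesis and reference, divided by the reference length), here is a minimal Python sketch. The function names are hypothetical and are not taken from the paper or its repository, and the text normalization applied by the authors before scoring is not stated in this record, so results from this sketch are not directly comparable with the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions and substitutions each cost 1.
    Works on any sequence: a list of words or a string of characters.
    """
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the corpus.
    print(wer("o gato preto dorme", "o gato dorme"))
    print(cer("casa", "caso"))
```

Because both metrics divide by the reference length, they can exceed 100% when the hypothesis contains many insertions. Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances rather than averaging per-utterance rates; which convention the paper uses is not stated in this record.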