IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a...

Bibliographic Details
Published in: arXiv.org 2024-03
Main Authors: Tahir Javed; Janki Atul Nawale; Eldho Ittan George; Joshi, Sakshi; Kaushal, Santosh Bhogale; Mehendale, Deovrat; Sethi, Ishvinder Virender; Ananthanarayanan, Aparna; Faquih, Hafsah; Palit, Pratiti; Ravishankar, Sneha; Sukumaran, Saranya; Panchagnula, Tripura; Sunjay Murali; Gandhi, Kunal Sharad; Ambujavalli, R; Manickam, K M; Vaijayanthi, C Venkata; Krishnan Srinivasa Raghavan Karunganni; Kumar, Pratyush; Khapra, Mitesh M
Format: Article
Language: English
Subjects: Data collection; Datasets; Guidelines; Languages; Multilingualism; Quality control; Speech
container_title arXiv.org
creator Tahir Javed
Janki Atul Nawale
Eldho Ittan George
Joshi, Sakshi
Kaushal, Santosh Bhogale
Mehendale, Deovrat
Sethi, Ishvinder Virender
Ananthanarayanan, Aparna
Faquih, Hafsah
Palit, Pratiti
Ravishankar, Sneha
Sukumaran, Saranya
Panchagnula, Tripura
Sunjay Murali
Gandhi, Kunal Sharad
Ambujavalli, R
Manickam, K M
Vaijayanthi, C Venkata
Krishnan Srinivasa Raghavan Karunganni
Kumar, Pratyush
Khapra, Mitesh M
description We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2937457884</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2937457884</sourcerecordid><originalsourceid>FETCH-proquest_journals_29374578843</originalsourceid><addsrcrecordid>eNqNjM0KgkAUhYcgSMp3uNBasBlNa9sPCbVK2rSQSUcbGWZsrlOvn0IP0OrAd75zJsSjjK2CNKJ0RnzENgxDuk5oHDOP3DNdyfJmZClwC7n5cFshPJxUldQNcA2ZLpVD-RZwcaqXasCOK7h2QpRP2POeo-ihNhbGq2Fw5qPRCFyQac0VCv-Xc7I8HvLdKeiseTmBfdEaZ_VQFXTDkihO0jRi_1lfc8xDFQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2937457884</pqid></control><display><type>article</type><title>IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages</title><source>Publicly Available Content Database</source><creator>Tahir Javed ; Janki Atul Nawale ; Eldho Ittan George ; Joshi, Sakshi ; Kaushal, Santosh Bhogale ; Mehendale, Deovrat ; Sethi, Ishvinder Virender ; Ananthanarayanan, Aparna ; Faquih, Hafsah ; Palit, Pratiti ; Ravishankar, Sneha ; Sukumaran, Saranya ; Panchagnula, Tripura ; Sunjay Murali ; Gandhi, Kunal Sharad ; Ambujavalli, R ; Manickam, K M ; Vaijayanthi, C Venkata ; Krishnan Srinivasa Raghavan Karunganni ; Kumar, Pratyush ; Khapra, Mitesh M</creator><creatorcontrib>Tahir Javed ; Janki Atul Nawale ; Eldho Ittan George ; Joshi, Sakshi ; Kaushal, Santosh Bhogale ; Mehendale, Deovrat ; Sethi, Ishvinder Virender ; Ananthanarayanan, Aparna ; Faquih, Hafsah ; Palit, Pratiti ; Ravishankar, Sneha ; Sukumaran, Saranya ; Panchagnula, Tripura ; Sunjay Murali ; Gandhi, Kunal Sharad ; Ambujavalli, R ; Manickam, K M ; Vaijayanthi, C Venkata ; Krishnan Srinivasa Raghavan Karunganni ; Kumar, Pratyush ; Khapra, Mitesh M</creatorcontrib><description>We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 
speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising of standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Data collection ; Datasets ; Guidelines ; Languages ; Multilingualism ; Quality control ; Speech</subject><ispartof>arXiv.org, 2024-03</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2937457884?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Tahir Javed</creatorcontrib><creatorcontrib>Janki Atul Nawale</creatorcontrib><creatorcontrib>Eldho Ittan George</creatorcontrib><creatorcontrib>Joshi, Sakshi</creatorcontrib><creatorcontrib>Kaushal, Santosh Bhogale</creatorcontrib><creatorcontrib>Mehendale, Deovrat</creatorcontrib><creatorcontrib>Sethi, Ishvinder Virender</creatorcontrib><creatorcontrib>Ananthanarayanan, Aparna</creatorcontrib><creatorcontrib>Faquih, Hafsah</creatorcontrib><creatorcontrib>Palit, Pratiti</creatorcontrib><creatorcontrib>Ravishankar, Sneha</creatorcontrib><creatorcontrib>Sukumaran, Saranya</creatorcontrib><creatorcontrib>Panchagnula, Tripura</creatorcontrib><creatorcontrib>Sunjay Murali</creatorcontrib><creatorcontrib>Gandhi, Kunal Sharad</creatorcontrib><creatorcontrib>Ambujavalli, R</creatorcontrib><creatorcontrib>Manickam, K M</creatorcontrib><creatorcontrib>Vaijayanthi, C Venkata</creatorcontrib><creatorcontrib>Krishnan Srinivasa Raghavan Karunganni</creatorcontrib><creatorcontrib>Kumar, Pratyush</creatorcontrib><creatorcontrib>Khapra, Mitesh M</creatorcontrib><title>IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages</title><title>arXiv.org</title><description>We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian 
districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising of standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available</description><subject>Data collection</subject><subject>Datasets</subject><subject>Guidelines</subject><subject>Languages</subject><subject>Multilingualism</subject><subject>Quality control</subject><subject>Speech</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNjM0KgkAUhYcgSMp3uNBasBlNa9sPCbVK2rSQSUcbGWZsrlOvn0IP0OrAd75zJsSjjK2CNKJ0RnzENgxDuk5oHDOP3DNdyfJmZClwC7n5cFshPJxUldQNcA2ZLpVD-RZwcaqXasCOK7h2QpRP2POeo-ihNhbGq2Fw5qPRCFyQac0VCv-Xc7I8HvLdKeiseTmBfdEaZ_VQFXTDkihO0jRi_1lfc8xDFQ</recordid><startdate>20240304</startdate><enddate>20240304</enddate><creator>Tahir Javed</creator><creator>Janki Atul Nawale</creator><creator>Eldho Ittan George</creator><creator>Joshi, Sakshi</creator><creator>Kaushal, Santosh Bhogale</creator><creator>Mehendale, 
Deovrat</creator><creator>Sethi, Ishvinder Virender</creator><creator>Ananthanarayanan, Aparna</creator><creator>Faquih, Hafsah</creator><creator>Palit, Pratiti</creator><creator>Ravishankar, Sneha</creator><creator>Sukumaran, Saranya</creator><creator>Panchagnula, Tripura</creator><creator>Sunjay Murali</creator><creator>Gandhi, Kunal Sharad</creator><creator>Ambujavalli, R</creator><creator>Manickam, K M</creator><creator>Vaijayanthi, C Venkata</creator><creator>Krishnan Srinivasa Raghavan Karunganni</creator><creator>Kumar, Pratyush</creator><creator>Khapra, Mitesh M</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240304</creationdate><title>IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages</title><author>Tahir Javed ; Janki Atul Nawale ; Eldho Ittan George ; Joshi, Sakshi ; Kaushal, Santosh Bhogale ; Mehendale, Deovrat ; Sethi, Ishvinder Virender ; Ananthanarayanan, Aparna ; Faquih, Hafsah ; Palit, Pratiti ; Ravishankar, Sneha ; Sukumaran, Saranya ; Panchagnula, Tripura ; Sunjay Murali ; Gandhi, Kunal Sharad ; Ambujavalli, R ; Manickam, K M ; Vaijayanthi, C Venkata ; Krishnan Srinivasa Raghavan Karunganni ; Kumar, Pratyush ; Khapra, Mitesh M</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_29374578843</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Data 
collection</topic><topic>Datasets</topic><topic>Guidelines</topic><topic>Languages</topic><topic>Multilingualism</topic><topic>Quality control</topic><topic>Speech</topic><toplevel>online_resources</toplevel><creatorcontrib>Tahir Javed</creatorcontrib><creatorcontrib>Janki Atul Nawale</creatorcontrib><creatorcontrib>Eldho Ittan George</creatorcontrib><creatorcontrib>Joshi, Sakshi</creatorcontrib><creatorcontrib>Kaushal, Santosh Bhogale</creatorcontrib><creatorcontrib>Mehendale, Deovrat</creatorcontrib><creatorcontrib>Sethi, Ishvinder Virender</creatorcontrib><creatorcontrib>Ananthanarayanan, Aparna</creatorcontrib><creatorcontrib>Faquih, Hafsah</creatorcontrib><creatorcontrib>Palit, Pratiti</creatorcontrib><creatorcontrib>Ravishankar, Sneha</creatorcontrib><creatorcontrib>Sukumaran, Saranya</creatorcontrib><creatorcontrib>Panchagnula, Tripura</creatorcontrib><creatorcontrib>Sunjay Murali</creatorcontrib><creatorcontrib>Gandhi, Kunal Sharad</creatorcontrib><creatorcontrib>Ambujavalli, R</creatorcontrib><creatorcontrib>Manickam, K M</creatorcontrib><creatorcontrib>Vaijayanthi, C Venkata</creatorcontrib><creatorcontrib>Krishnan Srinivasa Raghavan Karunganni</creatorcontrib><creatorcontrib>Kumar, Pratyush</creatorcontrib><creatorcontrib>Khapra, Mitesh M</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content 
Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tahir Javed</au><au>Janki Atul Nawale</au><au>Eldho Ittan George</au><au>Joshi, Sakshi</au><au>Kaushal, Santosh Bhogale</au><au>Mehendale, Deovrat</au><au>Sethi, Ishvinder Virender</au><au>Ananthanarayanan, Aparna</au><au>Faquih, Hafsah</au><au>Palit, Pratiti</au><au>Ravishankar, Sneha</au><au>Sukumaran, Saranya</au><au>Panchagnula, Tripura</au><au>Sunjay Murali</au><au>Gandhi, Kunal Sharad</au><au>Ambujavalli, R</au><au>Manickam, K M</au><au>Vaijayanthi, C Venkata</au><au>Krishnan Srinivasa Raghavan Karunganni</au><au>Kumar, Pratyush</au><au>Khapra, Mitesh M</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages</atitle><jtitle>arXiv.org</jtitle><date>2024-03-04</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. 
More specifically, we share an open-source blueprint for data collection at scale comprising of standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-03
issn 2331-8422
language eng
recordid cdi_proquest_journals_2937457884
source Publicly Available Content Database
subjects Data collection
Datasets
Guidelines
Languages
Multilingualism
Quality control
Speech
title IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T06%3A44%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=IndicVoices:%20Towards%20building%20an%20Inclusive%20Multilingual%20Speech%20Dataset%20for%20Indian%20Languages&rft.jtitle=arXiv.org&rft.au=Tahir%20Javed&rft.date=2024-03-04&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2937457884%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29374578843%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2937457884&rft_id=info:pmid/&rfr_iscdi=true