IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a...
Published in: | arXiv.org 2024-03 |
---|---|
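Main Authors: | Tahir Javed; Janki Atul Nawale; Eldho Ittan George; Joshi, Sakshi; Kaushal, Santosh Bhogale; Mehendale, Deovrat; Sethi, Ishvinder Virender; Ananthanarayanan, Aparna; Faquih, Hafsah; Palit, Pratiti; Ravishankar, Sneha; Sukumaran, Saranya; Panchagnula, Tripura; Sunjay Murali; Gandhi, Kunal Sharad; Ambujavalli, R; Manickam, K M; Vaijayanthi, C Venkata; Krishnan Srinivasa Raghavan Karunganni; Kumar, Pratyush; Khapra, Mitesh M |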
Format: | Article |
Language: | English |
Subjects: | Data collection; Datasets; Guidelines; Languages; Multilingualism; Quality control; Speech |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Tahir Javed; Janki Atul Nawale; Eldho Ittan George; Joshi, Sakshi; Kaushal, Santosh Bhogale; Mehendale, Deovrat; Sethi, Ishvinder Virender; Ananthanarayanan, Aparna; Faquih, Hafsah; Palit, Pratiti; Ravishankar, Sneha; Sukumaran, Saranya; Panchagnula, Tripura; Sunjay Murali; Gandhi, Kunal Sharad; Ambujavalli, R; Manickam, K M; Vaijayanthi, C Venkata; Krishnan Srinivasa Raghavan Karunganni; Kumar, Pratyush; Khapra, Mitesh M |
description | We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available. |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
rights | 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
creationdate | 2024-03-04 |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-03 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2937457884 |
source | Publicly Available Content Database |
subjects | Data collection; Datasets; Guidelines; Languages; Multilingualism; Quality control; Speech |
title | IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T06%3A44%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=IndicVoices:%20Towards%20building%20an%20Inclusive%20Multilingual%20Speech%20Dataset%20for%20Indian%20Languages&rft.jtitle=arXiv.org&rft.au=Tahir%20Javed&rft.date=2024-03-04&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2937457884%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29374578843%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2937457884&rft_id=info:pmid/&rfr_iscdi=true |