
The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing

Urdu is still considered a low-resource language despite ranking as the world's 10th most spoken language, with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to adopt more ingenious techniques to work around the problem. One such opt...

Bibliographic Details
Published in: IEEE access 2021, Vol.9, p.124478-124490
Main Authors: Ghafoor, Abdul; Imran, Ali Shariq; Daudpota, Sher Muhammad; Kastrati, Zenun; Abdullah; Batra, Rakhi; Wani, Mudasir Ahmad
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83
cites cdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83
container_end_page 124490
container_issue
container_start_page 124478
container_title IEEE access
container_volume 9
creator Ghafoor, Abdul
Imran, Ali Shariq
Daudpota, Sher Muhammad
Kastrati, Zenun
Abdullah
Batra, Rakhi
Wani, Mudasir Ahmad
description Urdu is still considered a low-resource language despite ranking as the world's 10th most spoken language, with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to adopt more ingenious techniques to work around the problem. One widely adopted option is to use language translation services to replicate existing datasets from resource-rich languages, such as English, in low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated via Google Translate from one language to another often change meaning. This results in a polarity shift that degrades system performance, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation on the sentiment classification task from a resource-rich language to a low-resource language. It identifies the words causing polarity shift and groups them into five distinct categories. It further finds the correlation between languages with similar roots. The study shows a performance degradation of 2-3 percentage points in sentiment classification due to polarity shift resulting from translation from resource-rich languages to low-resource languages.
doi_str_mv 10.1109/ACCESS.2021.3110285
format article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_9529190</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9529190</ieee_id><doaj_id>oai_doaj_org_article_cb57af4e422f4355ba6f88e373b735e9</doaj_id><sourcerecordid>2572667713</sourcerecordid><originalsourceid>FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83</originalsourceid><addsrcrecordid>eNpVkU1v3CAYhK2qlRql-QW5IOXsrQGDzXG1SduVXLVK3F4Rn15WjnEAK82_DxunUcsFGGYe6WWK4hJWGwgr9nm7293c3W1QheAGZwW15F1xhiBlJSaYvv_n_LG4iPFY5dVmiTRnxdwfDNjfz0Il4C3og5jiKJKbBnBrol-CMuWtUwdwLZKIJkWQPOj8Y_n3FXRiGhYxmAj6Q_DLcADflzG5snMnfQS9-ZPAz-CViTFLn4oPVozRXLzu58WvLzf97lvZ_fi63227UtU1TSVVQqjaUi20lqRVGGELW8pkrZFQylJpdR5Yap1vjGHcWk0YZlJqoqxt8XmxX7naiyOfg7sX4Yl74fiL4MPARUhOjYYrSRpha1MjZGtMiBTUtq3BDZYNJoZlVrmy4qOZF_kf7dr93r7QxmnhsKKspdl_tfrn4B8WExM_5r-a8rgckQZR2jQQZxdeXSr4GIOxb1xY8VOzfG2Wn5rlr83m1OWacsaYtwQjiEFW4We0xaGv</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2572667713</pqid></control><display><type>article</type><title>The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing</title><source>Linguistics and Language Behavior Abstracts (LLBA)</source><source>IEEE Xplore Open Access Journals</source><creator>Ghafoor, Abdul ; Imran, Ali Shariq ; Daudpota, Sher Muhammad ; Kastrati, Zenun ; Abdullah ; Batra, Rakhi ; Wani, Mudasir Ahmad</creator><creatorcontrib>Ghafoor, Abdul ; Imran, Ali Shariq ; Daudpota, Sher Muhammad ; Kastrati, Zenun ; Abdullah ; Batra, Rakhi ; Wani, Mudasir Ahmad</creatorcontrib><description>Urdu is still considered a low-resource language despite being ranked as world's &lt;inline-formula&gt; &lt;tex-math notation="LaTeX"&gt;10^{th} &lt;/tex-math&gt;&lt;/inline-formula&gt; most spoken language with nearly 230 million speakers. 
The scarcity of benchmark datasets in low-resource languages has led researchers to utilize more ingenious techniques to curb the issue. One such option widely adopted is to use language translation services to replicate existing datasets from resource-rich languages such as English to low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated via Google translator from one language to another often change the meaning. It results in a polarity shift causing the system's performance degradation, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation on the sentiment classification task from a resource-rich language to a low-resource language. It identifies and enlists words causing polarity shift into five distinct categories. It further finds the correlation between the language with similar roots. Our study shows 2-3 percentage points performance degradation in sentiment classification due to polarity shift as a result of translation from resource-rich languages to low-resource languages.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2021.3110285</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Analytical models ; BiLSTM ; Classification ; Conv1D ; Data models ; Datasets ; English ; German ; Hindi ; Informatik ; Information Systems ; Internet ; Language ; Language translation ; low resource language ; Machine translation ; Multilingual text processing ; Natural language processing ; Performance degradation ; Polarity ; polarity assessment ; Sentiment analysis ; sentiment classification ; Task analysis ; Text processing ; Translating ; Translation ; Translators ; Urdu ; Urdu language ; Word processing ; Words (language)</subject><ispartof>IEEE access, 2021, Vol.9, 
p.124478-124490</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83</citedby><cites>FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83</cites><orcidid>0000-0002-2416-2878 ; 0000-0001-6684-751X ; 0000-0002-2319-152X ; 0000-0003-2176-2509 ; 0000-0002-0199-2377</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9529190$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>230,314,776,780,881,4010,27610,27900,27901,27902,31246,54908</link.rule.ids><backlink>$$Uhttps://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-106986$$DView record from Swedish Publication Index$$Hfree_for_read</backlink></links><search><creatorcontrib>Ghafoor, Abdul</creatorcontrib><creatorcontrib>Imran, Ali Shariq</creatorcontrib><creatorcontrib>Daudpota, Sher Muhammad</creatorcontrib><creatorcontrib>Kastrati, Zenun</creatorcontrib><creatorcontrib>Abdullah</creatorcontrib><creatorcontrib>Batra, Rakhi</creatorcontrib><creatorcontrib>Wani, Mudasir Ahmad</creatorcontrib><title>The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing</title><title>IEEE access</title><addtitle>Access</addtitle><description>Urdu is still considered a low-resource language despite being ranked as world's &lt;inline-formula&gt; &lt;tex-math notation="LaTeX"&gt;10^{th} &lt;/tex-math&gt;&lt;/inline-formula&gt; most spoken language with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to utilize more ingenious techniques to curb the issue. 
One such option widely adopted is to use language translation services to replicate existing datasets from resource-rich languages such as English to low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated via Google translator from one language to another often change the meaning. It results in a polarity shift causing the system's performance degradation, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation on the sentiment classification task from a resource-rich language to a low-resource language. It identifies and enlists words causing polarity shift into five distinct categories. It further finds the correlation between the language with similar roots. Our study shows 2-3 percentage points performance degradation in sentiment classification due to polarity shift as a result of translation from resource-rich languages to low-resource languages.</description><subject>Analytical models</subject><subject>BiLSTM</subject><subject>Classification</subject><subject>Conv1D</subject><subject>Data models</subject><subject>Datasets</subject><subject>English</subject><subject>German</subject><subject>Hindi</subject><subject>Informatik</subject><subject>Information Systems</subject><subject>Internet</subject><subject>Language</subject><subject>Language translation</subject><subject>low resource language</subject><subject>Machine translation</subject><subject>Multilingual text processing</subject><subject>Natural language processing</subject><subject>Performance degradation</subject><subject>Polarity</subject><subject>polarity assessment</subject><subject>Sentiment analysis</subject><subject>sentiment classification</subject><subject>Task analysis</subject><subject>Text processing</subject><subject>Translating</subject><subject>Translation</subject><subject>Translators</subject><subject>Urdu</subject><subject>Urdu 
language</subject><subject>Word processing</subject><subject>Words (language)</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>7T9</sourceid><sourceid>DOA</sourceid><recordid>eNpVkU1v3CAYhK2qlRql-QW5IOXsrQGDzXG1SduVXLVK3F4Rn15WjnEAK82_DxunUcsFGGYe6WWK4hJWGwgr9nm7293c3W1QheAGZwW15F1xhiBlJSaYvv_n_LG4iPFY5dVmiTRnxdwfDNjfz0Il4C3og5jiKJKbBnBrol-CMuWtUwdwLZKIJkWQPOj8Y_n3FXRiGhYxmAj6Q_DLcADflzG5snMnfQS9-ZPAz-CViTFLn4oPVozRXLzu58WvLzf97lvZ_fi63227UtU1TSVVQqjaUi20lqRVGGELW8pkrZFQylJpdR5Yap1vjGHcWk0YZlJqoqxt8XmxX7naiyOfg7sX4Yl74fiL4MPARUhOjYYrSRpha1MjZGtMiBTUtq3BDZYNJoZlVrmy4qOZF_kf7dr93r7QxmnhsKKspdl_tfrn4B8WExM_5r-a8rgckQZR2jQQZxdeXSr4GIOxb1xY8VOzfG2Wn5rlr83m1OWacsaYtwQjiEFW4We0xaGv</recordid><startdate>2021</startdate><enddate>2021</enddate><creator>Ghafoor, Abdul</creator><creator>Imran, Ali Shariq</creator><creator>Daudpota, Sher Muhammad</creator><creator>Kastrati, Zenun</creator><creator>Abdullah</creator><creator>Batra, Rakhi</creator><creator>Wani, Mudasir Ahmad</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>7T9</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>ADTPV</scope><scope>AGRUY</scope><scope>AOWAS</scope><scope>D8T</scope><scope>D92</scope><scope>ZZAVC</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-2416-2878</orcidid><orcidid>https://orcid.org/0000-0001-6684-751X</orcidid><orcidid>https://orcid.org/0000-0002-2319-152X</orcidid><orcidid>https://orcid.org/0000-0003-2176-2509</orcidid><orcidid>https://orcid.org/0000-0002-0199-2377</orcidid></search><sort><creationdate>2021</creationdate><title>The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing</title><author>Ghafoor, Abdul ; Imran, Ali Shariq ; Daudpota, Sher Muhammad ; Kastrati, Zenun ; Abdullah ; Batra, Rakhi ; Wani, Mudasir Ahmad</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Analytical models</topic><topic>BiLSTM</topic><topic>Classification</topic><topic>Conv1D</topic><topic>Data models</topic><topic>Datasets</topic><topic>English</topic><topic>German</topic><topic>Hindi</topic><topic>Informatik</topic><topic>Information Systems</topic><topic>Internet</topic><topic>Language</topic><topic>Language translation</topic><topic>low resource language</topic><topic>Machine translation</topic><topic>Multilingual text processing</topic><topic>Natural language processing</topic><topic>Performance degradation</topic><topic>Polarity</topic><topic>polarity assessment</topic><topic>Sentiment analysis</topic><topic>sentiment 
classification</topic><topic>Task analysis</topic><topic>Text processing</topic><topic>Translating</topic><topic>Translation</topic><topic>Translators</topic><topic>Urdu</topic><topic>Urdu language</topic><topic>Word processing</topic><topic>Words (language)</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ghafoor, Abdul</creatorcontrib><creatorcontrib>Imran, Ali Shariq</creatorcontrib><creatorcontrib>Daudpota, Sher Muhammad</creatorcontrib><creatorcontrib>Kastrati, Zenun</creatorcontrib><creatorcontrib>Abdullah</creatorcontrib><creatorcontrib>Batra, Rakhi</creatorcontrib><creatorcontrib>Wani, Mudasir Ahmad</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library Online</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>SwePub</collection><collection>SWEPUB Linnéuniversitetet full text</collection><collection>SwePub Articles</collection><collection>SWEPUB Freely available online</collection><collection>SWEPUB Linnéuniversitetet</collection><collection>SwePub Articles full text</collection><collection>DOAJ Directory of Open 
Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ghafoor, Abdul</au><au>Imran, Ali Shariq</au><au>Daudpota, Sher Muhammad</au><au>Kastrati, Zenun</au><au>Abdullah</au><au>Batra, Rakhi</au><au>Wani, Mudasir Ahmad</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2021</date><risdate>2021</risdate><volume>9</volume><spage>124478</spage><epage>124490</epage><pages>124478-124490</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Urdu is still considered a low-resource language despite being ranked as world's &lt;inline-formula&gt; &lt;tex-math notation="LaTeX"&gt;10^{th} &lt;/tex-math&gt;&lt;/inline-formula&gt; most spoken language with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to utilize more ingenious techniques to curb the issue. One such option widely adopted is to use language translation services to replicate existing datasets from resource-rich languages such as English to low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated via Google translator from one language to another often change the meaning. It results in a polarity shift causing the system's performance degradation, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation on the sentiment classification task from a resource-rich language to a low-resource language. It identifies and enlists words causing polarity shift into five distinct categories. It further finds the correlation between the language with similar roots. 
Our study shows 2-3 percentage points performance degradation in sentiment classification due to polarity shift as a result of translation from resource-rich languages to low-resource languages.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2021.3110285</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-2416-2878</orcidid><orcidid>https://orcid.org/0000-0001-6684-751X</orcidid><orcidid>https://orcid.org/0000-0002-2319-152X</orcidid><orcidid>https://orcid.org/0000-0003-2176-2509</orcidid><orcidid>https://orcid.org/0000-0002-0199-2377</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2021, Vol.9, p.124478-124490
issn 2169-3536
2169-3536
language eng
recordid cdi_ieee_primary_9529190
source Linguistics and Language Behavior Abstracts (LLBA); IEEE Xplore Open Access Journals
subjects Analytical models
BiLSTM
Classification
Conv1D
Data models
Datasets
English
German
Hindi
Informatik
Information Systems
Internet
Language
Language translation
low resource language
Machine translation
Multilingual text processing
Natural language processing
Performance degradation
Polarity
polarity assessment
Sentiment analysis
sentiment classification
Task analysis
Text processing
Translating
Translation
Translators
Urdu
Urdu language
Word processing
Words (language)
title The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T18%3A41%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Impact%20of%20Translating%20Resource-Rich%20Datasets%20to%20Low-Resource%20Languages%20Through%20Multi-Lingual%20Text%20Processing&rft.jtitle=IEEE%20access&rft.au=Ghafoor,%20Abdul&rft.date=2021&rft.volume=9&rft.spage=124478&rft.epage=124490&rft.pages=124478-124490&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2021.3110285&rft_dat=%3Cproquest_ieee_%3E2572667713%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2572667713&rft_id=info:pmid/&rft_ieee_id=9529190&rfr_iscdi=true