The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing
Urdu is still considered a low-resource language despite being ranked as the world's 10th most spoken language with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to utilize more ingenious techniques to curb the issue. One such opt...
Published in: | IEEE access 2021, Vol.9, p.124478-124490 |
---|---|
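Main Authors: | Ghafoor, Abdul; Imran, Ali Shariq; Daudpota, Sher Muhammad; Kastrati, Zenun; Abdullah; Batra, Rakhi; Wani, Mudasir Ahmad |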
Format: | Article |
Language: | English |
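Subjects: | Analytical models; BiLSTM; Classification; Conv1D; Data models; Datasets; English; German; Hindi; Informatik; Information Systems; Internet; Language; Language translation; low resource language; Machine translation; Multilingual text processing; Natural language processing; Performance degradation; Polarity; polarity assessment; Sentiment analysis; sentiment classification; Task analysis; Text processing; Translating; Translation; Translators; Urdu; Urdu language; Word processing; Words (language) |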
cited_by | cdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83 |
---|---|
cites | cdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83 |
container_end_page | 124490 |
container_issue | |
container_start_page | 124478 |
container_title | IEEE access |
container_volume | 9 |
creator | Ghafoor, Abdul; Imran, Ali Shariq; Daudpota, Sher Muhammad; Kastrati, Zenun; Abdullah; Batra, Rakhi; Wani, Mudasir Ahmad |
description | Urdu is still considered a low-resource language despite being ranked as the world's 10th most spoken language with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to utilize more ingenious techniques to curb the issue. One widely adopted option is to use language translation services to replicate existing datasets from resource-rich languages, such as English, in low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated via Google translator from one language to another often change their meaning. This results in a polarity shift that degrades the system's performance, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation from a resource-rich language to a low-resource language on the sentiment classification task. It identifies the words causing polarity shift and groups them into five distinct categories. It further examines the correlation between languages with similar roots. Our study shows a performance degradation of 2-3 percentage points in sentiment classification due to polarity shift as a result of translation from resource-rich languages to low-resource languages. |
doi_str_mv | 10.1109/ACCESS.2021.3110285 |
format | article |
publisher | Piscataway: IEEE |
rights | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
coden | IAECCG |
eissn | 2169-3536 |
tpages | 13 |
orcid | 0000-0002-2416-2878; 0000-0001-6684-751X; 0000-0002-2319-152X; 0000-0003-2176-2509; 0000-0002-0199-2377 |
oa | free_for_read |
toplevel | peer_reviewed; online_resources |
ieee_id | 9529190 |
doaj_id | oai_doaj_org_article_cb57af4e422f4355ba6f88e373b735e9 |
pqid | 2572667713 |
linktohtml | https://ieeexplore.ieee.org/document/9529190 |
backlink | https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-106986 (Swedish Publication Index) |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2021, Vol.9, p.124478-124490 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_ieee_primary_9529190 |
source | Linguistics and Language Behavior Abstracts (LLBA); IEEE Xplore Open Access Journals |
subjects | Analytical models; BiLSTM; Classification; Conv1D; Data models; Datasets; English; German; Hindi; Informatik; Information Systems; Internet; Language; Language translation; low resource language; Machine translation; Multilingual text processing; Natural language processing; Performance degradation; Polarity; polarity assessment; Sentiment analysis; sentiment classification; Task analysis; Text processing; Translating; Translation; Translators; Urdu; Urdu language; Word processing; Words (language) |
title | The Impact of Translating Resource-Rich Datasets to Low-Resource Languages Through Multi-Lingual Text Processing |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T18%3A41%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Impact%20of%20Translating%20Resource-Rich%20Datasets%20to%20Low-Resource%20Languages%20Through%20Multi-Lingual%20Text%20Processing&rft.jtitle=IEEE%20access&rft.au=Ghafoor,%20Abdul&rft.date=2021&rft.volume=9&rft.spage=124478&rft.epage=124490&rft.pages=124478-124490&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2021.3110285&rft_dat=%3Cproquest_ieee_%3E2572667713%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c446t-6caac4f6daddb58c323f1869b4d2accf6bfd021bddacc99338fd5939bbd5cff83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2572667713&rft_id=info:pmid/&rft_ieee_id=9529190&rfr_iscdi=true |