Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods

Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction...

Bibliographic Details
Published in: ACM transactions on Asian and low-resource language information processing, 2023-06, Vol.22 (6), p.1-32, Article 175
Main Authors: Shafi, Jawad; Adeel Nawab, Rao Muhammad; Rayson, Paul
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-a239t-ac7ad09d7940b365f323572c4410b08f75d806ba5c79e7413c728551a045c6763
container_end_page 32
container_issue 6
container_start_page 1
container_title ACM transactions on Asian and low-resource language information processing
container_volume 22
creator Shafi, Jawad
Adeel Nawab, Rao Muhammad
Rayson, Paul
description Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.
doi_str_mv 10.1145/3582496
format article
fullrecord <record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3582496</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3582496</sourcerecordid><originalsourceid>FETCH-LOGICAL-a239t-ac7ad09d7940b365f323572c4410b08f75d806ba5c79e7413c728551a045c6763</originalsourceid><addsrcrecordid>eNo9kDtPwzAYRS0EElWp2Jm8MQX8dsxWRbykVgy0c_hqO6lR61S2M_DvAbUw3SvdozschK4puaNUyHsuayaMOkMTxrWshCbs_K8rYy7RLOdPQggVWilCJ-jj3e8hlmDxCvo-xB53Q8Jl6_E6uREvIPYj9P4Bz2McChTvcDOkw5gxRIeX466EagWp9wU3O8g5dMFCCUPES1-2g8tX6KKDXfazU07R-ulx1bxUi7fn12a-qIBxUyqwGhwxThtBNlzJjjMuNbNCULIhdaelq4nagLTaeC0ot5rVUlIgQlqlFZ-i2-OvTUPOyXftIYU9pK-WkvbXTXty80PeHEmw-3_ob_wG7xpdRA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods</title><source>Association for Computing Machinery:Jisc Collections:ACM OPEN Journals 2023-2025 (reading list)</source><creator>Shafi, Jawad ; Adeel Nawab, Rao Muhammad ; Rayson, Paul</creator><creatorcontrib>Shafi, Jawad ; Adeel Nawab, Rao Muhammad ; Rayson, Paul</creatorcontrib><description>Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. 
Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. 
Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.</description><identifier>ISSN: 2375-4699</identifier><identifier>EISSN: 2375-4702</identifier><identifier>DOI: 10.1145/3582496</identifier><language>eng</language><publisher>New York, NY: ACM</publisher><subject>Computing methodologies ; Machine learning approaches ; Natural language processing</subject><ispartof>ACM transactions on Asian and low-resource language information processing, 2023-06, Vol.22 (6), p.1-32, Article 175</ispartof><rights>Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a239t-ac7ad09d7940b365f323572c4410b08f75d806ba5c79e7413c728551a045c6763</cites><orcidid>0000-0002-1765-8904 ; 0000-0001-6427-3823 ; 0000-0002-1257-2191</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Shafi, Jawad</creatorcontrib><creatorcontrib>Adeel Nawab, Rao Muhammad</creatorcontrib><creatorcontrib>Rayson, Paul</creatorcontrib><title>Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods</title><title>ACM transactions on Asian and low-resource language information processing</title><addtitle>ACM TALLIP</addtitle><description>Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers 
in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.</description><subject>Computing methodologies</subject><subject>Machine learning approaches</subject><subject>Natural language processing</subject><issn>2375-4699</issn><issn>2375-4702</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNo9kDtPwzAYRS0EElWp2Jm8MQX8dsxWRbykVgy0c_hqO6lR61S2M_DvAbUw3SvdozschK4puaNUyHsuayaMOkMTxrWshCbs_K8rYy7RLOdPQggVWilCJ-jj3e8hlmDxCvo-xB53Q8Jl6_E6uREvIPYj9P4Bz2McChTvcDOkw5gxRIeX466EagWp9wU3O8g5dMFCCUPES1-2g8tX6KKDXfazU07R-ulx1bxUi7fn12a-qIBxUyqwGhwxThtBNlzJjjMuNbNCULIhdaelq4nagLTaeC0ot5rVUlIgQlqlFZ-i2-OvTUPOyXftIYU9pK-WkvbXTXty80PeHEmw-3_ob_wG7xpdRA</recordid><startdate>20230617</startdate><enddate>20230617</enddate><creator>Shafi, Jawad</creator><creator>Adeel Nawab, Rao Muhammad</creator><creator>Rayson, Paul</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-1765-8904</orcidid><orcidid>https://orcid.org/0000-0001-6427-3823</orcidid><orcidid>https://orcid.org/0000-0002-1257-2191</orcidid></search><sort><creationdate>20230617</creationdate><title>Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods</title><author>Shafi, Jawad ; Adeel Nawab, Rao Muhammad ; Rayson, Paul</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a239t-ac7ad09d7940b365f323572c4410b08f75d806ba5c79e7413c728551a045c6763</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computing 
methodologies</topic><topic>Machine learning approaches</topic><topic>Natural language processing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Shafi, Jawad</creatorcontrib><creatorcontrib>Adeel Nawab, Rao Muhammad</creatorcontrib><creatorcontrib>Rayson, Paul</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on Asian and low-resource language information processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shafi, Jawad</au><au>Adeel Nawab, Rao Muhammad</au><au>Rayson, Paul</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods</atitle><jtitle>ACM transactions on Asian and low-resource language information processing</jtitle><stitle>ACM TALLIP</stitle><date>2023-06-17</date><risdate>2023</risdate><volume>22</volume><issue>6</issue><spage>1</spage><epage>32</epage><pages>1-32</pages><artnum>175</artnum><issn>2375-4699</issn><eissn>2375-4702</eissn><abstract>Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. 
Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.</abstract><cop>New York, NY</cop><pub>ACM</pub><doi>10.1145/3582496</doi><tpages>32</tpages><orcidid>https://orcid.org/0000-0002-1765-8904</orcidid><orcidid>https://orcid.org/0000-0001-6427-3823</orcidid><orcidid>https://orcid.org/0000-0002-1257-2191</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2375-4699
ispartof ACM transactions on Asian and low-resource language information processing, 2023-06, Vol.22 (6), p.1-32, Article 175
issn 2375-4699
2375-4702
language eng
recordid cdi_crossref_primary_10_1145_3582496
source Association for Computing Machinery:Jisc Collections:ACM OPEN Journals 2023-2025 (reading list)
subjects Computing methodologies
Machine learning approaches
Natural language processing
title Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A55%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Semantic%20Tagging%20for%20the%20Urdu%20Language:%20Annotated%20Corpus%20and%20Multi-Target%20Classification%20Methods&rft.jtitle=ACM%20transactions%20on%20Asian%20and%20low-resource%20language%20information%20processing&rft.au=Shafi,%20Jawad&rft.date=2023-06-17&rft.volume=22&rft.issue=6&rft.spage=1&rft.epage=32&rft.pages=1-32&rft.artnum=175&rft.issn=2375-4699&rft.eissn=2375-4702&rft_id=info:doi/10.1145/3582496&rft_dat=%3Cacm_cross%3E3582496%3C/acm_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a239t-ac7ad09d7940b365f323572c4410b08f75d806ba5c79e7413c728551a045c6763%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true