CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus

In this paper, we introduce an efficient framework for the compilation of an Urdu corpus along with ground truth and transcription in Unicode format. A novel scheme of the annotation based on four-level XML has been incorporated for the corpus CALAM. In addition to compilation and benchmarking test,...

Bibliographic Details
Published in: Sadhana (Bangalore) 2020-12, Vol.45 (1), Article 20
Main Authors: Choudhary, Prakash; Nain, Neeta
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c268t-5e542bb87cf7b0233fb28df6c03643cba58050fafebfaba22f183a26ca9ac3f33
container_end_page
container_issue 1
container_start_page
container_title Sadhana (Bangalore)
container_volume 45
creator Choudhary, Prakash
Nain, Neeta
description In this paper, we introduce an efficient framework for compiling an Urdu corpus along with its ground truth and transcription in Unicode format. A novel annotation scheme based on four-level XML has been incorporated for the corpus, CALAM. In addition to compilation and benchmarking tests, the framework generates category-wise word frequency distributions that are useful for linguistic evaluation. This paper presents a statistical analysis of the corpus data based on the transcribed text and frequency of occurrence. The analysis uses key statistics such as the rank of words, the frequency of words, ligature length (the number of ligatures formed from combinations of two to seven characters), and the entropy and perplexity of the corpus. Beyond these basic statistics, some additional statistical features are also evaluated, such as Zipf’s law and the dispersion of corpus information. The experimental results obtained from the statistical analysis are presented to establish the viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.
doi_str_mv 10.1007/s12046-019-1237-3
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2343581756</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2343581756</sourcerecordid><originalsourceid>FETCH-LOGICAL-c268t-5e542bb87cf7b0233fb28df6c03643cba58050fafebfaba22f183a26ca9ac3f33</originalsourceid><addsrcrecordid>eNp1kE1LxDAQhoMouK7-AG8Fz9F8NEnrbVn8ghUR3HNI0mTp0i8z7WH_vSkVPHmaYeZ93xkehG4puaeEqAegjOQSE1piyrjC_AytSKk4VlKp89QzITHLy_ISXQEcCWGKFHyFPreb3eb9MWv7yjfYGvBV5vp2qBsz1n2Xma7Kmro7TDWMtctgTOO5M01ameYENWR9yPaxmpIvDhNco4tgGvA3v3WN9s9PX9tXvPt4eUvXsGOyGLHwImfWFsoFZQnjPFhWVEE6wmXOnTWiIIIEE7wNxhrGAi24YdKZ0jgeOF-juyV3iP335GHUx36K6SfQjOdcFFQJmVR0UbnYA0Qf9BDr1sSTpkTP5PRCTidyeian52S2eCBpu4OPf8n_m34AwkhxhQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2343581756</pqid></control><display><type>article</type><title>CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus</title><source>Springer Nature</source><creator>Choudhary, Prakash ; Nain, Neeta</creator><creatorcontrib>Choudhary, Prakash ; Nain, Neeta</creatorcontrib><description>In this paper, we introduce an efficient framework for the compilation of an Urdu corpus along with ground truth and transcription in Unicode format. A novel scheme of the annotation based on four-level XML has been incorporated for the corpus CALAM. In addition to compilation and benchmarking test, the framework generates the word frequency distribution according to category sapient useful for linguistic evaluation. This paper presents the statistical analysis with corpus data based on transcript text and frequency of occurrences. The observation of statistical analysis is conducted using vital statistics like rank of words, the frequency of words, ligatures length (number of ligatures with combination of two to seven characters), entropy and perplexity of the corpus. 
Besides rudimental statistics coverage, some additional statistical features are also evaluated like Zipf’s linguistic rule and measurement of dispersion in corpus information. The experimental results obtained from statistical observation are presented for asserting viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.</description><identifier>ISSN: 0256-2499</identifier><identifier>EISSN: 0973-7677</identifier><identifier>DOI: 10.1007/s12046-019-1237-3</identifier><language>eng</language><publisher>New Delhi: Springer India</publisher><subject>Annotations ; Engineering ; Evaluation ; Frequency distribution ; Ground truth ; Linguistics ; Quantitative analysis ; Statistical analysis ; Viability ; Words (language)</subject><ispartof>Sadhana (Bangalore), 2020-12, Vol.45 (1), Article 20</ispartof><rights>Indian Academy of Sciences 2020</rights><rights>Indian Academy of Sciences 2020.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c268t-5e542bb87cf7b0233fb28df6c03643cba58050fafebfaba22f183a26ca9ac3f33</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Choudhary, Prakash</creatorcontrib><creatorcontrib>Nain, Neeta</creatorcontrib><title>CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus</title><title>Sadhana (Bangalore)</title><addtitle>Sādhanā</addtitle><description>In this paper, we introduce an efficient framework for the compilation of an Urdu corpus along with ground truth and transcription in Unicode format. A novel scheme of the annotation based on four-level XML has been incorporated for the corpus CALAM. 
In addition to compilation and benchmarking test, the framework generates the word frequency distribution according to category sapient useful for linguistic evaluation. This paper presents the statistical analysis with corpus data based on transcript text and frequency of occurrences. The observation of statistical analysis is conducted using vital statistics like rank of words, the frequency of words, ligatures length (number of ligatures with combination of two to seven characters), entropy and perplexity of the corpus. Besides rudimental statistics coverage, some additional statistical features are also evaluated like Zipf’s linguistic rule and measurement of dispersion in corpus information. The experimental results obtained from statistical observation are presented for asserting viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.</description><subject>Annotations</subject><subject>Engineering</subject><subject>Evaluation</subject><subject>Frequency distribution</subject><subject>Ground truth</subject><subject>Linguistics</subject><subject>Quantitative analysis</subject><subject>Statistical analysis</subject><subject>Viability</subject><subject>Words (language)</subject><issn>0256-2499</issn><issn>0973-7677</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNp1kE1LxDAQhoMouK7-AG8Fz9F8NEnrbVn8ghUR3HNI0mTp0i8z7WH_vSkVPHmaYeZ93xkehG4puaeEqAegjOQSE1piyrjC_AytSKk4VlKp89QzITHLy_ISXQEcCWGKFHyFPreb3eb9MWv7yjfYGvBV5vp2qBsz1n2Xma7Kmro7TDWMtctgTOO5M01ameYENWR9yPaxmpIvDhNco4tgGvA3v3WN9s9PX9tXvPt4eUvXsGOyGLHwImfWFsoFZQnjPFhWVEE6wmXOnTWiIIIEE7wNxhrGAi24YdKZ0jgeOF-juyV3iP335GHUx36K6SfQjOdcFFQJmVR0UbnYA0Qf9BDr1sSTpkTP5PRCTidyeian52S2eCBpu4OPf8n_m34AwkhxhQ</recordid><startdate>20201201</startdate><enddate>20201201</enddate><creator>Choudhary, Prakash</creator><creator>Nain, Neeta</creator><general>Springer 
India</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20201201</creationdate><title>CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus</title><author>Choudhary, Prakash ; Nain, Neeta</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c268t-5e542bb87cf7b0233fb28df6c03643cba58050fafebfaba22f183a26ca9ac3f33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Annotations</topic><topic>Engineering</topic><topic>Evaluation</topic><topic>Frequency distribution</topic><topic>Ground truth</topic><topic>Linguistics</topic><topic>Quantitative analysis</topic><topic>Statistical analysis</topic><topic>Viability</topic><topic>Words (language)</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Choudhary, Prakash</creatorcontrib><creatorcontrib>Nain, Neeta</creatorcontrib><collection>CrossRef</collection><jtitle>Sadhana (Bangalore)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Choudhary, Prakash</au><au>Nain, Neeta</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus</atitle><jtitle>Sadhana (Bangalore)</jtitle><stitle>Sādhanā</stitle><date>2020-12-01</date><risdate>2020</risdate><volume>45</volume><issue>1</issue><artnum>20</artnum><issn>0256-2499</issn><eissn>0973-7677</eissn><abstract>In this paper, we introduce an efficient framework for the compilation of an Urdu corpus along with ground truth and transcription in Unicode format. A novel scheme of the annotation based on four-level XML has been incorporated for the corpus CALAM. 
In addition to compilation and benchmarking test, the framework generates the word frequency distribution according to category sapient useful for linguistic evaluation. This paper presents the statistical analysis with corpus data based on transcript text and frequency of occurrences. The observation of statistical analysis is conducted using vital statistics like rank of words, the frequency of words, ligatures length (number of ligatures with combination of two to seven characters), entropy and perplexity of the corpus. Besides rudimental statistics coverage, some additional statistical features are also evaluated like Zipf’s linguistic rule and measurement of dispersion in corpus information. The experimental results obtained from statistical observation are presented for asserting viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.</abstract><cop>New Delhi</cop><pub>Springer India</pub><doi>10.1007/s12046-019-1237-3</doi></addata></record>
fulltext fulltext
identifier ISSN: 0256-2499
ispartof Sadhana (Bangalore), 2020-12, Vol.45 (1), Article 20
issn 0256-2499
0973-7677
language eng
recordid cdi_proquest_journals_2343581756
source Springer Nature
subjects Annotations
Engineering
Evaluation
Frequency distribution
Ground truth
Linguistics
Quantitative analysis
Statistical analysis
Viability
Words (language)
title CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T20%3A12%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CALAM:%20model-based%20compilation%20and%20linguistic%20statistical%20analysis%20of%20Urdu%20corpus&rft.jtitle=Sadhana%20(Bangalore)&rft.au=Choudhary,%20Prakash&rft.date=2020-12-01&rft.volume=45&rft.issue=1&rft.artnum=20&rft.issn=0256-2499&rft.eissn=0973-7677&rft_id=info:doi/10.1007/s12046-019-1237-3&rft_dat=%3Cproquest_cross%3E2343581756%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c268t-5e542bb87cf7b0233fb28df6c03643cba58050fafebfaba22f183a26ca9ac3f33%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2343581756&rft_id=info:pmid/&rfr_iscdi=true