PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution

Based on 29,192,662 HTML files obtained from ClueWeb, a bi-gram language data model for Arabic is constructed. The dataset covers the standard types of bi-gram analysis, with a focus on the root-pattern paradigm in Arabic...

Bibliographic Details
Published in: Data in brief 2023-02, Vol.46, p.108875-108875, Article 108875
Main Authors: Haddad, Bassam; Awwad, Ahmad; Hattab, Mamoun; Hattab, Ammar
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3
cites cdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3
container_end_page 108875
container_issue
container_start_page 108875
container_title Data in brief
container_volume 46
creator Haddad, Bassam
Awwad, Ahmad
Hattab, Mamoun
Hattab, Ammar
description Based on 29,192,662 HTML files obtained from ClueWeb, a bi-gram language data model for Arabic is constructed. The dataset covers the standard types of bi-gram analysis, with a focus on the root-pattern paradigm in Arabic. An Arabic root is the basic morpheme of an Arabic word at a higher level of abstraction and represents the basic word meaning. A root morpheme consists predominantly of three consonants (radicals) identifying the highest semantic abstraction, and is unchangeable. The variables C1, C2 and C3 represent these radicals. An Arabic pattern is a templatic arrangement of consonants (root radicals) and vowels that depicts the morpho-phonetic form of a word and carries potential phonetic, syntactic, and semantic information, for example the pattern maC1C2uC3 [3]. Root-pattern distributions in the form of P(root|pattern), P(pattern|root) and P(pattern|pattern) are additionally estimated. Applying Maximum Likelihood Estimation (MLE) at the root-pattern level, as a higher level of abstraction, has been widely neglected in the Arabic research community despite its advantage in reducing ambiguities in Arabic morphological analysis and its impact on cognitive aspects of Arabic word perception [1]. In the preprocessing phase, the HTML files were converted to 974 unfiltered raw text files with a total size of about 180 GB. These files were morphologically analyzed to extract and count the frequencies of patterns, roots, particles, and stems, and particularly of root-pattern occurrences. Based on the resulting corpus of around 18,482,719 raw words, a language data model was constructed containing 9,311,246 bi-grams of morphologically analyzed word forms, including around 3.49 million bi-directional P(root|pattern) and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering a subset of around 8086 roots with 20413 possible pattern forms.
Because this data model captures the root-pattern phenomenon in Arabic, the data are useful for researchers working on cognitive aspects of Arabic such as visual word cognition, morpho-phonetic perception, morphological analysis, spell-checking, and resolving ambiguities in morphological parsing.
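The conditional distributions named in the description, P(root|pattern) and P(pattern|root), can be illustrated with a minimal sketch of maximum likelihood estimation by relative frequency. This is not the authors' implementation: the function name `mle_conditionals`, the transliterated roots, and the toy observations are hypothetical and chosen only for illustration; only the pattern notation (C1, C2, C3) follows the description.

```python
from collections import Counter


def mle_conditionals(pairs):
    """Estimate P(root|pattern) and P(pattern|root) by relative frequency (MLE).

    `pairs` is an iterable of (root, pattern) observations, one per analyzed word.
    Both results are dicts keyed by (root, pattern).
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)                     # count(root, pattern)
    root_counts = Counter(r for r, _ in pairs)       # count(root)
    pattern_counts = Counter(p for _, p in pairs)    # count(pattern)

    # P(root|pattern) = count(root, pattern) / count(pattern)
    p_root_given_pattern = {
        (r, p): c / pattern_counts[p] for (r, p), c in pair_counts.items()
    }
    # P(pattern|root) = count(root, pattern) / count(root)
    p_pattern_given_root = {
        (r, p): c / root_counts[r] for (r, p), c in pair_counts.items()
    }
    return p_root_given_pattern, p_pattern_given_root


# Hypothetical toy observations; these are NOT taken from the PRo-Pat dataset.
observations = [
    ("ktb", "maC1C2uC3"),
    ("ktb", "C1aC2aC3a"),
    ("ktb", "C1aC2aC3a"),
    ("drs", "C1aC2aC3a"),
]

p_rp, p_pr = mle_conditionals(observations)
# "ktb" accounts for 2 of the 3 occurrences of the pattern C1aC2aC3a.
print(p_rp[("ktb", "C1aC2aC3a")])
# The pattern maC1C2uC3 accounts for 1 of the 3 occurrences of the root "ktb".
print(p_pr[("ktb", "maC1C2uC3")])
```

For each fixed pattern the P(root|pattern) values sum to 1, and likewise P(pattern|root) for each fixed root, which is a quick sanity check on any table of such estimates.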
doi_str_mv 10.1016/j.dib.2022.108875
format article
pmid 36687158
publisher Netherlands: Elsevier Inc
rights 2023 The Author(s)
linktopdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9852924/pdf/
linktohtml https://www.sciencedirect.com/science/article/pii/S2352340922010782
backlink https://www.ncbi.nlm.nih.gov/pubmed/36687158
fulltext fulltext
identifier ISSN: 2352-3409
ispartof Data in brief, 2023-02, Vol.46, p.108875-108875, Article 108875
issn 2352-3409
2352-3409
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_64fe0b4dd32d4f4db53a12d5365d33d2
source PubMed (Medline); ScienceDirect®
subjects Arabic language model
Data
N-gram models
Probabilistic morphology
Root Pattern Analysis
Root-Pattern Classification
Word Cognition
title PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T19%3A36%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PRo-Pat:%20Probabilistic%20Root%E2%80%93Pattern%20Bi-gram%20data%20language%20model%20for%20Arabic%20based%20morphological%20analysis%20and%20distribution&rft.jtitle=Data%20in%20brief&rft.au=Haddad,%20Bassam&rft.date=2023-02-01&rft.volume=46&rft.spage=108875&rft.epage=108875&rft.pages=108875-108875&rft.artnum=108875&rft.issn=2352-3409&rft.eissn=2352-3409&rft_id=info:doi/10.1016/j.dib.2022.108875&rft_dat=%3Cproquest_doaj_%3E2768810271%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2768810271&rft_id=info:pmid/36687158&rfr_iscdi=true