PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution
Based on 29,192,662 HTML files obtained from ClueWeb, a bi-gram language data model for Arabic is constructed. The dataset covers the standard types of bi-gram analysis, with a focus on the root (footnote 1: an Arabic root depicts the basic morpheme of an Arabic word at a higher level of abs...
Published in: | Data in brief 2023-02, Vol.46, p.108875-108875, Article 108875 |
---|---|
Main Authors: | Haddad, Bassam; Awwad, Ahmad; Hattab, Mamoun; Hattab, Ammar |
Format: | Article |
Language: | English |
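Subjects: | Arabic language model; Data; N-gram models; Probabilistic morphology; Root Pattern Analysis; Root-Pattern Classification; Word Cognition |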
cited_by | cdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3 |
---|---|
cites | cdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3 |
container_end_page | 108875 |
container_issue | |
container_start_page | 108875 |
container_title | Data in brief |
container_volume | 46 |
creator | Haddad, Bassam; Awwad, Ahmad; Hattab, Mamoun; Hattab, Ammar |
description | Based on 29,192,662 HTML files obtained from ClueWeb, a bi-gram language data model for Arabic is constructed. The dataset covers the standard types of bi-gram analysis, with a focus on the root-pattern paradigm in Arabic. (Footnote 1: An Arabic root depicts the basic morpheme of an Arabic word at a higher level of abstraction, representing the basic word meaning. A root morpheme consists predominantly of three consonants (radicals), identifies the highest semantic abstraction, and is unchangeable. The variables C1, C2 and C3 represent these radicals. Footnote 2: An Arabic pattern is a templatic shape of consonants (root radicals) and vowels whose order depicts the morpho-phonetic form of a word and carries potential phonetic, syntactic, and semantic data, for example the pattern maC1C2uC3 [3].) Root-pattern distributions in the form of P(root|pattern), P(pattern|root) and P(pattern|pattern) are additionally estimated. Applying Maximum Likelihood Estimation (MLE) at the root-pattern level, as a higher level of abstraction, has been widely neglected in the Arabic research community despite its advantage in reducing ambiguities in Arabic morphological analysis and its impact on the cognitive aspects of Arabic word perception [1]. In the preprocessing phase, the HTML files were converted to 974 unfiltered raw text files with a size of about 180 GB. These files were morphologically analyzed to extract and count the frequencies of patterns, roots, particles, and stems, and in particular root-pattern occurrences. Based on the resulting corpus of around 18,482,719 raw words, a language data model was constructed containing 9,311,246 bi-grams of morphologically analyzed word forms, including around 3.49 million bi-directional P(root|pattern) bi-grams and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering a subset of around 8086 roots with 20413 possible pattern forms.
Because this data model captures the root–pattern phenomenon in Arabic, the data are useful for researchers working on cognitive aspects of Arabic, such as visual word cognition and morpho-phonetic perception, and on morphological analysis, spell-checking, and the resolution of ambiguities in morphological parsing. |
doi_str_mv | 10.1016/j.dib.2022.108875 |
format | article |
addtitle | Data Brief |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/36687158 |
date | 2023-02-01 |
eissn | 2352-3409 |
els_id | S2352340922010782 |
linktohtml | https://www.sciencedirect.com/science/article/pii/S2352340922010782 |
linktopdf | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9852924/pdf/ |
oa | free_for_read |
pmid | 36687158 |
publisher | Netherlands: Elsevier Inc |
rights | 2023 The Author(s) |
fulltext | fulltext |
identifier | ISSN: 2352-3409 |
ispartof | Data in brief, 2023-02, Vol.46, p.108875-108875, Article 108875 |
issn | 2352-3409 2352-3409 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_64fe0b4dd32d4f4db53a12d5365d33d2 |
source | PubMed (Medline); ScienceDirect® |
subjects | Arabic language model; Data; N-gram models; Probabilistic morphology; Root Pattern Analysis; Root-Pattern Classification; Word Cognition |
title | PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T19%3A36%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PRo-Pat:%20Probabilistic%20Root%E2%80%93Pattern%20Bi-gram%20data%20language%20model%20for%20Arabic%20based%20morphological%20analysis%20and%20distribution&rft.jtitle=Data%20in%20brief&rft.au=Haddad,%20Bassam&rft.date=2023-02-01&rft.volume=46&rft.spage=108875&rft.epage=108875&rft.pages=108875-108875&rft.artnum=108875&rft.issn=2352-3409&rft.eissn=2352-3409&rft_id=info:doi/10.1016/j.dib.2022.108875&rft_dat=%3Cproquest_doaj_%3E2768810271%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c517t-f0ed8d58400f43dfb3e1e3187d685fe489f682490d3a38063cb7d76d66a7601c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2768810271&rft_id=info:pmid/36687158&rfr_iscdi=true |
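
The description above mentions Maximum Likelihood Estimation of P(root|pattern) and P(pattern|root). As a rough illustration only, the sketch below computes such estimates as relative frequencies over (root, pattern) pairs. It is not the authors' pipeline: the function name `mle_conditionals`, the transliterated roots, the pattern label `C1aaC2iC3`, and the toy counts are all assumptions made for this example; only the pattern `maC1C2uC3` appears in the record.

```python
from collections import Counter


def mle_conditionals(pairs):
    """Relative-frequency (MLE) estimates from a list of (root, pattern) pairs.

    Returns two dicts keyed by (root, pattern):
      P(root | pattern)  = count(root, pattern) / count(pattern)
      P(pattern | root)  = count(root, pattern) / count(root)
    """
    pair_counts = Counter(pairs)
    root_counts = Counter(root for root, _ in pairs)
    pattern_counts = Counter(pattern for _, pattern in pairs)

    p_root_given_pattern = {
        (r, p): c / pattern_counts[p] for (r, p), c in pair_counts.items()
    }
    p_pattern_given_root = {
        (r, p): c / root_counts[r] for (r, p), c in pair_counts.items()
    }
    return p_root_given_pattern, p_pattern_given_root


# Toy, made-up analyses (hypothetical data, not taken from the dataset).
toy_pairs = [
    ("ktb", "maC1C2uC3"),
    ("ktb", "maC1C2uC3"),
    ("ktb", "C1aaC2iC3"),
    ("drs", "maC1C2uC3"),
]

p_rp, p_pr = mle_conditionals(toy_pairs)
print(p_rp[("ktb", "maC1C2uC3")])  # share of maC1C2uC3 tokens with root ktb
print(p_pr[("drs", "maC1C2uC3")])  # share of drs tokens with pattern maC1C2uC3
```

The same counting scheme would extend to P(pattern|pattern) by pairing each word's pattern with the pattern of the preceding word; smoothing for unseen pairs is left out here.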