Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification
This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries...
Published in: | Baltic Journal of Modern Computing 2016-01, Vol.4 (2), p.115-115 |
---|---|
Main Author: | Babych, Bogdan |
Format: | Article |
Language: | English |
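Subjects: | Alphabets; Automation; Graphical representations; Hierarchies; Languages; Mathematical analysis; Strings; Tasks |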
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | 115 |
container_issue | 2 |
container_start_page | 115 |
container_title | Baltic Journal of Modern Computing |
container_volume | 4 |
creator | Babych, Bogdan |
description | This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, which can be used for finding translation equivalents for the 'long tail' in Zipfian distribution: low-frequency and usually unambiguous lexical items in closely related languages (many of which are under-resourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as posts on social media, blogs and comments.
Software for calculating the modified feature-based Levenshtein distance, and the corresponding graphonological feature representations (vectors and the hierarchies of graphemes' features), are released on the author's webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages. |
format | article |
fullrecord | Publisher: Riga: University of Latvia. Rights: Copyright University of Latvia 2016. Peer reviewed; free to read. ISSN: 2255-8942; EISSN: 2255-8950. Source: ProQuest (record ID 1835574801). |
fulltext | fulltext |
identifier | ISSN: 2255-8942 |
ispartof | Baltic Journal of Modern Computing, 2016-01, Vol.4 (2), p.115-115 |
issn | 2255-8942, 2255-8950 |
language | eng |
recordid | cdi_proquest_miscellaneous_1835574801 |
source | Publicly Available Content Database (Proquest) (PQ_SDU_P3) |
subjects | Alphabets; Automation; Graphical representations; Hierarchies; Languages; Mathematical analysis; Strings; Tasks |
title | Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T10%3A13%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Graphonological%20Levenshtein%20Edit%20Distance:%20Application%20for%20Automated%20Cognate%20Identification&rft.jtitle=Baltic%20Journal%20of%20Modern%20Computing&rft.au=Babych,%20Bogdan&rft.date=2016-01-01&rft.volume=4&rft.issue=2&rft.spage=115&rft.epage=115&rft.pages=115-115&rft.issn=2255-8942&rft.eissn=2255-8950&rft_id=info:doi/&rft_dat=%3Cproquest%3E1835574801%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-p146t-1aba73c14f5311ff21ba8de07635b0755ce27ad6ce26adb9c8052afe83da3f4d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1812275801&rft_id=info:pmid/&rfr_iscdi=true |
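
The description above says that substitution costs are derived from a hierarchy of grapheme features, and that lower-level features are compared only when the higher-level features match. The following is a minimal sketch of that idea, not the author's released software. The feature table, the two-level hierarchy (class, then a feature set), the Jaccard-style cost and the names `FEATURES`, `graphonological_cost`, `unit_cost` and `edit_distance` are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical toy feature table (NOT the released feature set):
# grapheme -> (higher-level class, lower-level features).
FEATURES: Dict[str, Tuple[str, FrozenSet[str]]] = {
    "a": ("vowel", frozenset({"open", "back"})),
    "o": ("vowel", frozenset({"mid", "back", "rounded"})),
    "b": ("consonant", frozenset({"labial", "plosive", "voiced"})),
    "p": ("consonant", frozenset({"labial", "plosive", "voiceless"})),
    "d": ("consonant", frozenset({"alveolar", "plosive", "voiced"})),
    "t": ("consonant", frozenset({"alveolar", "plosive", "voiceless"})),
    # Cyrillic graphemes share the features of their Latin counterparts.
    "\u0434": ("consonant", frozenset({"alveolar", "plosive", "voiced"})),  # Cyrillic de
    "\u0430": ("vowel", frozenset({"open", "back"})),                       # Cyrillic a
}


def unit_cost(a: str, b: str) -> float:
    """Baseline Levenshtein substitution cost: 0 for identical characters, else 1."""
    return 0.0 if a == b else 1.0


def graphonological_cost(a: str, b: str) -> float:
    """Assumed feature-based cost in [0, 1] for substituting grapheme a with b."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown grapheme: fall back to the baseline cost
    if fa[0] != fb[0]:
        return 1.0  # higher-level mismatch: lower-level features are not compared
    # Same class: share of features that differ (Jaccard distance).
    return len(fa[1] ^ fb[1]) / len(fa[1] | fb[1])


def edit_distance(s: str, t: str,
                  sub_cost: Callable[[str, str], float] = graphonological_cost) -> float:
    """Levenshtein dynamic programme with a pluggable substitution cost.

    Insertions and deletions cost 1; only substitutions use sub_cost.
    """
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                  # delete a
                           cur[j - 1] + 1.0,               # insert b
                           prev[j - 1] + sub_cost(a, b)))  # substitute a -> b
        prev = cur
    return prev[-1]


print(edit_distance("bat", "pat"))             # feature-based cost
print(edit_distance("bat", "pat", unit_cost))  # baseline Levenshtein
```

Under these assumptions, `b` and `p` differ only in voicing, so the pair costs less than a full substitution, while a consonant replaced by a vowel still costs 1. Because the toy table assigns Cyrillic graphemes the same features as their Latin counterparts, strings in different alphabets can be compared without a transcription step, which is the property the description highlights.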