Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification

This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries...

Bibliographic Details
Published in: Baltic Journal of Modern Computing 2016-01, Vol.4 (2), p.115-115
Main Author: Babych, Bogdan
Format: Article
Language: English
container_end_page 115
container_issue 2
container_start_page 115
container_title Baltic Journal of Modern Computing
container_volume 4
creator Babych, Bogdan
description This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, which can be used for finding translation equivalents for the 'long tail' in the Zipfian distribution: low-frequency and usually unambiguous lexical items in closely related languages (many of which are under-resourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as posts on social media, blogs and comments.
Software for calculating the modified feature-based Levenshtein distance, and the corresponding graphonological feature representations (vectors and the hierarchies of graphemes' features), are released on the author's webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
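The idea in the description above can be illustrated with a minimal Python sketch of a Levenshtein distance whose substitution cost depends on a two-level feature hierarchy. This is only an assumption-laden illustration, not the author's released software: the `FEATURES` table, the function names `substitution_cost` and `feature_levenshtein`, the tiny feature inventory, and the cost scheme (1 minus the share of matching lower-level features) are all hypothetical.

```python
# Hypothetical two-level feature table: (higher-level type, lower-level features).
# The entries are illustrative assumptions, not the feature set released by the author.
FEATURES = {
    "b": ("consonant", {"labial", "plosive", "voiced"}),
    "p": ("consonant", {"labial", "plosive", "voiceless"}),
    "d": ("consonant", {"alveolar", "plosive", "voiced"}),
    "t": ("consonant", {"alveolar", "plosive", "voiceless"}),
    "a": ("vowel", {"open", "central", "unrounded"}),
    "o": ("vowel", {"mid", "back", "rounded"}),
}


def substitution_cost(x, y):
    """Cost in [0, 1] of replacing grapheme x with grapheme y."""
    if x == y:
        return 0.0
    if x not in FEATURES or y not in FEATURES:
        return 1.0  # unknown grapheme: fall back to the plain Levenshtein cost
    type_x, low_x = FEATURES[x]
    type_y, low_y = FEATURES[y]
    if type_x != type_y:
        # Higher-level features differ, so lower-level features are not compared.
        return 1.0
    shared = len(low_x & low_y)
    total = len(low_x | low_y)
    return 1.0 - shared / total


def feature_levenshtein(s, t):
    """Levenshtein distance with feature-weighted substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # deletion
                curr[j - 1] + 1.0,                        # insertion
                prev[j - 1] + substitution_cost(cs, ct),  # substitution
            ))
        prev = curr
    return prev[-1]
```

Under these assumed features, `feature_levenshtein("bad", "pat")` is 1.0 (two voicing-only substitutions at 0.5 each), whereas the traditional Levenshtein distance for the same pair is 2. A consonant-for-vowel substitution such as "b" for "a" still costs the full 1.0, because the mismatch at the higher level blocks any credit for lower-level features.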
format article
linktohtml https://www.proquest.com/docview/1812275801?pq-origsite=primo
oa free_for_read
publisher Riga: University of Latvia
rights Copyright University of Latvia 2016
toplevel peer_reviewed
fulltext fulltext
identifier ISSN: 2255-8942
ispartof Baltic Journal of Modern Computing, 2016-01, Vol.4 (2), p.115-115
issn 2255-8942
2255-8950
language eng
recordid cdi_proquest_miscellaneous_1835574801
source Publicly Available Content Database (Proquest) (PQ_SDU_P3)
subjects Alphabets
Automation
Graphical representations
Hierarchies
Languages
Mathematical analysis
Strings
Tasks
title Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T10%3A13%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Graphonological%20Levenshtein%20Edit%20Distance:%20Application%20for%20Automated%20Cognate%20Identification&rft.jtitle=Baltic%20Journal%20of%20Modern%20Computing&rft.au=Babych,%20Bogdan&rft.date=2016-01-01&rft.volume=4&rft.issue=2&rft.spage=115&rft.epage=115&rft.pages=115-115&rft.issn=2255-8942&rft.eissn=2255-8950&rft_id=info:doi/&rft_dat=%3Cproquest%3E1835574801%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-p146t-1aba73c14f5311ff21ba8de07635b0755ce27ad6ce26adb9c8052afe83da3f4d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1812275801&rft_id=info:pmid/&rfr_iscdi=true