
Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data

The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred to as multilingual comparable corpora. Thes...


Bibliographic Details
Published in: Natural language engineering, 2018-09, Vol. 24 (5), p. 677-694
Main Authors: LANGLOIS, D., SAAD, M., SMAILI, K.
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3
cites cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3
container_end_page 694
container_issue 5
container_start_page 677
container_title Natural language engineering
container_volume 24
creator LANGLOIS, D.
SAAD, M.
SMAILI, K.
description The objective of this article is to address the issue of the comparability of documents that are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred to as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from the Euronews website. A more challenging issue is to build a comparable corpus from two different and independent media having two distinct editorial lines, such as the British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such a corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from the BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English pair of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods from the literature for measuring the comparability of documents on the multilingual reference corpora. A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach on the BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.
doi_str_mv 10.1017/S1351324918000232
format article
fullrecord <record><control><sourceid>proquest_hal_p</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_01819710v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cupid>10_1017_S1351324918000232</cupid><sourcerecordid>2080655392</sourcerecordid><originalsourceid>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</originalsourceid><addsrcrecordid>eNp1kE1OwzAQhSMEEqVwAHaRWLEIzNiJE7OrqpYiVWIBrCMntltX-Sl2gtQdd-CGnASHVrBArDx-73tPowmCS4QbBExvn5AmSEnMMQMAQslRMMKY8ShDhGM_ezsa_NPgzLmNZ2JM41GgJ5VZNbVqurDVYdnWW2FFUalQtmU_yO4unH6rxrXNwDhTm8p_u11YK-F6q1zonblVTbn-fP-YNavKuGGa-CZThlJ04jw40aJy6uLwjoOX-ex5uoiWj_cP08kyKimPu4gXGkHLmEGmU6E5B6KZTAgwEjNNZSJpkmaKZFAWUlKgKSuKgpZaamQs0XQcXO9716LKt9bUwu7yVph8MVnmgwaYIU8R3tCzV3t2a9vXXrku37S9bfx6OYEMWJJQTjyFe6q0rXNW6Z9ahHw4ff7n9D5DDxlRF9bIlfqt_j_1BeYHhzU</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2080655392</pqid></control><display><type>article</type><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><source>Cambridge Journals Online</source><source>Social Science Premium Collection</source><source>Linguistics Collection</source><source>Linguistics and Language Behavior Abstracts (LLBA)</source><creator>LANGLOIS, D. ; SAAD, M. ; SMAILI, K.</creator><creatorcontrib>LANGLOIS, D. ; SAAD, M. ; SMAILI, K.</creatorcontrib><description>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. 
Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</description><identifier>ISSN: 1351-3249</identifier><identifier>EISSN: 1469-8110</identifier><identifier>DOI: 10.1017/S1351324918000232</identifier><language>eng</language><publisher>Cambridge, UK: Cambridge University Press</publisher><subject>Alignment ; Arabic language ; Bilingualism ; Computation and Language ; Computer Science ; Corpus linguistics ; Dictionaries ; English language ; Experiments ; French language ; Information sources ; Multilingualism ; Natural language processing ; Obama, Barack ; Recall ; Semantics ; Similarity measures ; Translation ; Translations ; Websites</subject><ispartof>Natural language engineering, 2018-09, Vol.24 (5), p.677-694</ispartof><rights>Copyright © Cambridge University Press 2018</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</citedby><cites>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</cites><orcidid>0000-0002-1080-7276</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2080655392/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2080655392?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>230,314,780,784,885,12851,21382,21394,27924,27925,31269,33611,33911,43733,43896,72960,74221,74413</link.rule.ids><backlink>$$Uhttps://hal.science/hal-01819710$$DView record in 
HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>LANGLOIS, D.</creatorcontrib><creatorcontrib>SAAD, M.</creatorcontrib><creatorcontrib>SMAILI, K.</creatorcontrib><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><title>Natural language engineering</title><addtitle>Nat. Lang. Eng</addtitle><description>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</description><subject>Alignment</subject><subject>Arabic language</subject><subject>Bilingualism</subject><subject>Computation and Language</subject><subject>Computer Science</subject><subject>Corpus linguistics</subject><subject>Dictionaries</subject><subject>English language</subject><subject>Experiments</subject><subject>French language</subject><subject>Information sources</subject><subject>Multilingualism</subject><subject>Natural language processing</subject><subject>Obama, Barack</subject><subject>Recall</subject><subject>Semantics</subject><subject>Similarity measures</subject><subject>Translation</subject><subject>Translations</subject><subject>Websites</subject><issn>1351-3249</issn><issn>1469-8110</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>7T9</sourceid><sourceid>ALSLI</sourceid><sourceid>CPGLG</sourceid><recordid>eNp1kE1OwzAQhSMEEqVwAHaRWLEIzNiJE7OrqpYiVWIBrCMntltX-Sl2gtQdd-CGnASHVrBArDx-73tPowmCS4QbBExvn5AmSEnMMQMAQslRMMKY8ShDhGM_ezsa_NPgzLmNZ2JM41GgJ5VZNbVqurDVYdnWW2FFUalQtmU_yO4unH6rxrXNwDhTm8p_u11YK-F6q1zonblVTbn-fP-YNavKuGGa-CZThlJ04jw40aJy6uLwjoOX-ex5uoiWj_cP08kyKimPu4gXGkHLmEGmU6E5B6KZTAgwEjNNZSJpkmaKZFAWUlKgKSuKgpZaamQs0XQcXO9716LKt9bUwu7yVph8MVnmgwaYIU8R3tCzV3t2a9vXXrku37S9bfx6OYEMWJJQTjyFe6q0rXNW6Z9ahHw4ff7n9D5DDxlRF9bIlfqt_j_1BeYHhzU</recordid><startdate>20180901</startdate><enddate>20180901</enddate><creator>LANGLOIS, D.</creator><creator>SAAD, M.</creator><creator>SMAILI, K.</creator><general>Cambridge University Press</general><general>Cambridge University Press 
(CUP)</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7T9</scope><scope>7XB</scope><scope>88G</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>CPGLG</scope><scope>CRLPW</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M0N</scope><scope>M2M</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-1080-7276</orcidid></search><sort><creationdate>20180901</creationdate><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><author>LANGLOIS, D. ; SAAD, M. 
; SMAILI, K.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Alignment</topic><topic>Arabic language</topic><topic>Bilingualism</topic><topic>Computation and Language</topic><topic>Computer Science</topic><topic>Corpus linguistics</topic><topic>Dictionaries</topic><topic>English language</topic><topic>Experiments</topic><topic>French language</topic><topic>Information sources</topic><topic>Multilingualism</topic><topic>Natural language processing</topic><topic>Obama, Barack</topic><topic>Recall</topic><topic>Semantics</topic><topic>Similarity measures</topic><topic>Translation</topic><topic>Translations</topic><topic>Websites</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>LANGLOIS, D.</creatorcontrib><creatorcontrib>SAAD, M.</creatorcontrib><creatorcontrib>SMAILI, K.</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Psychology Database (Alumni)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies &amp; Aerospace 
Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Linguistics Collection</collection><collection>Linguistics Database</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Computing Database</collection><collection>Psychology Database</collection><collection>Engineering Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>Natural language engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>LANGLOIS, D.</au><au>SAAD, M.</au><au>SMAILI, K.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</atitle><jtitle>Natural language engineering</jtitle><addtitle>Nat. Lang. 
Eng</addtitle><date>2018-09-01</date><risdate>2018</risdate><volume>24</volume><issue>5</issue><spage>677</spage><epage>694</epage><pages>677-694</pages><issn>1351-3249</issn><eissn>1469-8110</eissn><abstract>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</abstract><cop>Cambridge, UK</cop><pub>Cambridge University Press</pub><doi>10.1017/S1351324918000232</doi><tpages>18</tpages><orcidid>https://orcid.org/0000-0002-1080-7276</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1351-3249
ispartof Natural language engineering, 2018-09, Vol.24 (5), p.677-694
issn 1351-3249
1469-8110
language eng
recordid cdi_hal_primary_oai_HAL_hal_01819710v1
source Cambridge Journals Online; Social Science Premium Collection; Linguistics Collection; Linguistics and Language Behavior Abstracts (LLBA)
subjects Alignment
Arabic language
Bilingualism
Computation and Language
Computer Science
Corpus linguistics
Dictionaries
English language
Experiments
French language
Information sources
Multilingualism
Natural language processing
Obama, Barack
Recall
Semantics
Similarity measures
Translation
Translations
Websites
title Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T16%3A32%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Alignment%20of%20comparable%20documents:%20Comparison%20of%20similarity%20measures%20on%20French%E2%80%93English%E2%80%93Arabic%20data&rft.jtitle=Natural%20language%20engineering&rft.au=LANGLOIS,%20D.&rft.date=2018-09-01&rft.volume=24&rft.issue=5&rft.spage=677&rft.epage=694&rft.pages=677-694&rft.issn=1351-3249&rft.eissn=1469-8110&rft_id=info:doi/10.1017/S1351324918000232&rft_dat=%3Cproquest_hal_p%3E2080655392%3C/proquest_hal_p%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2080655392&rft_id=info:pmid/&rft_cupid=10_1017_S1351324918000232&rfr_iscdi=true