Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data
The objective of this article is to address the comparability of documents that are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred to as multilingual comparable corpora. Thes...
Published in: | Natural language engineering 2018-09, Vol.24 (5), p.677-694 |
---|---|
Main Authors: | LANGLOIS, D. ; SAAD, M. ; SMAILI, K. |
Format: | Article |
Language: | English |
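Subjects: | Alignment ; Arabic language ; Bilingualism ; Computation and Language ; Computer Science ; Corpus linguistics ; Dictionaries ; English language ; Experiments ; French language ; Information sources ; Multilingualism ; Natural language processing ; Obama, Barack ; Recall ; Semantics ; Similarity measures ; Translation ; Translations ; Websites |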
cited_by | cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3 |
---|---|
cites | cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3 |
container_end_page | 694 |
container_issue | 5 |
container_start_page | 677 |
container_title | Natural language engineering |
container_volume | 24 |
creator | LANGLOIS, D. ; SAAD, M. ; SMAILI, K. |
description | The objective of this article is to address the comparability of documents that are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred to as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from the Euronews website. A more challenging issue is to build a comparable corpus from two different and independent media with distinct editorial lines, such as the British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such a corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents were harvested from the BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English pair of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented to analyze the performance of three methods from the literature for measuring the comparability of documents on the multilingual reference corpora. A recall at rank 1 of 50.16 per cent is achieved with the Cross-Lingual LSI approach on the BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent. |
doi_str_mv | 10.1017/S1351324918000232 |
format | article |
fullrecord | <record><control><sourceid>proquest_hal_p</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_01819710v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cupid>10_1017_S1351324918000232</cupid><sourcerecordid>2080655392</sourcerecordid><originalsourceid>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</originalsourceid><addsrcrecordid>eNp1kE1OwzAQhSMEEqVwAHaRWLEIzNiJE7OrqpYiVWIBrCMntltX-Sl2gtQdd-CGnASHVrBArDx-73tPowmCS4QbBExvn5AmSEnMMQMAQslRMMKY8ShDhGM_ezsa_NPgzLmNZ2JM41GgJ5VZNbVqurDVYdnWW2FFUalQtmU_yO4unH6rxrXNwDhTm8p_u11YK-F6q1zonblVTbn-fP-YNavKuGGa-CZThlJ04jw40aJy6uLwjoOX-ex5uoiWj_cP08kyKimPu4gXGkHLmEGmU6E5B6KZTAgwEjNNZSJpkmaKZFAWUlKgKSuKgpZaamQs0XQcXO9716LKt9bUwu7yVph8MVnmgwaYIU8R3tCzV3t2a9vXXrku37S9bfx6OYEMWJJQTjyFe6q0rXNW6Z9ahHw4ff7n9D5DDxlRF9bIlfqt_j_1BeYHhzU</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2080655392</pqid></control><display><type>article</type><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><source>Cambridge Journals Online</source><source>Social Science Premium Collection</source><source>Linguistics Collection</source><source>Linguistics and Language Behavior Abstracts (LLBA)</source><creator>LANGLOIS, D. ; SAAD, M. ; SMAILI, K.</creator><creatorcontrib>LANGLOIS, D. ; SAAD, M. ; SMAILI, K.</creatorcontrib><description>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. 
Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</description><identifier>ISSN: 1351-3249</identifier><identifier>EISSN: 1469-8110</identifier><identifier>DOI: 10.1017/S1351324918000232</identifier><language>eng</language><publisher>Cambridge, UK: Cambridge University Press</publisher><subject>Alignment ; Arabic language ; Bilingualism ; Computation and Language ; Computer Science ; Corpus linguistics ; Dictionaries ; English language ; Experiments ; French language ; Information sources ; Multilingualism ; Natural language processing ; Obama, Barack ; Recall ; Semantics ; Similarity measures ; Translation ; Translations ; Websites</subject><ispartof>Natural language engineering, 2018-09, Vol.24 (5), p.677-694</ispartof><rights>Copyright © Cambridge University Press 2018</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</citedby><cites>FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</cites><orcidid>0000-0002-1080-7276</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2080655392/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2080655392?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>230,314,780,784,885,12851,21382,21394,27924,27925,31269,33611,33911,43733,43896,72960,74221,74413</link.rule.ids><backlink>$$Uhttps://hal.science/hal-01819710$$DView record in 
HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>LANGLOIS, D.</creatorcontrib><creatorcontrib>SAAD, M.</creatorcontrib><creatorcontrib>SMAILI, K.</creatorcontrib><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><title>Natural language engineering</title><addtitle>Nat. Lang. Eng</addtitle><description>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</description><subject>Alignment</subject><subject>Arabic language</subject><subject>Bilingualism</subject><subject>Computation and Language</subject><subject>Computer Science</subject><subject>Corpus linguistics</subject><subject>Dictionaries</subject><subject>English language</subject><subject>Experiments</subject><subject>French language</subject><subject>Information sources</subject><subject>Multilingualism</subject><subject>Natural language processing</subject><subject>Obama, Barack</subject><subject>Recall</subject><subject>Semantics</subject><subject>Similarity measures</subject><subject>Translation</subject><subject>Translations</subject><subject>Websites</subject><issn>1351-3249</issn><issn>1469-8110</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>7T9</sourceid><sourceid>ALSLI</sourceid><sourceid>CPGLG</sourceid><recordid>eNp1kE1OwzAQhSMEEqVwAHaRWLEIzNiJE7OrqpYiVWIBrCMntltX-Sl2gtQdd-CGnASHVrBArDx-73tPowmCS4QbBExvn5AmSEnMMQMAQslRMMKY8ShDhGM_ezsa_NPgzLmNZ2JM41GgJ5VZNbVqurDVYdnWW2FFUalQtmU_yO4unH6rxrXNwDhTm8p_u11YK-F6q1zonblVTbn-fP-YNavKuGGa-CZThlJ04jw40aJy6uLwjoOX-ex5uoiWj_cP08kyKimPu4gXGkHLmEGmU6E5B6KZTAgwEjNNZSJpkmaKZFAWUlKgKSuKgpZaamQs0XQcXO9716LKt9bUwu7yVph8MVnmgwaYIU8R3tCzV3t2a9vXXrku37S9bfx6OYEMWJJQTjyFe6q0rXNW6Z9ahHw4ff7n9D5DDxlRF9bIlfqt_j_1BeYHhzU</recordid><startdate>20180901</startdate><enddate>20180901</enddate><creator>LANGLOIS, D.</creator><creator>SAAD, M.</creator><creator>SMAILI, K.</creator><general>Cambridge University Press</general><general>Cambridge University Press 
(CUP)</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7T9</scope><scope>7XB</scope><scope>88G</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>CPGLG</scope><scope>CRLPW</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M0N</scope><scope>M2M</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-1080-7276</orcidid></search><sort><creationdate>20180901</creationdate><title>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</title><author>LANGLOIS, D. ; SAAD, M. 
; SMAILI, K.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Alignment</topic><topic>Arabic language</topic><topic>Bilingualism</topic><topic>Computation and Language</topic><topic>Computer Science</topic><topic>Corpus linguistics</topic><topic>Dictionaries</topic><topic>English language</topic><topic>Experiments</topic><topic>French language</topic><topic>Information sources</topic><topic>Multilingualism</topic><topic>Natural language processing</topic><topic>Obama, Barack</topic><topic>Recall</topic><topic>Semantics</topic><topic>Similarity measures</topic><topic>Translation</topic><topic>Translations</topic><topic>Websites</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>LANGLOIS, D.</creatorcontrib><creatorcontrib>SAAD, M.</creatorcontrib><creatorcontrib>SMAILI, K.</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Psychology Database (Alumni)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Social Science Premium Collection</collection><collection>Advanced Technologies & Aerospace 
Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Linguistics Collection</collection><collection>Linguistics Database</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Computing Database</collection><collection>Psychology Database</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>Natural language engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>LANGLOIS, D.</au><au>SAAD, M.</au><au>SMAILI, K.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data</atitle><jtitle>Natural language engineering</jtitle><addtitle>Nat. Lang. 
Eng</addtitle><date>2018-09-01</date><risdate>2018</risdate><volume>24</volume><issue>5</issue><spage>677</spage><epage>694</epage><pages>677-694</pages><issn>1351-3249</issn><eissn>1469-8110</eissn><abstract>The objective, in this article, is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from Euronews website. A more challenging issue is to build comparable corpus from two different and independent media having two distinct editorial lines, such as British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such corpus, we propose to use the Cross-Lingual Latent Semantic approach. For this purpose, documents have been harvested from BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods of the literature allowing to measure the comparability of documents on the multilingual reference corpora. 
A recall at rank 1 of 50.16 per cent is achieved with the Cross-lingual LSI approach for BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.</abstract><cop>Cambridge, UK</cop><pub>Cambridge University Press</pub><doi>10.1017/S1351324918000232</doi><tpages>18</tpages><orcidid>https://orcid.org/0000-0002-1080-7276</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1351-3249 |
ispartof | Natural language engineering, 2018-09, Vol.24 (5), p.677-694 |
issn | 1351-3249 1469-8110 |
language | eng |
recordid | cdi_hal_primary_oai_HAL_hal_01819710v1 |
source | Cambridge Journals Online; Social Science Premium Collection; Linguistics Collection; Linguistics and Language Behavior Abstracts (LLBA) |
subjects | Alignment ; Arabic language ; Bilingualism ; Computation and Language ; Computer Science ; Corpus linguistics ; Dictionaries ; English language ; Experiments ; French language ; Information sources ; Multilingualism ; Natural language processing ; Obama, Barack ; Recall ; Semantics ; Similarity measures ; Translation ; Translations ; Websites |
title | Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T16%3A32%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Alignment%20of%20comparable%20documents:%20Comparison%20of%20similarity%20measures%20on%20French%E2%80%93English%E2%80%93Arabic%20data&rft.jtitle=Natural%20language%20engineering&rft.au=LANGLOIS,%20D.&rft.date=2018-09-01&rft.volume=24&rft.issue=5&rft.spage=677&rft.epage=694&rft.pages=677-694&rft.issn=1351-3249&rft.eissn=1469-8110&rft_id=info:doi/10.1017/S1351324918000232&rft_dat=%3Cproquest_hal_p%3E2080655392%3C/proquest_hal_p%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c394t-9bf10fd4608f7af9902f6d5206246f3d5d3578e280cbdd30376bbb3cfdf1665f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2080655392&rft_id=info:pmid/&rft_cupid=10_1017_S1351324918000232&rfr_iscdi=true |