
Cross-Lingual Word Embedding Generation Based on Procrustes-Hungarian Linear Projection

Bibliographic Details
Main Authors: Xing, Hao; Wu, Nier; Ji, Yatu; Liu, Yang; Liu, Na; Lu, Min
Format: Conference Proceeding
Language: English
Description
Summary: The scale and quality of data available for low-resource machine translation limit the development of word embedding techniques. Supervised cross-lingual word embedding requires large-scale, high-quality seed dictionaries, which are difficult to construct. Although adversarial learning offers an alternative, its alignment accuracy is relatively low due to factors such as generator and discriminator quality and noise; moreover, multiple alignments of words across languages lead to error propagation. This paper adopts a multilingual alignment word embedding method that maps word embeddings from different languages into a shared semantic space through supervised or unsupervised learning. For low-resource tasks lacking supervision signals, a rough mapping matrix is first learned from a manually constructed bilingual seed dictionary and then refined through optimization to address the alignment of low-frequency words. The Procrustes method is combined with the Hungarian algorithm, and Sinkhorn entropy regularization is incorporated to reduce computational complexity. Adversarial learning is also employed, training a generator and a discriminator to optimize the mapping matrix and improve cross-lingual word embedding quality. Experiments were conducted on the CCMT2019 corpus for Mongolian-Chinese (Mo-Zh), Uyghur-Chinese (Ug-Zh), and Tibetan-Chinese (Ti-Zh). The results show that the proposed cross-lingual word embedding method improves BLEU scores by 1.79, 2.17, and 1.45 for the three language pairs, respectively, demonstrating its practicality and effectiveness.
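
The abstract names three alignment components: an orthogonal Procrustes mapping, Hungarian-algorithm matching, and Sinkhorn entropy regularization as a lower-complexity alternative. The sketch below is not the authors' code; it only illustrates how these pieces are commonly combined in cross-lingual embedding alignment. The array shapes, seed-dictionary size, epsilon, iteration counts, and random data are illustrative assumptions, and the paper's adversarial generator/discriminator component is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def procrustes(X, Y):
    """Orthogonal mapping W minimizing ||X @ W - Y||_F.

    X, Y: (n, d) arrays of source/target embeddings for n seed pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def hungarian_match(sim):
    """Exact one-to-one word alignment maximizing total similarity (O(n^3))."""
    rows, cols = linear_sum_assignment(-sim)  # negate: solver minimizes cost
    return rows, cols

def sinkhorn_match(sim, eps=0.05, n_iter=100):
    """Entropy-regularized soft matching, cheaper than exact Hungarian."""
    K = np.exp((sim - sim.max()) / eps)        # shift for numerical stability
    for _ in range(n_iter):
        K /= K.sum(axis=1, keepdims=True)      # normalize rows
        K /= K.sum(axis=0, keepdims=True)      # normalize columns
    return K  # approximately doubly stochastic alignment matrix

# Illustrative refinement loop on synthetic data: learn a rough mapping from
# a small "seed dictionary", then alternate matching and Procrustes refitting.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.standard_normal((500, 300)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
W = procrustes(X[:100], Y[:100])               # rough mapping from seed pairs
for _ in range(3):
    rows, cols = hungarian_match((X @ W) @ Y.T)
    W = procrustes(X[rows], Y[cols])           # refit on the induced pairs
```

In this kind of self-learning loop, the exact Hungarian step can be swapped for `sinkhorn_match` when vocabularies are large, trading exact one-to-one assignments for a soft transport plan at much lower cost.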
ISSN: 2159-1970
DOI: 10.1109/IALP63756.2024.10661153