Cross-Lingual Word Embedding Generation Based on Procrustes-Hungarian Linear Projection
Format: Conference Proceeding
Language: English
Summary: The scale and quality of data available for low-resource machine translation limit the development of word embedding techniques. Supervised cross-lingual word embedding requires large-scale, high-quality seed dictionaries, which are difficult to construct. Although adversarial learning methods provide an alternative, their alignment accuracy is relatively low due to factors such as generator and discriminator quality and noise, and multiple alignments of words across languages lead to error propagation. This paper adopts a multilingual alignment word embedding method that maps word embeddings from different languages into a shared semantic space through supervised or unsupervised learning. For low-resource tasks lacking supervision signals, a rough mapping matrix is pre-learned from a manually constructed bilingual seed dictionary and then refined through optimization to address the alignment of low-frequency words. The Procrustes method is combined with the Hungarian algorithm, and Sinkhorn entropy regularization is incorporated to reduce computational complexity. Adversarial learning is also employed, optimizing the mapping matrix by training a discriminator and a generator to improve cross-lingual word embedding quality. Experiments were conducted on the CCMT2019 corpus for Mongolian-Chinese (Mo-Zh), Uyghur-Chinese (Ug-Zh), and Tibetan-Chinese (Ti-Zh). The results show that this cross-lingual word embedding method improves BLEU scores by 1.79, 2.17, and 1.45 for the three language pairs respectively, demonstrating its practicality and effectiveness.
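The two core steps named in the summary can be sketched as follows — a minimal illustration under stated assumptions, not the authors' implementation. An orthogonal mapping W is fit on row-aligned seed-dictionary embeddings via the closed-form Procrustes solution (SVD), and the Hungarian algorithm (here `scipy.optimize.linear_sum_assignment`) then finds a one-to-one word matching under a cosine-similarity cost. The function names and cost choice are assumptions for illustration; the paper's Sinkhorn regularization and adversarial refinement are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def procrustes_map(X, Y):
    """Closed-form orthogonal Procrustes: W minimizing ||XW - Y||_F.

    X, Y: (n, d) seed-dictionary embeddings (source, target), row-aligned.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def hungarian_align(XW, Y):
    """One-to-one matching of mapped source words to target words.

    Cost is negative cosine similarity; the Hungarian algorithm
    minimizes the total cost over all one-to-one assignments.
    """
    Xn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cost = -(Xn @ Yn.T)
    rows, cols = linear_sum_assignment(cost)
    return rows, cols  # source index i matches target index cols[i]
```

In the full pipeline, W learned on the seed dictionary would map the whole source vocabulary, and the induced matching would extend or refine the dictionary for further iterations.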
ISSN: 2159-1970
DOI: 10.1109/IALP63756.2024.10661153