SCIENTIA SINICA Informationis, Volume 48, Issue 5: 564-573(2018)

## Bilingual lexicon induction from non-parallel corpora

Meng ZHANG1,2,3, Yang LIU1,2,3,*, Maosong SUN1,2,3
• Accepted Jan 26, 2018
• Published May 11, 2018
### Abstract

In cross-lingual natural language processing, the lack of parallel data is a serious problem, and one that is common in scenarios with scarce language resources. In such cases, making better use of the translational equivalence encoded in non-parallel corpora becomes more important. Because the corpora are non-parallel, acquiring translational equivalence poses a challenging small-data or unsupervised learning problem, and the result usually takes the form of a bilingual lexicon. This is an important research problem in artificial intelligence and also has significant application value in scenarios with scarce language resources. This paper introduces a series of studies that address problems in prior research, exploring from various perspectives how to obtain better bilingual lexica from non-parallel corpora.

• Figure 1

(Color online) Bilingual lexicon induction from non-parallel corpora

• Figure 2

(Color online) The hubness problem. (a) Nearest-neighbor retrieval suffers from the hubness problem and produces a wrong translation; (b) the earth mover's distance produces the correct translation for this example

• Figure 3

(Color online) Multiple alternative translations based on earth mover's distance

• Figure 4

Bilingual matching model based on latent variables

• Figure 5

Monolingual word embedding spaces of Spanish and English

• Figure 6

(Color online) Adversarial training. (a) Unidirectional transformation model; (b) bidirectional transformation model; (c) adversarial autoencoder model

• Figure 7

(Color online) Learning by earth mover's distance minimization

