SCIENTIA SINICA Informationis, Volume 48, Issue 5: 564-573(2018) https://doi.org/10.1360/N112017-00256

## Bilingual lexicon induction from non-parallel corpora

Meng ZHANG, Yang LIU*, Maosong SUN
• Accepted Jan 26, 2018
• Published May 11, 2018

### Abstract

In cross-lingual natural language processing, the lack of parallel data is a serious problem, and it is common for language pairs with scarce resources. In such cases, it becomes more important to make better use of the translational equivalence encoded in non-parallel corpora. Because the corpora are non-parallel, acquiring translational equivalence must be framed as a small-data or unsupervised learning problem, and the result usually takes the form of a bilingual lexicon. This is an important research problem in artificial intelligence, and it also has significant application value where language resources are scarce. This paper introduces a series of studies that address problems in prior research and explore, from various perspectives, how to obtain better bilingual lexica from non-parallel corpora.

• Figure 1

(Color online) Bilingual lexicon induction from non-parallel corpora

• Figure 2

(Color online) The hubness problem. (a) Nearest-neighbor retrieval suffers from the hubness problem and produces a wrong translation; (b) the earth mover's distance produces the correct translation for this example

• Figure 3

(Color online) Multiple alternative translations based on earth mover's distance

• Figure 4

Bilingual matching model based on latent variables

• Figure 5

Monolingual word embedding spaces of Spanish and English

• Figure 6

(Color online) Adversarial training. (a) Unidirectional transformation model; (b) bidirectional transformation model; (c) adversarial autoencoder model

• Figure 7

(Color online) Learning by earth mover's distance minimization

Copyright 2019 Science China Press Co., Ltd. All rights reserved.