SCIENTIA SINICA Informationis, Volume 50, Issue 6: 813-823 (2020) https://doi.org/10.1360/SSI-2019-0284

## A semantic relation preserved word embedding reuse method

• Accepted: Apr 14, 2020
• Published: Jun 1, 2020

### Abstract

When deep learning is applied to natural language processing, a word embedding layer can improve task performance significantly because of the semantic information expressed in word vectors. Word embeddings can be optimized end-to-end with the whole framework. However, given the number of parameters in a word embedding layer, a model trained on a small corpus can easily overfit the training set. To alleviate this problem, pretrained embeddings obtained from a much larger corpus can be utilized to boost the performance of the current model. This paper summarizes several methods for reusing pretrained word embeddings. Moreover, as corpus topics change, new words appear in a given task, and their embeddings cannot be obtained from the pretrained vectors. To reuse word embeddings under this condition, we propose a semantic relation preserved word embedding reuse method. The proposed method first learns word relations from the current corpus and then utilizes the pretrained word embeddings to generate embeddings for the newly observed words. Experimental results verify the effectiveness of the proposed method.
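The conventional reuse strategies compared later (NoPT, PT-NoFT, PT-FT) all reduce to how the embedding matrix is initialized and whether it is fine-tuned; the gap is that words absent from the pretrained vocabulary get uninformative random vectors. A minimal sketch of such initialization, with function names and toy vectors that are our own illustration rather than the paper's code:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, rng=None):
    """Initialize an embedding matrix for `vocab`.

    Words found in `pretrained` (a dict mapping word -> vector) copy
    their pretrained vector; new words fall back to small random
    vectors -- the gap that SrpWer is designed to close.
    """
    rng = rng or np.random.default_rng(0)
    W = np.empty((len(vocab), dim))
    for row, word in enumerate(vocab):
        if word in pretrained:
            W[row] = pretrained[word]          # reuse the pretrained vector
        else:
            W[row] = rng.normal(scale=0.1, size=dim)  # random init for a new word
    return W

# Toy example: "movie" is covered by the pretrained vectors, "covid" is not.
pretrained = {"movie": np.array([1.0, 0.0]), "good": np.array([0.0, 1.0])}
W = build_embedding_matrix(["movie", "covid"], pretrained, dim=2)
```

Freezing `W` during training corresponds to PT-NoFT, while letting the task gradient update it corresponds to PT-FT.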


• Figure 1 (Color online) Illustration of the deep NLP framework
• Figure 2 (Color online) Comparison between SrpWer and other conventional methods
• Figure 3 (Color online) Learning curves for different usages based on the Yelp13 task and WIKI embeddings. (a) Training loss; (b) test loss

**Algorithm 1** Semantic relation preserved word embedding reuse (SrpWer)

Input:
- pretrained word embeddings $\boldsymbol{W}_{\rm P} \in \mathbb{R}^{n_{\rm p} \times d}$;
- vocabulary of the pretrained corpus $V_{\rm P} = \{v_i\}_{i=1}^{n_{\rm p}}$;
- current corpus $\mathcal{D}$;
- current vocabulary $V = \{v_j\}_{j=1}^{n}$;
- regularization coefficient $\lambda$.

Procedure:
1. $\text{Skipgram}(\mathcal{D}) \rightarrow \boldsymbol{W} \in \mathbb{R}^{n \times d}$;
2. $V_{\rm I} = V_{\rm P} \cap V$, $V_{\rm O} = V \setminus V_{\rm I}$;
3. $\boldsymbol{W}_{\rm I} = \text{lookup}(\boldsymbol{W}, V_{\rm I})$, $\boldsymbol{W}_{\rm O} = \text{lookup}(\boldsymbol{W}, V_{\rm O})$;
4. $\boldsymbol{W}_{\rm PI} = \text{lookup}(\boldsymbol{W}_{\rm P}, V_{\rm I})$;
5. $\boldsymbol{Z}^{\star} = \arg\min_{\boldsymbol{Z}} \left\Vert \boldsymbol{Z}^{\rm T}\boldsymbol{W}_{\rm I} - \boldsymbol{W}_{\rm O} \right\Vert_{\rm F}^2 + \lambda \Vert \boldsymbol{Z} \Vert_{\rm F}^2$;
6. $\boldsymbol{W}_{\rm PO} = {\boldsymbol{Z}^{\star}}^{\rm T}\boldsymbol{W}_{\rm PI}$;
7. $\hat{\boldsymbol{W}} = \left[\boldsymbol{W}_{\rm PI}; \boldsymbol{W}_{\rm PO}\right]$.

Output: $\hat{\boldsymbol{W}} \in \mathbb{R}^{n \times d}$.

• Table 1 Statistics of the vocabularies of the pretrained corpora and the current tasks ($n_{\rm I}$: words shared with the pretrained vocabulary; $n_{\rm O}$: new words)

| Pretrained embeddings | IMDB $n_{\rm I}$ | IMDB $n_{\rm O}$ | News20 $n_{\rm I}$ | News20 $n_{\rm O}$ | Yelp13 $n_{\rm I}$ | Yelp13 $n_{\rm O}$ |
|---|---|---|---|---|---|---|
| WIKI-Glove | 19911 | 89 | 17902 | 2038 | 18965 | 1035 |
| IMDB-SG | 19976 | 24 | 13616 | 6384 | 18856 | 1144 |

• Table 2 Performance comparisons of different usages on three NLP tasks based on WIKI-Glove embeddings

| Usage | CNN IMDB | CNN News20 | CNN Yelp13 | GRU IMDB | GRU News20 | GRU Yelp13 |
|---|---|---|---|---|---|---|
| NoPT | 0.318 | 0.416 | 0.508 | 0.429 | 0.684 | 0.596 |
| PT-NoFT | 0.309 | 0.444 | 0.553 | 0.476 | 0.777 | 0.609 |
| PT-FT | 0.336 | 0.741 | 0.587 | 0.473 | 0.816 | 0.603 |
| PT-FT-Mu | 0.321 | 0.665 | 0.565 | 0.471 | 0.809 | 0.615 |
| SrpWer-NoFT | 0.322 | 0.648 | 0.598 | 0.472 | 0.806 | 0.623 |
| SrpWer-FT | 0.369 | 0.719 | 0.626 | 0.469 | 0.805 | 0.612 |
| SrpWer-FT-Mu | 0.350 | 0.677 | 0.624 | 0.480 | 0.809 | 0.631 |
| Improve | +0.033 | −0.022 | +0.039 | +0.004 | −0.007 | +0.016 |

• Table 3 Performance comparisons of different usages on three NLP tasks based on IMDB-SG embeddings

| Usage | CNN IMDB | CNN News20 | CNN Yelp13 | GRU IMDB | GRU News20 | GRU Yelp13 |
|---|---|---|---|---|---|---|
| NoPT | 0.293 | 0.553 | 0.541 | 0.450 | 0.703 | 0.614 |
| PT-NoFT | 0.330 | 0.578 | 0.551 | 0.499 | 0.734 | 0.628 |
| PT-FT | 0.338 | 0.745 | 0.578 | 0.466 | 0.819 | 0.605 |
| PT-FT-Mu | 0.340 | 0.652 | 0.574 | 0.485 | 0.812 | 0.618 |
| SrpWer-NoFT | 0.353 | 0.566 | 0.598 | 0.481 | 0.786 | 0.613 |
| SrpWer-FT | 0.350 | 0.686 | 0.595 | 0.469 | 0.819 | 0.642 |
| SrpWer-FT-Mu | 0.373 | 0.652 | 0.634 | 0.503 | 0.802 | 0.641 |
| Improve | +0.033 | −0.059 | +0.056 | +0.004 | +0.000 | +0.014 |
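Step 5 of Algorithm 1 is a ridge regression over $\boldsymbol{Z}$, which admits the closed form ${\boldsymbol{Z}^{\star}}^{\rm T} = \boldsymbol{W}_{\rm O}\boldsymbol{W}_{\rm I}^{\rm T}(\boldsymbol{W}_{\rm I}\boldsymbol{W}_{\rm I}^{\rm T} + \lambda \boldsymbol{I})^{-1}$. A numpy sketch of the whole procedure under our own naming conventions (the paper specifies only the optimization, not this implementation):

```python
import numpy as np

def srpwer(W, W_P, idx_in, idx_out, idx_p_in, lam=1.0):
    """Semantic relation preserved word embedding reuse (SrpWer), a sketch.

    W        : (n, d) embeddings trained on the current corpus via skip-gram
    W_P      : (n_p, d) pretrained embeddings
    idx_in   : rows of W for words shared with the pretrained vocabulary (V_I)
    idx_out  : rows of W for new words absent from it (V_O)
    idx_p_in : rows of W_P for the shared words, in the same order as idx_in
    lam      : ridge regularization coefficient (lambda in Algorithm 1)
    """
    W_I, W_O = W[idx_in], W[idx_out]   # lookup on the current embeddings
    W_PI = W_P[idx_p_in]               # lookup on the pretrained embeddings
    # Step 5: Z* = argmin ||Z^T W_I - W_O||_F^2 + lam ||Z||_F^2, whose
    # closed form gives Z*^T = W_O W_I^T (W_I W_I^T + lam I)^{-1}.
    G = W_I @ W_I.T + lam * np.eye(len(idx_in))
    Zt = np.linalg.solve(G, (W_O @ W_I.T).T).T   # solves Zt @ G = W_O W_I^T (G symmetric)
    # Step 6: transfer the learned word relations into the pretrained space.
    W_PO = Zt @ W_PI
    # Step 7: stack shared-word and new-word embeddings for the full vocabulary.
    return np.vstack([W_PI, W_PO])
```

The returned matrix keeps the pretrained vectors for shared words exactly and expresses each new word as the same linear combination of shared words that held in the current corpus, which is the sense in which semantic relations are preserved.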
• Table 4 Performance comparisons of varying new-word ratios on the News20 task based on IMDB-SG embeddings and a GRU network

| New-word ratio | 0.01 | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 | 0.4 |
|---|---|---|---|---|---|---|---|
| SrpWer-NoFT | 0.811 | 0.807 | 0.816 | 0.812 | 0.803 | 0.798 | 0.794 |
| SrpWer-FT | 0.822 | 0.832 | 0.828 | 0.817 | 0.819 | 0.812 | 0.817 |
| SrpWer-FT-Mu | 0.816 | 0.821 | 0.826 | 0.815 | 0.812 | 0.802 | 0.805 |


Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.