
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 813-823 (2020). https://doi.org/10.1360/SSI-2019-0284

A semantic relation preserved word embedding reuse method

  • Received: Dec 24, 2019
  • Accepted: Apr 14, 2020
  • Published: Jun 1, 2020

Abstract

When deep learning is applied to natural language processing (NLP), a word embedding layer can significantly improve task performance because word vectors encode semantic information. Word embeddings can be optimized end-to-end together with the rest of the model. However, the embedding layer contains a large number of parameters, so on tasks with a small corpus the training set is easily overfitted. To alleviate this problem, pretrained embeddings obtained from a much larger corpus are commonly reused to boost the performance of the current model. This paper first summarizes several methods for reusing pretrained word embeddings. In addition, as corpus topics change, new words appear in the current task whose embeddings cannot be found among the pretrained vectors. We therefore propose a semantic relation preserved word embedding reuse method (SrpWer). The method first learns relations between words from the current corpus and then uses the pretrained embeddings to generate embeddings for the newly observed words. Experimental results verify the effectiveness of the proposed method.
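For context on the reuse strategies compared later (NoPT, PT-NoFT, and PT-FT in Tables 2 and 3), the short PyTorch sketch below illustrates how such strategies typically differ only in how the embedding layer is initialized and whether it is fine-tuned. It assumes NoPT denotes no pretrained vectors, PT-NoFT frozen pretrained vectors, and PT-FT fine-tuned pretrained vectors; all sizes and variable names are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Placeholder sizes; the actual vocabulary and dimension come from the task and corpus.
vocab_size, dim = 20000, 300
# Stand-in for GloVe / skip-gram vectors, with rows aligned to the task vocabulary.
pretrained = torch.randn(vocab_size, dim)

# NoPT: randomly initialized embeddings, trained from scratch with the task model.
emb_nopt = nn.Embedding(vocab_size, dim)

# PT-NoFT: pretrained vectors copied in and frozen during task training.
emb_pt_noft = nn.Embedding.from_pretrained(pretrained, freeze=True)

# PT-FT: pretrained vectors copied in and fine-tuned end-to-end with the task model.
emb_pt_ft = nn.Embedding.from_pretrained(pretrained, freeze=False)
```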


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61773198, 61632004)



  • Figure 1

    (Color online) Illustration of the deep NLP framework

  • Figure 2

    (Color online) Comparison between SrpWer and conventional reuse methods

  • Figure 3

    (Color online) Learning curves for different usages on the Yelp13 task with WIKI embeddings. (a) Training loss; (b) test loss

  • Table 1   Vocabulary statistics of the pretrained corpora and the current tasks ($n_{\rm I}$: task words covered by the pretrained vocabulary; $n_{\rm O}$: new words)

                    IMDB                          News20                        Yelp13
                    $n_{\rm I}$   $n_{\rm O}$     $n_{\rm I}$   $n_{\rm O}$     $n_{\rm I}$   $n_{\rm O}$
    WIKI-Glove      19911         89              17902         2038            18965         1035
    IMDB-SG         19976         24              13616         6384            18856         1144
  • Algorithm 1   Semantic relation preserved word embedding reuse (SrpWer)

    Require:

    Pretrained word embeddings $\boldsymbol{W}_{\rm P} \in \mathbb{R}^{n_{\rm p} \times d}$;

    Vocabulary of the pretrained corpus $V_{\rm P} = \{v_i\}_{i=1}^{n_{\rm p}}$;

    Current corpus $\mathcal{D}$;

    Current vocabulary $V = \{v_j\}_{j=1}^{n}$;

    Regularization coefficient $\lambda$.

    Procedure:

    1. $\text{Skipgram}(\mathcal{D}) \rightarrow \boldsymbol{W} \in \mathbb{R}^{n \times d}$;
    2. $V_{\rm I} = V_{\rm P} \cap V$, $V_{\rm O} = V \setminus V_{\rm I}$;
    3. $\boldsymbol{W}_{\rm I} = \text{lookup}(\boldsymbol{W}, V_{\rm I})$, $\boldsymbol{W}_{\rm O} = \text{lookup}(\boldsymbol{W}, V_{\rm O})$;
    4. $\boldsymbol{W}_{\rm PI} = \text{lookup}(\boldsymbol{W}_{\rm P}, V_{\rm I})$;
    5. $\boldsymbol{Z}^{\star} = \arg\min_{\boldsymbol{Z}} \Vert \boldsymbol{Z}^{\rm T} \boldsymbol{W}_{\rm I} - \boldsymbol{W}_{\rm O} \Vert_{\rm F}^2 + \lambda \Vert \boldsymbol{Z} \Vert_{\rm F}^2$;
    6. $\boldsymbol{W}_{\rm PO} = {\boldsymbol{Z}^{\star}}^{\rm T} \boldsymbol{W}_{\rm PI}$;
    7. $\hat{\boldsymbol{W}} = [\boldsymbol{W}_{\rm PI}; \boldsymbol{W}_{\rm PO}]$.

    Output: $\hat{\boldsymbol{W}} \in \mathbb{R}^{n \times d}$.
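    For concreteness, a minimal NumPy sketch of Algorithm 1 follows. It uses the closed-form ridge solution of the minimization in step 5, assumes the skip-gram training of step 1 has already been run elsewhere (e.g., the matrix W is given), and all function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def srpwer(W_P, vocab_P, W, vocab, lam=1.0):
    """Sketch of SrpWer (Algorithm 1), given embeddings W already trained
    on the current corpus with skip-gram.

    W_P : (n_p, d) pretrained embeddings, rows indexed by vocab_P
    W   : (n, d)   current-corpus embeddings, rows indexed by vocab
    """
    idx_P = {w: i for i, w in enumerate(vocab_P)}
    idx = {w: j for j, w in enumerate(vocab)}

    V_I = [w for w in vocab if w in idx_P]       # words shared with the pretrained vocabulary
    V_O = [w for w in vocab if w not in idx_P]   # new words, unseen in the pretrained corpus

    W_I = W[[idx[w] for w in V_I]]               # (n_I, d)
    W_O = W[[idx[w] for w in V_O]]               # (n_O, d)
    W_PI = W_P[[idx_P[w] for w in V_I]]          # (n_I, d) pretrained vectors of shared words

    # Step 5: min_Z ||Z^T W_I - W_O||_F^2 + lam ||Z||_F^2 has the closed form
    # Z*^T = W_O W_I^T (W_I W_I^T + lam I)^{-1}.
    A = W_I @ W_I.T + lam * np.eye(W_I.shape[0])
    Zt = np.linalg.solve(A, W_I @ W_O.T).T       # (n_O, n_I); A is symmetric

    # Steps 6-7: carry the learned relations over to the pretrained space.
    W_PO = Zt @ W_PI                             # (n_O, d) embeddings for the new words
    W_hat = np.vstack([W_PI, W_PO])              # (n, d), rows ordered as V_I followed by V_O
    return W_hat, V_I + V_O
```

    The returned matrix can then be loaded into the task model's embedding layer in place of the current-corpus embeddings, with rows ordered as the shared words followed by the new words.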

  • Table 2   Performance comparisons of different usages on three NLP tasks based on WIKI-Glove embeddings

                        CNN                           GRU
                  IMDB    News20   Yelp13       IMDB    News20   Yelp13
    NoPT          0.318   0.416    0.508        0.429   0.684    0.596
    PT-NoFT       0.309   0.444    0.553        0.476   0.777    0.609
    PT-FT         0.336   0.741    0.587        0.473   0.816    0.603
    PT-FT-Mu      0.321   0.665    0.565        0.471   0.809    0.615
    SrpWer-NoFT   0.322   0.648    0.598        0.472   0.806    0.623
    SrpWer-FT     0.369   0.719    0.626        0.469   0.805    0.612
    SrpWer-FT-Mu  0.350   0.677    0.624        0.480   0.809    0.631
    Improve       +0.033  -0.022   +0.039       +0.004  -0.007   +0.016
  • Table 3   Performance comparisons of different usages on three NLP tasks based on IMDB-SG embeddings

                        CNN                           GRU
                  IMDB    News20   Yelp13       IMDB    News20   Yelp13
    NoPT          0.293   0.553    0.541        0.450   0.703    0.614
    PT-NoFT       0.330   0.578    0.551        0.499   0.734    0.628
    PT-FT         0.338   0.745    0.578        0.466   0.819    0.605
    PT-FT-Mu      0.340   0.652    0.574        0.485   0.812    0.618
    SrpWer-NoFT   0.353   0.566    0.598        0.481   0.786    0.613
    SrpWer-FT     0.350   0.686    0.595        0.469   0.819    0.642
    SrpWer-FT-Mu  0.373   0.652    0.634        0.503   0.802    0.641
    Improve       +0.033  -0.059   +0.056       +0.004  +0.000   +0.014
  • Table 4   Performance comparisons of varying new words' ratios on News20 task based on IMDB-SG embeddings and GRU network

    New word ratio   0.01   0.05   0.10   0.15   0.20   0.30   0.40
    SrpWer-NoFT      0.811  0.807  0.816  0.812  0.803  0.798  0.794
    SrpWer-FT        0.822  0.832  0.828  0.817  0.819  0.812  0.817
    SrpWer-FT-Mu     0.816  0.821  0.826  0.815  0.812  0.802  0.805
