
SCIENCE CHINA Information Sciences, Volume 64, Issue 3: 130103 (2021) https://doi.org/10.1007/s11432-020-3055-1

Few-shot text classification by leveraging bi-directional attention and cross-class knowledge

  • Received: Jun 29, 2020
  • Accepted: Jul 30, 2020
  • Published: Feb 7, 2021

Abstract


Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 61872446, U19B2024), the Natural Science Foundation of Hunan Province (Grant No. 2019JJ20024), and the Science and Technology Innovation Program of Hunan Province (Grant No. 2020RC4046).


  • Figure 1

    (Color online) The proposed model architecture of BACK.

  • Figure 2

    (Color online) The performance trend of few-shot models as the number of testing classes increases from 5 to 10. (a) Results under the 1-shot setting on three datasets; (b) results under the 5-shot setting on three datasets.
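Varying the number of testing classes corresponds to changing N in the standard N-way K-shot episode construction used throughout these experiments. As a minimal sketch (toy data and a hypothetical `sample_episode` helper, not the paper's code), one evaluation episode can be drawn like this:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode from a {class: [instances]} dict."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support and n_query query instances without replacement.
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 10 candidate classes, 30 instances each.
data = {f"cls{i}": [f"doc{i}_{j}" for j in range(30)] for i in range(10)}
sup, qry = sample_episode(data, n_way=5, k_shot=1, n_query=15,
                          rng=random.Random(0))
```

Raising `n_way` from 5 toward 10 enlarges the label space per episode, which is exactly the harder regime the figure tracks.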

  • Figure 3

    (Color online) The performance of BACK given different values of (a) $\gamma$ and (b) $\lambda$. 5w1s stands for the 5-way 1-shot setting, and 5w5s stands for the 5-way 5-shot setting. The performance of BACK on three datasets is presented.

  • Figure 4

    (Color online) Comparison between support instance embeddings trained by three models: (a) BACK without Bi-Att and LM; (b) BACK without LM; (c) BACK. A $5$-way $5$-shot testing batch is sampled from the Reuters dataset, and the ground truth of the given query is cls. 5, where cls. abbreviates class.

  • Figure 5

    (Color online) Comparison between query instance embeddings trained by three models: (a) BACK without Bi-Att and LM; (b) BACK without LM; (c) BACK. We sampled $50$ queries per class in a $5$-way $5$-shot classification scenario on the FewRel dataset.

  • Figure 6

    (Color online) Two visualized examples of the similarity matrix between the query and a support instance. Owing to limited space, both cases are selected from the HuffPost dataset, where the texts are relatively short. In the heat maps, words of the support instance run horizontally and words of the query run vertically; darker color signifies a higher similarity score.
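The heat maps render a word-by-word similarity matrix between a query and a support instance. As a hedged sketch (toy 3-dimensional embeddings standing in for the model's trained word representations), such a matrix can be built from pairwise cosine similarities:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(query_vecs, support_vecs):
    """Rows index query words, columns index support words."""
    return [[cosine(q, s) for s in support_vecs] for q in query_vecs]

# Toy word embeddings (hypothetical values, for illustration only).
query = {"market": [0.9, 0.1, 0.0], "rally": [0.7, 0.3, 0.1]}
support = {"stocks": [0.8, 0.2, 0.0], "soccer": [0.0, 0.1, 0.9]}
M = similarity_matrix(list(query.values()), list(support.values()))
```

Plotting `M` as a heat map with query words on the vertical axis and support words on the horizontal axis reproduces the layout the caption describes.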

  • Table 1  

    Table 1  Statistics of datasets

    Dataset    # train cls.   # valid cls.   # test cls.   # inst./cls.   len (average/maximum)
    Reuters    15             5              11            20             185.73/936
    HuffPost   20             5              16            900            11.48/44
    FewRel     65             5              10            700            24.95/38
  • Table 2  

    Table 2  Results of $5$-way $1$-shot and $5$-way $5$-shot classification on three datasets using GloVe word embeddings$^{\rm~a)}$

    Method   Reuters $1$-shot   Reuters $5$-shot   HuffPost $1$-shot   HuffPost $5$-shot   FewRel $1$-shot   FewRel $5$-shot
    MetaNet 51.46$\pm$0.45 62.84$\pm$0.61 30.57$\pm$0.62 41.74$\pm$0.52 58.64$\pm$0.44 79.23$\pm$0.75
    SNAIL 53.16$\pm$0.53 63.57$\pm$0.88 31.55$\pm$0.71 41.59$\pm$0.62 59.23$\pm$0.53 78.67$\pm$0.56
    MAML 65.51$\pm$0.19 83.67$\pm$0.89 34.19$\pm$0.79 50.56$\pm$0.67 52.36$\pm$0.71 65.52$\pm$0.84
    FT 71.88$\pm$0.88 78.46$\pm$0.63 39.64$\pm$0.28 59.70$\pm$0.77 71.88$\pm$0.11 75.71$\pm$0.29
    Proto 66.77$\pm$0.12 71.58$\pm$0.18 32.60$\pm$0.18 47.03$\pm$0.17 52.85$\pm$0.13 65.85$\pm$0.49
    RR 72.45$\pm$0.42 83.93$\pm$0.56 36.61$\pm$0.48 50.13$\pm$0.68 57.18$\pm$0.43 70.32$\pm$0.48
    ARR 68.59$\pm$0.43 88.23$\pm$0.87 39.98$\pm$0.53 58.75$\pm$0.42 60.03$\pm$0.37 78.83$\pm$0.51
    BACK 75.22$\pm$0.14 88.01$\pm$0.11 42.63$\pm$0.31 60.76$\pm$0.38 78.43$\pm$0.26 88.64$\pm$0.22
    w/o Bi-Att 72.13$\pm$0.23 86.42$\pm$0.38 38.48$\pm$0.28 54.79$\pm$0.22 76.41$\pm$0.19 84.56$\pm$0.23
    w/o In-Att 75.22$\pm$0.14 86.88$\pm$0.37 42.63$\pm$0.31 58.62$\pm$0.42 78.43$\pm$0.26 86.77$\pm$0.31
    w/o LM 74.01$\pm$0.26 85.33$\pm$0.11 41.34$\pm$0.27 56.71$\pm$0.52 76.37$\pm$0.45 86.32$\pm$0.13
    w/o Shared 74.58$\pm$0.24 85.61$\pm$0.45 41.97$\pm$0.32 57.48$\pm$0.26 77.01$\pm$0.22 86.37$\pm$0.28
    MLP$\rightarrow$Dot 74.44$\pm$0.41 87.67$\pm$0.46 41.24$\pm$0.38 58.89$\pm$0.48 76.77$\pm$0.21 87.23$\pm$0.18
    MLP$\rightarrow$Cos 73.88$\pm$0.35 86.64$\pm$0.35 41.05$\pm$0.44 57.84$\pm$0.48 77.32$\pm$0.24 86.65$\pm$0.36

    a
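The last two ablation rows (MLP$\rightarrow$Dot, MLP$\rightarrow$Cos) swap the learned MLP matcher for a fixed similarity function. A minimal sketch of those two fixed alternatives, assuming dense class-prototype and query vectors (toy values, not the paper's implementation):

```python
import math

def dot_score(q, p):
    """Plain dot product between query and prototype."""
    return sum(a * b for a, b in zip(q, p))

def cos_score(q, p):
    """Length-normalized (cosine) variant of the dot product."""
    denom = (math.sqrt(sum(a * a for a in q))
             * math.sqrt(sum(b * b for b in p)))
    return dot_score(q, p) / denom if denom else 0.0

def classify(query_vec, prototypes, score=cos_score):
    """Assign the query to the class whose prototype scores highest."""
    return max(prototypes, key=lambda c: score(query_vec, prototypes[c]))

# Two toy 2-dimensional class prototypes.
protos = {"cls1": [1.0, 0.0], "cls2": [0.0, 1.0]}
pred = classify([0.9, 0.2], protos)  # nearest prototype under cosine
```

An MLP matcher instead feeds the (query, prototype) pair through a small trained network to produce the score, which is what the unablated rows use.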

  • Table 3  

    Table 3  Results of $5$-way $1$-shot and $5$-way $5$-shot classification on sentence-level datasets using BERT embeddings

    Method   HuffPost $1$-shot   HuffPost $5$-shot   FewRel $1$-shot   FewRel $5$-shot
    MAML 33.19$\pm$0.22 49.87$\pm$0.41 47.60$\pm$0.82 70.56$\pm$1.77
    FT 38.33$\pm$0.41 58.62$\pm$0.28 61.22$\pm$0.45 80.34$\pm$0.64
    Proto 33.49$\pm$0.25 46.70$\pm$0.23 62.54$\pm$0.59 76.53$\pm$0.98
    RR 37.48$\pm$0.48 50.72$\pm$0.62 66.41$\pm$0.71 78.56$\pm$0.71
    ARR 39.36$\pm$0.14 58.24$\pm$0.17 70.38$\pm$0.35 87.58$\pm$0.24
    BACK 43.21$\pm$0.19 60.77$\pm$0.27 80.85$\pm$0.25 90.67$\pm$0.42