
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 781-793 (2020) https://doi.org/10.1360/SSI-2020-0005

Meta self-paced learning

More info
  • Received: Jan 6, 2020
  • Accepted: Mar 1, 2020
  • Published: Jun 10, 2020

Abstract

Self-paced learning (SPL) is a learning regime, inspired by the learning processes of humans and animals, that gradually incorporates samples into training from easy to more complex. Recently, SPL has seen significant research progress. However, current SPL algorithms still have critical limitations, in particular the problem of determining the involved hyper-parameters (especially the age parameter). Heuristic strategies based on cross-validation, as well as manual tuning, have been proposed for setting these parameters; however, such strategies are inefficient, lack theoretical support, and are difficult to apply generally in practice. To address these issues, we propose a meta-learning regime for adaptively learning the age parameters involved in SPL. Three typical SPL algorithms are integrated into the proposed regime, and their accuracy and generalization capability are substantiated through regression and classification experiments and compared with conventional SPL paradigms that do not adaptively tune the age parameter.
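
For context, the standard SPL objective used throughout the SPL literature (recapped here in its classical hard weighting form; this is not a reproduction of the paper's numbered equations) jointly optimizes the model parameters ${\boldsymbol w}$ and sample weights ${\boldsymbol v}$, with the age parameter $\lambda$ determining which samples currently count as easy:

$$\min_{{\boldsymbol w},\,{\boldsymbol v}\in[0,1]^N}\ \sum_{i=1}^{N}\big(v_i\,\ell_i({\boldsymbol w})-\lambda v_i\big),\qquad \ell_i({\boldsymbol w})=\ell\big(y_i,f(x_i;{\boldsymbol w})\big),$$

whose optimal weights have the closed form $v_i^{*}=1$ if $\ell_i({\boldsymbol w})<\lambda$ and $v_i^{*}=0$ otherwise. Training alternates between updating ${\boldsymbol v}$ and ${\boldsymbol w}$ while gradually increasing $\lambda$ so that harder samples enter training later; scheduling $\lambda$ is precisely the hyper-parameter problem that the proposed meta-learning regime addresses.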


Funded by

National Natural Science Foundation of China (Grant Nos. 61661166011, 11690011, 61603292, 61721002, U1811461)



  • Figure 1

    (Color online) Variations of the learned latent loss functions during the learning process. (a) and (b) show the results of the regression and classification experiments, respectively

  • Figure 2

    (Color online) Variation of the sample weight distributions during the learning process on CIFAR-10 with 40% noise for different methods. (a)-(d) are generated by Meta-Weight-Net, Meta-Hard, Meta-Linear, and Meta-Mixture, respectively

  • Figure 3

    (Color online) Variation of the sample weight distributions during the learning process on CIFAR-100 with 40% noise for different methods. (a)-(d) are generated by Meta-Weight-Net, Meta-Hard, Meta-Linear, and Meta-Mixture, respectively

  • Figure 4

    (Color online) Varying trends of the learned age parameter (a) and of the number of essential samples involved in training (b) during the learning process, for the Meta-Linear method on CIFAR-10 with 40% noise. Each subfigure shows the results for a different initial value (4, 5, 6, 7, 8, 9, and 10, from left to right). The two curves in (b) show the varying trends of the number of samples whose weights are greater than 0 and greater than 0.7, respectively

  •   

    Algorithm 1 Self-paced learning algorithm

    Require: data $\{x_i,y_i\}_{i=1}^N$, maximum number of iterations $T$, mini-batch size $n$;
    Output: model parameters ${\boldsymbol w}$;
    Initialize: ${\boldsymbol w}^{0}$, ${\boldsymbol v}^{0}$, $\lambda_0$, $t=0$;
    while $t<T$ do
        $t=t+1$;
        Solve (3) to update ${\boldsymbol v}^{(t)}$;
        Solve (4) to update ${\boldsymbol w}^{(t)}$;
        Increase $\lambda$;
    end while
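
    As a concrete illustration, the following is a minimal Python sketch of the alternating scheme in Algorithm 1, assuming the hard weighting scheme and a simple linear regression model; the function and variable names are illustrative, and the closed-form weight update and weighted gradient step below stand in for solving (3) and (4).

    import numpy as np

    def self_paced_learning(X, y, T=200, lam=0.5, mu=1.05, lr=0.05):
        # Self-paced learning with the hard weighting scheme (a sketch, not the paper's exact Eqs. (3) and (4)).
        # X: (N, d) features, y: (N,) targets, fitted by a linear model w.
        N, d = X.shape
        w = np.zeros(d)
        for t in range(T):
            residual = X @ w - y
            losses = residual ** 2                               # per-sample losses under the current model
            v = (losses < lam).astype(float)                     # hard scheme: select only the currently easy samples
            grad = 2 * X.T @ (v * residual) / max(v.sum(), 1.0)  # gradient over selected samples only
            w -= lr * grad                                       # weighted model update
            lam *= mu                                            # increase the age parameter so harder samples enter later
        return w

    # Usage on toy data with a few corrupted targets that SPL should ignore early in training.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=200)
    y[:20] += 5.0
    print(np.linalg.norm(self_paced_learning(X, y) - w_true))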

  • Table 1   Performance comparison (LS error) of all compared methods on the simulation experiment
    Method        LS      Hard    Meta-Hard   Linear   Meta-Linear   Mixture   Meta-Mixture
    Test error    86.23   3.26    2.32        2.40     1.56          1.87      1.35
  •   

    Algorithm 2 Meta self-paced learning algorithm

    Require: training data $\{x_i,y_i\}_{i=1}^N$, meta data $\{x_i^{(m)},y_i^{(m)}\}_{i=1}^M$, maximum number of iterations $T$, mini-batch sizes $n$ and $m$;
    Output: model parameters ${\boldsymbol w}$;
    Initialize: ${\boldsymbol w}^{0}$, ${\boldsymbol v}^{0}$, $\Lambda_0$, $t=0$;
    while $t<T$ do
        $t=t+1$;
        Solve (15) to update $\Lambda^{(t)}$;
        Solve (16) to update ${\boldsymbol v}^{(t)}$;
        Solve (17) to update ${\boldsymbol w}^{(t)}$;
    end while
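
    The sketch below illustrates one meta iteration in the spirit of Algorithm 2, again with the hard weighting scheme and a linear model. Note that the paper's update of $\Lambda$ in (15) is gradient-based; here it is replaced by a simple grid search over candidate age parameters scored on the clean meta set, so the code only shows the role the meta data plays, not the actual solver. All names (meta_spl_step, lam_grid, etc.) are illustrative.

    import numpy as np

    def meta_spl_step(w, X, y, X_meta, y_meta, lam_grid, lr=0.05):
        # One meta self-paced iteration (conceptual stand-in for solving (15)-(17)):
        # choose the age parameter whose induced weighted step performs best on the meta set.
        def weighted_step(w, lam):
            residual = X @ w - y
            v = ((residual ** 2) < lam).astype(float)            # hard-scheme weights, cf. (16)
            grad = 2 * X.T @ (v * residual) / max(v.sum(), 1.0)
            return w - lr * grad, v
        best = None
        for lam in lam_grid:                                     # crude replacement for the gradient update of Lambda in (15)
            w_try, v_try = weighted_step(w, lam)
            meta_loss = np.mean((X_meta @ w_try - y_meta) ** 2)  # clean meta data guides the age parameter
            if best is None or meta_loss < best[0]:
                best = (meta_loss, lam, w_try, v_try)
        return best[1], best[2], best[3]                         # chosen lambda, updated w (cf. (17)), sample weights v

    # Usage: corrupted training data plus a small clean meta set guiding the age parameter.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5)); w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=200); y[:30] += 5.0
    X_meta = rng.normal(size=(20, 5)); y_meta = X_meta @ w_true
    w = np.zeros(5)
    for t in range(100):
        lam, w, v = meta_spl_step(w, X, y, X_meta, y_meta, lam_grid=np.linspace(0.1, 5.0, 20))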

  • Table 2   Test accuracy (%) comparison on CIFAR-10 and CIFAR-100 for all compared methods under varying noise rates$^{\rm a)}$
    Dataset     Method             Sym. $\eta$=0     Sym. $\eta$=0.2   Sym. $\eta$=0.4   Sym. $\eta$=0.6   Asym. $\eta$=0.2  Asym. $\eta$=0.4
    CIFAR-10    CE                 92.89$\pm$0.32    76.83$\pm$2.30    70.77$\pm$2.31    63.21$\pm$4.22    76.83$\pm$2.30    70.77$\pm$2.31
                Forward            93.03$\pm$0.11    86.49$\pm$0.15    80.51$\pm$0.28    75.55$\pm$2.25    87.38$\pm$0.48    78.98$\pm$0.35
                GCE                90.03$\pm$0.30    88.51$\pm$0.37    85.48$\pm$0.16    81.29$\pm$0.23    88.55$\pm$0.22    83.31$\pm$0.14
                Meta-Weight-Net    92.04$\pm$0.15    89.19$\pm$0.57    86.10$\pm$0.18    81.31$\pm$0.37    90.33$\pm$0.61    87.54$\pm$0.23
                Hard               91.72$\pm$0.12    85.84$\pm$0.06    81.10$\pm$0.20    74.49$\pm$0.09    84.54$\pm$0.10    81.20$\pm$0.31
                Meta-Hard          92.20$\pm$0.35    88.77$\pm$0.23    85.74$\pm$0.23    80.67$\pm$0.61    87.14$\pm$0.15    85.21$\pm$0.35
                Linear             90.62$\pm$0.54    88.49$\pm$0.40    82.99$\pm$0.08    75.63$\pm$0.55    85.56$\pm$0.13    81.97$\pm$0.48
                Meta-Linear        91.71$\pm$0.36    89.71$\pm$0.27    86.61$\pm$0.30    82.12$\pm$0.22    89.89$\pm$0.14    86.94$\pm$0.23
                Mixture            91.51$\pm$0.16    88.63$\pm$0.18    83.77$\pm$0.33    77.77$\pm$0.38    88.52$\pm$0.37    82.58$\pm$0.45
                Meta-Mixture       91.90$\pm$0.15    89.78$\pm$0.30    86.82$\pm$0.13    82.20$\pm$0.12    90.41$\pm$0.24    87.62$\pm$0.05
    CIFAR-100   CE                 70.50$\pm$0.12    50.86$\pm$0.27    43.01$\pm$1.16    34.43$\pm$0.94    50.86$\pm$0.27    43.01$\pm$1.16
                Forward            67.81$\pm$0.61    63.75$\pm$0.38    57.53$\pm$0.15    46.44$\pm$1.03    63.28$\pm$0.23    57.90$\pm$0.57
                GCE                67.39$\pm$0.12    63.97$\pm$0.43    58.33$\pm$0.35    41.73$\pm$0.36    62.07$\pm$0.41    55.25$\pm$0.09
                Meta-Weight-Net    69.13$\pm$0.33    64.22$\pm$0.28    58.64$\pm$0.47    47.43$\pm$0.76    64.22$\pm$0.28    58.64$\pm$0.47
                Hard               68.68$\pm$0.11    58.13$\pm$0.31    53.57$\pm$0.57    39.96$\pm$0.24    60.54$\pm$0.56    51.04$\pm$0.87
                Meta-Hard          68.83$\pm$0.13    63.79$\pm$0.18    55.87$\pm$0.95    44.94$\pm$1.09    63.08$\pm$0.42    56.39$\pm$0.38
                Linear             68.07$\pm$0.23    60.13$\pm$0.14    55.13$\pm$0.71    43.28$\pm$0.59    62.11$\pm$0.72    53.73$\pm$0.70
                Meta-Linear        68.42$\pm$1.02    64.86$\pm$0.13    60.03$\pm$0.27    48.46$\pm$0.41    64.27$\pm$0.16    57.68$\pm$0.09
                Mixture            69.05$\pm$0.19    61.20$\pm$0.16    56.52$\pm$0.06    43.60$\pm$1.12    62.50$\pm$0.11    54.97$\pm$0.05
                Meta-Mixture       69.17$\pm$0.36    65.24$\pm$0.21    61.12$\pm$0.30    52.09$\pm$0.58    64.71$\pm$0.19    58.86$\pm$0.05

    a)

  • Table 3   Test accuracy comparison on CIFAR-10 with 40% symmetric noise for different initial values of $\lambda$
    Initial value       $\lambda=4$   $\lambda=5$   $\lambda=6$   $\lambda=7$   $\lambda=8$   $\lambda=9$   $\lambda=10$
    Test accuracy (%)   85.34         86.10         86.83         86.75         86.54         85.93         85.64
