
SCIENTIA SINICA Informationis, Volume 51, Issue 1: 1 (2021). https://doi.org/10.1360/SSI-2020-0170

Learning from distribution-changing data streams via decision tree model reuse

  • Received: Jun 9, 2020
  • Accepted: Aug 5, 2020
  • Published: Dec 25, 2020

Abstract


Funded by

National Natural Science Foundation of China (Grant No. 61921006)



  • Figure 1

    (Color online) Illustration of the CondorForest algorithm

  • Figure 2

    (Color online) Illustration of the decision tree model reuse mechanism. (a) shows the concept and the decision tree model of the source domain; (b) shows those of the target domain

  • Figure 3

    (Color online) Robustness analysis

  • Figure 4

    (Color online) Parameter sensitivity analysis on different datasets: (a) update period $p$; (b) model pool size $K$; (c) step size $\eta$

  • Table 1   Basic statistics of datasets involved in the experiments
    Dataset      # instances  Dim  # classes    Dataset      # instances  Dim  # classes
    CIR500G      60000        3    2            GasSensor    4450         129  6
    SINE500G     60000        2    2            Powersupply  29928        2    2
    Luxembourg   1900         32   2            Electricity  45312        8    2
    Weather      18159        8    2            Covertype    581012       54   2
  • Algorithm 1 CondorForest

    Require: Data stream $\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_T, y_T)\}$; update period $p$; model pool size $K$; step size $\eta$; distribution change detector $\mathfrak{D}$.

    Output: Predicted label $\hat{y}_t$, $t = 1, \ldots, T$.

    Use the initial data to train a decision tree model $\mathcal{T}_1$;

    Initialize the model pool $\mathcal{M} = \{\mathcal{T}_1\}$ and its weight $w_1 = 1$;

    for $t = 1, 2, \ldots, T$ do

    Receive the feature $\boldsymbol{x}_t$;

    Each model in $\mathcal{M}$ makes a prediction $\mathcal{T}_k(\boldsymbol{x}_t)$, $k = 1, \ldots, |\mathcal{M}|$;

    Make the prediction $\hat{y}_t = \sum_{k=1}^{|\mathcal{M}|} w_t^k \, \mathcal{T}_k(\boldsymbol{x}_t)$;

    Receive the ground-truth label $y_t$, and each model suffers the loss $\ell(\mathcal{T}_k(\boldsymbol{x}_t), y_t)$;

    Update the model weights according to Eq. (1);

    if $(t \bmod p) = 0$ or $\mathfrak{D}$ detects a distribution change then

    Sample a decision tree model $\tilde{\mathcal{T}}$ from the pool according to the weight distribution;

    Learn a new model $\mathcal{T}_{\mathrm{new}}$ from $\tilde{\mathcal{T}}$ via decision tree model reuse, and add it to the model pool $\mathcal{M}$;

    Reset the model weights to the uniform distribution;

    end if

    end for
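
    For concreteness, the following Python sketch mirrors the flow of Algorithm 1 on a labelled stream. It is an illustrative sketch under stated assumptions, not the authors' implementation: the weight update of Eq. (1) is not reproduced in this excerpt, so a generic exponential (multiplicative-weights) rule is used in its place; the distribution change detector $\mathfrak{D}$ is omitted, keeping only the periodic trigger; and the hypothetical helper reuse_tree() simply refits a tree of the same depth on the recent window, standing in for the paper's decision tree model reuse step.

    # Illustrative sketch only; see the assumptions stated above.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def reuse_tree(old_tree, X_recent, y_recent):
        # Hypothetical stand-in for decision tree model reuse: refit a tree of the
        # same depth on the recent window instead of reusing the old tree's structure.
        return DecisionTreeClassifier(max_depth=max(1, old_tree.get_depth())).fit(X_recent, y_recent)

    def condor_forest(stream, X_init, y_init, p=100, K=5, eta=0.5, seed=0):
        rng = np.random.default_rng(seed)
        pool = [DecisionTreeClassifier().fit(X_init, y_init)]  # model pool M = {T_1}
        w = np.ones(1)                                          # model weights, w_1 = 1
        buf_X, buf_y, predictions = [], [], []
        for t, (x_t, y_t) in enumerate(stream, start=1):
            x_row = np.asarray(x_t).reshape(1, -1)
            votes = np.array([tree.predict(x_row)[0] for tree in pool])
            # combine the pool's predictions with the current weights (weighted vote)
            classes = np.unique(votes)
            y_hat = classes[np.argmax([w[votes == c].sum() for c in classes])]
            predictions.append(y_hat)
            # exponential weight update on the 0/1 loss (stand-in for Eq. (1))
            w *= np.exp(-eta * (votes != y_t).astype(float))
            w /= w.sum()
            buf_X.append(np.asarray(x_t)); buf_y.append(y_t)
            if t % p == 0:                                      # periodic update trigger
                # sample a tree from the pool according to the weight distribution
                sampled = pool[rng.choice(len(pool), p=w)]
                pool.append(reuse_tree(sampled, np.array(buf_X), np.array(buf_y)))
                if len(pool) > K:                               # assumed eviction rule: drop the oldest model
                    pool.pop(0)
                w = np.ones(len(pool)) / len(pool)              # reset weights to uniform
                buf_X, buf_y = [], []
        return predictions

    As in Algorithm 1, the weights are reset to the uniform distribution whenever a new model enters the pool; evicting the oldest model once the pool exceeds $K$ is an assumption made here only to keep the pool bounded.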

  • Table 2   Performance comparisons on two non-linear synthetic datasets and six real-world datasets
    Dataset      DWM             DTEL            TIX             CondorSVM      CondorForest
    CIR500G      77.09 ± 0.71 •  79.03 ± 0.34 •  66.38 ± 0.85 •  68.41 ± 0.87   79.60 ± 1.11
    SINE500G     66.99 ± 0.10 •  74.93 ± 0.34 ○  62.73 ± 0.14 •  65.68 ± 0.12   73.98 ± 0.90
    Luxembourg   90.42 ± 0.55 •  100.0 ± 0.00    90.99 ± 0.97 •  99.98 ± 0.03   100.0 ± 0.00
    Weather      70.83 ± 0.49 •  68.92 ± 0.27 •  70.21 ± 0.33 •  79.37 ± 0.26   73.65 ± 0.66
    GasSensor    76.61 ± 0.36 •  63.82 ± 3.64 •  43.40 ± 2.88 •  81.57 ± 3.77   76.25 ± 4.38
    Powersupply  72.09 ± 0.29 •  69.90 ± 0.38 •  68.34 ± 0.16 •  72.82 ± 0.29   73.23 ± 0.91
    Electricity  78.03 ± 0.17 •  81.05 ± 0.35 •  58.44 ± 0.71 •  84.73 ± 0.33   87.88 ± 1.04
    Covertype    74.17 ± 0.87 •  69.43 ± 1.30 •  64.60 ± 0.89 •  89.58 ± 0.14   91.35 ± 0.24