
SCIENCE CHINA Information Sciences, Volume 63, Issue 8: 182104 (2020). https://doi.org/10.1007/s11432-019-2771-0

Important sampling based active learning for imbalance classification

  • Received: Sep 26, 2019
  • Accepted: Jan 19, 2020
  • Published: Jul 7, 2020

Abstract

Imbalance in data distribution hinders the learning performance of classifiers. A popular family of remedies is based on sampling (oversampling the minority class and undersampling the majority class) so that the imbalanced data becomes relatively balanced. However, existing methods usually rely on a single sampling technique, either oversampling or undersampling, which makes them suffer when the imbalance ratio (the number of majority instances over the number of minority instances) is large. In this paper, an active learning framework is proposed that deals with imbalanced data by alternately performing important sampling (ALIS), which consists of selecting important majority-class instances and generating informative minority-class instances. In ALIS, the two sampling strategies reinforce each other: the selected majority-class instances provide clearer information for the next oversampling step, while the generated minority-class instances provide richer information for the next undersampling step. Extensive experiments have been conducted on real-world datasets covering a large range of imbalance ratios. The results demonstrate the superiority of ALIS over state-of-the-art methods in terms of several well-known evaluation metrics.


Acknowledgment

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61822601, 61773050, 61632004, 61972132), Beijing Natural Science Foundation (Grant No. Z180006), National Key Research and Development Program (Grant No. 2017YFC1703506), Fundamental Research Funds for the Central Universities (Grant Nos. 2019JBZ110, 2019YJS040), Youth Foundation of Hebei Education Department (Grant No. QN2018084), Science and Technology Foundation of Hebei Agricultural University (Grant No. LG201804), and Research Project for Self-cultivating Talents of Hebei Agricultural University (Grant No. PY201810).


Supplement

Appendixes A–C.



  • Figure 1

    (Color online) Schematic diagram of the ALIS framework. The classifier is initially trained on all positive instances $\mathcal{P}_{\rm active}^{0}$ and an equal number of randomly drawn negative instances $\mathcal{N}_{\rm active}^{0}$, and is then updated iteratively as new negative points are selected and new positive points are generated.
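
    The caption describes a simple alternating training loop. The sketch below illustrates that control flow only and is not the authors' code: LinearSVC is a stand-in for the paper's linear margin classifier, and select_negatives / generate_positives are the hypothetical helpers sketched after Algorithms 1 and 2 below (their ratio_fn, info_fn, and weight_fn arguments abstract Eqs. (8), (9), and (11), which this page does not reproduce).

      import numpy as np
      from sklearn.svm import LinearSVC  # stand-in for the linear model f in Table 1

      def fit_classifier(P_active, N_active):
          # Retrain on the currently active positives (+1) and negatives (-1).
          X = np.vstack([P_active, N_active])
          y = np.hstack([np.ones(len(P_active)), -np.ones(len(N_active))])
          return LinearSVC().fit(X, y)

      def alis(P, N, ratio_fn, info_fn, weight_fn, batchsize=32, n_iter=10, seed=0):
          rng = np.random.default_rng(seed)
          # Initialization from the caption: all positives plus an equal
          # number of randomly drawn negatives.
          idx = rng.choice(len(N), size=min(len(P), len(N)), replace=False)
          P_active, N_active = P.copy(), N[idx]
          N_pool = np.delete(N, idx, axis=0)
          clf = fit_classifier(P_active, N_active)
          for _ in range(n_iter):
              if len(N_pool) == 0:
                  break
              # Important undersampling (Algorithm 1): add informative negatives.
              N_new, N_pool = select_negatives(clf, N_pool, ratio_fn, batchsize)
              N_active = np.vstack([N_active, N_new])
              clf = fit_classifier(P_active, N_active)
              # Important oversampling (Algorithm 2): add synthetic positives.
              P_new = generate_positives(P_active, N_active,
                                         info_fn=info_fn, weight_fn=weight_fn)
              P_active = np.vstack([P_active, P_new])
              clf = fit_classifier(P_active, N_active)
          return clf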

  • Table 1

    Table 1  Notations in the ALIS framework

    Notation  Description
    $\mathcal{P}$  Original positive class set
    $n^+$  Number of positive instances
    $\mathcal{N}$  Original negative class set
    $n^-$  Number of negative instances
    $\mathcal{D}$  Original training set, $\mathcal{D}=\mathcal{P}\cup\mathcal{N}$
    $n$  Number of training instances, $n=n^++n^-$
    $\mathcal{P}_{\rm active}^{j}$  Synthetic positive class set generated in the $j$th iteration
    $\mathcal{N}_{\rm active}^{j}$  Negative class set selected in the $j$th iteration
    $\mathcal{N}_{\rm pool}$  Remaining negative class set after active selection
    $\mathcal{P}_{\rm active}$  Generated synthetic positive class set, $\mathcal{P}_{\rm active}=\bigcup_{j}\mathcal{P}_{\rm active}^{j}$
    $\mathcal{N}_{\rm active}$  Selected negative class set, $\mathcal{N}_{\rm active}=\bigcup_{j}\mathcal{N}_{\rm active}^{j}$
    $\omega$  Linear predictor
    $f$  Linear model
    $\lambda_1$  Trade-off parameter controlling the margin variance
    $\lambda_2$  Trade-off parameter controlling the margin mean
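
    The last four rows of Table 1 suggest that the classifier $f({\boldsymbol x})=\omega^{\top}{\boldsymbol x}$ is trained with a large-margin-distribution-style objective. A plausible form, written out here for orientation rather than quoted from the paper, penalizes the margin variance (weighted by $\lambda_1$) and rewards the margin mean (weighted by $\lambda_2$): with margins $\gamma_i=y_i\,\omega^{\top}{\boldsymbol x}_i$, mean $\bar{\gamma}=\frac{1}{n}\sum_{i=1}^{n}\gamma_i$, and variance $\hat{\gamma}=\frac{1}{n}\sum_{i=1}^{n}(\gamma_i-\bar{\gamma})^2$,

    $$\min_{\omega}\ \frac{1}{2}\lVert\omega\rVert^{2}+\lambda_1\,\hat{\gamma}-\lambda_2\,\bar{\gamma}+C\sum_{i=1}^{n}\max\bigl(0,\,1-\gamma_i\bigr),$$

    where $C$ weights a hinge loss; the paper's exact equation may differ.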

    Algorithm 1 Important undersampling algorithm

    Input: ${\rm Classifier}^{j}$, pool negative dataset $\mathcal{N}_{\rm pool}$, batchsize.
    Output: actively selected negative dataset $\mathcal{N}_{\rm active}^{j}$.

    Initialize times = 0, ${\rm ratio}_1=1$, ${\rm ratio}_2=0$; let $\mathcal{N}_{\rm pool}^{\prime}$ be $\mathcal{N}_{\rm pool}$ ordered by the distance between each instance and the decision boundary of ${\rm Classifier}^{j}$;
    while ${\rm ratio}_{2} < {\rm ratio}_{1}$ do
      times = times + 1;
      $\mathcal{N}_{1}$ = top $\sharp({\rm times} \times {\rm batchsize})$ instances in $\mathcal{N}_{\rm pool}^{\prime}$;
      $\mathcal{N}_{2}$ = top $\sharp(({\rm times} + 1) \times {\rm batchsize})$ instances in $\mathcal{N}_{\rm pool}^{\prime}$;
      Calculate ${\rm ratio}_1$ of $\mathcal{N}_{1}$ and ${\rm ratio}_2$ of $\mathcal{N}_{2}$ according to (8);
    end while
    $\mathcal{N}_{\rm active}^{j} = \mathcal{N}_{1}$.
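
    Read as code, Algorithm 1 grows the selected set one batch at a time until the ratio from (8) stops increasing. A minimal sketch follows, assuming a scikit-learn-style classifier; since Eq. (8) is not reproduced on this page, the ratio computation is abstracted into a caller-supplied ratio_fn.

      import numpy as np

      def select_negatives(clf, N_pool, ratio_fn, batchsize=32):
          # Sketch of Algorithm 1 (important undersampling). ratio_fn stands
          # in for Eq. (8): it maps a candidate negative subset to the scalar
          # "ratio" used in the stopping test.
          # Order the pool by absolute distance to the current decision
          # boundary, closest (most informative) first.
          order = np.argsort(np.abs(clf.decision_function(N_pool)))
          N_sorted = N_pool[order]
          times, ratio1, ratio2 = 0, 1.0, 0.0
          while ratio2 < ratio1:
              times += 1
              N1 = N_sorted[: times * batchsize]
              N2 = N_sorted[: (times + 1) * batchsize]
              ratio1, ratio2 = ratio_fn(N1), ratio_fn(N2)
              if len(N2) >= len(N_sorted):  # pool exhausted, stop growing
                  break
          # Return the selected negatives and the shrunken pool.
          return N1, N_sorted[times * batchsize:]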


    Algorithm 2 Important oversampling algorithm

    Input: $\mathcal{P}_{\rm active}$, $\mathcal{N}_{\rm active}$, $k$.
    Output: synthetic minority dataset $\mathcal{P}_{\rm active}^{j}$.

    Set the bandwidth $h_i = \min {\rm dis}({\boldsymbol x}_i, {\rm NN}({\boldsymbol x}_i))$ for each minority instance ${\boldsymbol x}_i$;
    Identify the informative minority-class set $\mathcal{P}^{\rm info}$ via (9);
    for ${\boldsymbol x}_{i} \in \mathcal{P}^{\rm info}$ do
      Set the mixture weight $\xi_i$ via (11);
    end for
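
    The listing above breaks off after the weight assignment, and Eqs. (9) and (11) are not reproduced on this page. The sketch below therefore keeps the informative-set test and the mixture weight as caller-supplied placeholders (info_fn, weight_fn) and assumes, as one plausible completion of the generation step, that synthetic positives are drawn from a Gaussian kernel centred at each informative point with its bandwidth $h_i$.

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      def generate_positives(P_active, N_active, k=5, n_new=None,
                             info_fn=None, weight_fn=None, seed=0):
          # Sketch of Algorithm 2 (important oversampling); info_fn and
          # weight_fn are placeholders for Eqs. (9) and (11).
          rng = np.random.default_rng(seed)
          # Bandwidth h_i: distance from x_i to the nearest of its k nearest
          # minority neighbours (column 0 is the point itself).
          dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(P_active).kneighbors(P_active)
          h = dist[:, 1:].min(axis=1)
          # Informative minority set (Eq. (9) placeholder), as row indices,
          # and its mixture weights xi_i (Eq. (11) placeholder).
          idx = info_fn(P_active, N_active)
          xi = weight_fn(P_active[idx], N_active)
          xi = xi / xi.sum()
          # Assumed generation step: pick centres in proportion to xi_i and
          # perturb each with a Gaussian kernel of bandwidth h_i.
          n_new = n_new or len(idx)
          centers = rng.choice(idx, size=n_new, p=xi)
          noise = rng.normal(size=(n_new, P_active.shape[1]))
          return P_active[centers] + noise * h[centers, None]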

  • Table 2

    Table 2  Description of the datasets

    Dataset  $n$  $m$  $n^-$  $n^+$  Ratio ($\frac{n^-}{n^+}$)
    haberman  306  3  225  81  2.8
    libra  360  90  288  72  4
    glass6  214  9  185  29  6.38
    ecoli3  336  7  301  35  8.6
    yeast0256vs3789  1004  8  905  99  9.14
    Satimage  6435  36  5809  626  9.27
    balance  625  4  576  49  11.8
    shuttlec0vsc4  1829  9  1706  123  13.87
    Letter-a  20000  16  19211  789  24.34
    yeast4  1484  8  1433  51  28.1
    yeast6  1484  8  1449  35  41.4
    abalone19  4174  7  4142  32  129.44
  • Table 3

    Table 3  Analysis of variance (ANOVA) test and winning times of pairwise t-tests (in brackets) between ALIS and the baselines on twelve real-world datasets

    Metric  haberman  libra  glass6  ecoli3  yeast0256vs3789  Satimage
    Precision-majority  4.79E-03 (4)  1.18E-06 (4)  0.64 (0)  3.86E-10 (2)  2.48E-05 (2)  1.36E-46 (5)
    Recall-minority  9.36E-10 (4)  2.38E-06 (2)  0.62 (0)  1.21E-10 (2)  7.28E-09 (4)  5.19E-31 (5)
    F$_{\rm macro}$  2.79E-07 (5)  0.0020 (2)  0.0011 (2)  5.44E-02 (4)  1.27E-08 (3)  2.00E-36 (5)
    AUC  0.033 (2)  3.45E-09 (3)  1.79E-09 (2)  5.45E-15 (3)  7.83E-09 (3)  2.59E-06 (3)

    Metric  balance  shuttlec0vsc4  Letter-a  yeast4  yeast6  abalone19
    Precision-majority  8.37E-05 (4)  7.20E-12 (4)  8.70E-09 (4)  4.36E-13 (3)  1.30E-10 (3)  0.15 (3)
    Recall-minority  4.53E-05 (3)  4.14E-12 (4)  1.82E-08 (4)  6.85E-17 (3)  5.23E-14 (3)  1.32E-07 (3)
    F$_{\rm macro}$  0.0249 (1)  1.40E-07 (4)  1.79E-27 (4)  2.74E-08 (2)  3.72E-06 (2)  0.0428 (0)
    AUC  2.01E-06 (2)  5.15E-17 (2)  8.43E-22 (3)  0.8846 (1)  4.07E-05 (2)  4.48E-06 (2)
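
    For orientation, the statistics in Table 3 can be reproduced from per-run scores roughly as follows; the significance level (0.05) and the exact definition of a "win" are assumptions, since the paper's test protocol is not reproduced on this page.

      import numpy as np
      from scipy import stats

      def significance(alis_scores, baseline_scores, alpha=0.05):
          # alis_scores: 1-D array of ALIS results over repeated runs;
          # baseline_scores: list of same-length arrays, one per baseline.
          # One-way ANOVA across all methods gives the tabulated p-value.
          _, p_anova = stats.f_oneway(alis_scores, *baseline_scores)
          # "Winning times": baselines that ALIS beats with a significant
          # paired t-test (the bracketed counts in Table 3).
          wins = sum(
              1 for b in baseline_scores
              if stats.ttest_rel(alis_scores, b).pvalue < alpha
              and alis_scores.mean() > b.mean()
          )
          return p_anova, wins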
