
SCIENTIA SINICA Informationis, Volume 50, Issue 8: 1239-1254 (2020). https://doi.org/10.1360/SSI-2019-0110

Clustering method based on sample's stability

Article history
  • Received: May 30, 2019
  • Accepted: Nov 1, 2019
  • Published: Aug 6, 2020

Abstract

The complexity of data types and distributions increases the uncertainty of the relationships between data samples, which makes it challenging to discover the cluster structure inherent in a data set. To address this challenge, this paper takes the concept of sample stability, originally introduced for clustering ensembles, and extends it to the broader area of clustering analysis. We theoretically analyze the rationality of sample stability and propose an entropy-based measure for it. We then propose a clustering method based on sample stability. The proposed method divides a data set into stable and unstable samples, discovers the cluster structure of the stable samples, and then assigns the unstable samples to this structure. Experiments on two-dimensional synthetic data sets and an image segmentation data set visually demonstrate the rationality of the sample-stability concept and the effectiveness of the proposed clustering method.


Funded by

National Key R&D Program of China (2018YFB1004300)

Key R&D Program of Shanxi Province (201903D421003)

National Natural Science Foundation of China (61672332, U1805263, 61872226, 61802238, 61673249)

Shanxi Province Research Project for Returned Overseas Scholars (2017023, 2016004)



  • Figure 1

    An example with $\mu=3$ and $\sigma=3$. (a) Conditional probability; (b) posterior probability

  • Figure 2

The curves of (a) $H_2$ and (b) ${\rm fe}$ ($t=0.7$); see the note on $H_2$ after this figure list

  • Figure 3

    Experiments on synthetic data sets. (a) 2d2k; (b) Flame; (c) Jain; (d) WingNut

  • Figure 4

    Experiments on image segmentation data sets

  • Figure 5

(Color online) Nemenyi post-hoc test based on (a) Table 3 and (b) Table 4
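Figure 2(a) plots $H_2$. Assuming the standard notation (the paper's Eq. (5) is not reproduced on this page), $H_2$ is the binary Shannon entropy

$$H_2(p) = -p\log_2 p - (1-p)\log_2(1-p), \qquad p \in [0,1],$$

which equals 0 when $p\in\{0,1\}$ (a pair's co-occurrence is fully determined) and is maximal at $p=0.5$ (maximally ambiguous). This is why low average entropy over a sample's co-occurrence probabilities indicates a stable sample.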

  • Table 1   Benchmark data sets
    Number | Data set | Number of samples | Number of attributes | Number of clusters
    1 | Iris | 150 | 4 | 3
    2 | Wine recognition data | 178 | 13 | 3
    3 | Seeds data set | 210 | 7 | 3
    4 | Glass identification database | 214 | 9 | 6
    5 | Ecoli | 336 | 7 | 8
    6 | Cardiotocography data set | 2126 | 40 | 10
    7 | Image segmentation data | 2310 | 19 | 7
    8 | Waveform database generator | 5000 | 21 | 3
    9 | tr45 | 690 | 8261 | 10
    10 | tr41 | 878 | 7454 | 10
    11 | wap | 1560 | 8460 | 20
    12 | re1 | 1657 | 3758 | 25
  • Algorithm 1   Clustering method based on sample stability

    Input: data set ${X}=\{{\boldsymbol x}_1,{\boldsymbol x}_2,\ldots,{\boldsymbol x}_n\}$, expected number of clusters $k$.

    Output: clustering result $C=\{c_1,c_2,\ldots,c_k\}$.

    Compute the co-occurrence probability matrix over sample pairs, $\boldsymbol{P}=\{p_{ij}\mid 1\leq i\leq n,\ 1\leq j\leq n\}$, by Eq. (6);

    Compute the stability of the $n$ samples, $\boldsymbol{S}=\{s_1,s_2,\ldots,s_n\}$, by Eq. (5);

    Obtain the stable sample set ${\rm SS}$ and the unstable sample set ${\rm NS}$ by Eqs. (7) and (8);

    Cluster the stable sample set ${\rm SS}$ with the AP algorithm: $C_{\rm ss}=\{c_1^{\rm ss},c_2^{\rm ss},\ldots,c_{k'}^{\rm ss}\}\leftarrow {\rm AP}({\rm SS})$;

    while $|{\rm NS}|\neq 0$ do

    Obtain ${\rm NS}^>$ by Eq. (9);

    Update the stable and unstable sample sets: ${\rm SS}={\rm SS}\cup {\rm NS}^>$, ${\rm NS}={\rm NS}-{\rm NS}^>$;

    Expand each cluster in $C_{\rm ss}$ by Eq. (10);

    end while

    if $k' > k$ then

    Merge similar clusters in $C_{\rm SS}$ with the hierarchical clustering algorithm HC, using Eq. (11), until the number of clusters is $k$: $C\leftarrow {\rm HC}(C_{\rm SS},k)$;

    else

    $C\leftarrow C_{\rm SS}$.

    end if

    (A hedged Python sketch of this pipeline is given below.)
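Equations (5)-(11) are referenced in Algorithm 1 but not reproduced on this page, so the following Python sketch fills them with plainly labeled assumptions: a k-means-based co-occurrence matrix stands in for Eq. (6), one minus the mean binary entropy for the stability of Eq. (5), a fixed threshold (echoing the $t=0.7$ of Figure 2) for the split of Eqs. (7) and (8), a one-shot nearest-labeled-neighbor assignment for the iterative expansion of Eqs. (9) and (10), and hierarchical merging of cluster centers for Eq. (11). It illustrates the shape of the pipeline, not the paper's exact method.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans


def co_occurrence_matrix(X, n_runs=20, seed=0):
    # Assumed stand-in for Eq. (6): the fraction of base clusterings
    # (k-means with random k and random initialization) in which a
    # pair of samples falls into the same cluster.
    rng = np.random.default_rng(seed)
    n = len(X)
    P = np.zeros((n, n))
    for _ in range(n_runs):
        k = int(rng.integers(2, max(3, int(np.sqrt(n))) + 1))
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(2**31 - 1))).fit_predict(X)
        P += labels[:, None] == labels[None, :]
    return P / n_runs


def stability(P):
    # Assumed stand-in for Eq. (5): one minus the mean binary entropy H_2
    # of a sample's co-occurrence probabilities; probabilities near 0 or 1
    # (low entropy) mean the sample's relations to the others are stable.
    p = np.clip(P, 1e-12, 1 - 1e-12)
    H2 = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return 1.0 - H2.mean(axis=1)


def ssc(X, k, stab_threshold=0.7, n_runs=20, seed=0):
    X = np.asarray(X, dtype=float)
    s = stability(co_occurrence_matrix(X, n_runs, seed))
    stable = s >= stab_threshold   # fixed-threshold split standing in for Eqs. (7) and (8)
    labels = np.full(len(X), -1)
    # Cluster the stable samples with affinity propagation (AP), as in Algorithm 1.
    ap = AffinityPropagation(random_state=0).fit(X[stable])
    labels[np.flatnonzero(stable)] = ap.labels_
    # Assign each unstable sample to the cluster of its nearest already-labeled
    # sample (a one-shot stand-in for the expansion of Eqs. (9) and (10)).
    for i in np.flatnonzero(~stable):
        done = labels >= 0
        d = np.linalg.norm(X[done] - X[i], axis=1)
        labels[i] = labels[done][np.argmin(d)]
    # If AP produced k' > k clusters, merge their centers hierarchically (Eq. (11)).
    k_prime = labels.max() + 1
    if k_prime > k:
        centers = np.stack([X[labels == c].mean(axis=0) for c in range(k_prime)])
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(centers)[labels]
    return labels


if __name__ == "__main__":
    # Toy usage on three hypothetical Gaussian blobs.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6)])
    print(ssc(X, k=3)[:10])
```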

  • Table 2   The cross tabulation of $C^b$ and $C^d$
    $C^d$ \ $C^b$ | $c_1^b$ | $c_2^b$ | $\cdots$ | $c_k^b$ | SUM
    $c_1^d$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1k}$ | $n_{1\cdot}$
    $c_2^d$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2k}$ | $n_{2\cdot}$
    $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$
    $c_k^d$ | $n_{k1}$ | $n_{k2}$ | $\cdots$ | $n_{kk}$ | $n_{k\cdot}$
    SUM | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $\cdots$ | $n_{\cdot k}$ | $n$
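The cross tabulation of Table 2 is the contingency table from which the adjusted Rand index (ARI) reported in Table 4 is computed. A minimal NumPy sketch of the standard Hubert-Arabie formula (the $3\times 3$ matrix in the usage line is a hypothetical example):

```python
# Adjusted Rand index from a contingency table N with entries n_ij
# (rows: clusters of C^d, columns: clusters of C^b), as in Table 2.
import numpy as np

def ari_from_contingency(N):
    N = np.asarray(N, dtype=np.int64)
    n = N.sum()
    a = N.sum(axis=1)                  # row sums n_{i.}
    b = N.sum(axis=0)                  # column sums n_{.j}
    comb = lambda x: x * (x - 1) // 2  # "choose 2", elementwise
    sum_ij = comb(N).sum()
    sum_a, sum_b = comb(a).sum(), comb(b).sum()
    expected = sum_a * sum_b / comb(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical example: a perfect match between partitions gives ARI = 1.
print(ari_from_contingency([[50, 0, 0], [0, 50, 0], [0, 0, 50]]))  # 1.0
```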
  • Table 3   NMI scores of the nine clustering algorithms (on each data set, the best result is marked with a leading underscore)
    Number K-means K-means* GK-means AP AD-AP DP GDPC BPC SSC
    1 0.7116$\pm$0.0622 0.7309$\pm$0.0142 0.7419 0.7777 0.7777 0.6586 0.6586 0.7630 _0.8031
    2 0.8414$\pm$0.0118 _0.8423$\pm$0.0184 0.8347 0.7507 0.7315 0.5885 0.7415 0.7955 0.8364
    3 0.6743$\pm$0.0000 0.6725$\pm$0.0037 0.6654 0.6873 0.6790 0.6797 0.5246 0.6959 _0.7043
    4 0.4107$\pm$0.0292 0.3409$\pm$0.0313 0.4008 0.3392 0.3065 0.3265 0.2757 0.3918 _0.4481
    5 0.5939$\pm$0.0283 0.5970$\pm$0.0185 0.6224 0.6311 0.5653 0.4985 0.4773 0.6333 _0.6768
    6 0.8972$\pm$0.0227 0.9177$\pm$0.0368 0.9478 0.9270 0.8118 0.8020 0.8199 _1.0000 0.9797
    7 0.6114$\pm$0.0159 0.6065$\pm$0.0112 0.6109 0.6116 0.6095 0.6599 0.6666 0.6664 _0.6683
    8 0.3642$\pm$0.0000 0.3642$\pm$0.0000 0.3642 0.2989 0.2760 0.2214 0.2128 0.3592 _0.3728
    9 0.4669$\pm$0.0406 0.2708$\pm$0.0244 0.4245 0.4062 0.3681 0.2673 0.3209 0.4491 _0.4752
    10 0.4829$\pm$0.0389 0.3160$\pm$0.0312 0.4292 0.4003 0.3917 0.3507 0.4268 0.4311 _0.5160
    11 _0.5058$\pm$0.0211 0.4668$\pm$0.0163 0.4792 0.3603 0.3603 0.2728 0.2608 0.4523 0.4995
    12 0.4360$\pm$0.0170 0.3031$\pm$0.0148 0.4194 0.4464 0.4461 0.3628 0.3524 0.4192 _0.4477
    Average rank 4.0000 5.9167 4.3333 4.6667 6.2500 7.5000 7.4167 3.5833 1.3333
  • Table 4   ARI scores of the nine clustering algorithms (on each data set, the best result is marked with a leading underscore)
    Number K-means K-means* GK-means AP AD-AP DP GDPC BPC SSC
    1 0.6589$\pm$0.1179 0.7102$\pm$0.0080 0.7163 0.7445 0.7445 0.4531 0.4531 0.7184 _0.7874
    2 0.8451$\pm$0.0140 0.8477$\pm$0.0196 0.8471 0.7262 0.7144 0.4956 0.7567 0.8025 _0.8516
    3 0.7049$\pm$0.0000 0.7026$\pm$0.0048 0.6934 0.7064 0.6952 0.7076 0.4726 0.7020 _0.7406
    4 _0.2572$\pm$0.0149 0.1859$\pm$0.0278 0.2471 0.1862 0.1583 0.2267 0.2294 0.2397 0.2481
    5 0.4238$\pm$0.0811 0.4180$\pm$0.0334 0.4150 0.4551 0.3412 0.3086 0.2655 0.6142 _0.7080
    6 0.7779$\pm$0.0654 0.8049$\pm$0.0905 0.8521 0.8558 0.6529 0.6323 0.6372 _1.0000 0.9771
    7 0.4711$\pm$0.0446 0.4736$\pm$0.0354 0.5034 0.5100 0.5117 0.5301 _0.5318 0.5159 0.5260
    8 0.2535$\pm$0.0000 0.2535$\pm$0.0000 0.2536 0.2167 0.1980 0.1897 0.1875 0.2542 _0.2592
    9 _0.3336$\pm$0.0627 0.1855$\pm$0.0421 0.2512 0.2498 0.2217 0.0822 0.1275 0.2848 0.2354
    10 0.3024$\pm$0.0691 0.2698$\pm$0.0305 0.2493 0.2296 0.2349 0.1388 0.2725 0.2695 _0.3066
    11 0.2025$\pm$0.0857 0.1408$\pm$0.0390 0.1792 0.1126 0.1126 0.1756 0.1964 0.1157 _0.2035
    12 0.2451$\pm$0.0181 0.1224$\pm$0.0147 0.1598 0.2066 0.2178 0.1303 0.1660 0.2427 _0.2709
    Average rank 3.9167 5.7500 4.9167 5.3333 6.4167 7.0000 6.2500 3.7500 1.6667
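Figure 5's Nemenyi post-hoc test compares the average ranks in the last rows of Tables 3 and 4: two algorithms differ significantly when their average ranks differ by more than the critical difference ${\rm CD}=q_{\alpha}\sqrt{k(k+1)/(6N)}$. A sketch with the values from Table 3, assuming the usual two-tailed critical value $q_{0.05}\approx 3.102$ for $k=9$ algorithms:

```python
# Nemenyi critical difference for k algorithms compared over N data sets:
# CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
import math

k, N = 9, 12     # 9 algorithms, 12 benchmark data sets (Table 1)
q_005 = 3.102    # two-tailed Nemenyi critical value for k = 9 (assumed from Demsar's table)
cd = q_005 * math.sqrt(k * (k + 1) / (6 * N))
print(f"CD = {cd:.3f}")  # ~3.47

# Average ranks from the last row of Table 3 (NMI).
ranks = {"K-means": 4.0000, "K-means*": 5.9167, "GK-means": 4.3333,
         "AP": 4.6667, "AD-AP": 6.2500, "DP": 7.5000,
         "GDPC": 7.4167, "BPC": 3.5833, "SSC": 1.3333}
for name, r in ranks.items():
    if name != "SSC" and r - ranks["SSC"] > cd:
        print(f"SSC vs {name}: rank gap {r - ranks['SSC']:.2f} > CD, significant at 0.05")
```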
