logo

SCIENTIA SINICA Informationis, Volume 47, Issue 11: 1464-1482(2017) https://doi.org/10.1360/N112017-00107

A novel evolving data stream clustering method based on optimization model

More info
  • ReceivedMay 16, 2017
  • AcceptedJun 19, 2017
  • PublishedNov 14, 2017

Abstract

An optimization model based on the fuzzy maximum entropy method is proposed for the data stream evolving clustering problem. In the model, the fuzziness and effectiveness of cluster partition are described by fuzzy membership and information entropy, respectively. An optimization object function is defined. In the sliding window, the clustering processing of the data subset is construed as an optimization problem. In this way, the inner structural features can be depicted effectively, and the continuity between contiguous windows is preserved simultaneously. The solution of the optimization problem is used as the basis of concept drift detection; as a result, the validity of the detection result is guaranteed and the varying trends in cluster structure can be easily captured. In the simulation, artificial and real datasets are used to verify the performance of the proposed method, and existing evolving clustering algorithms are introduced for comparison with our algorithm for testing purposes. The simulation results demonstrate the validity of the developed algorithm. Under the same conditions, the new method is superior to other clustering algorithms with respect to the accuracy of clustering and concept drift detection; it also reduces computational load and memory usage effectively.


Funded by

国家自然科学基金重点项目(61432011,U1435212)

国家自然科学基金(61673249)

山西省青年科技研究基金(201701D221097)


References

[1] Guha S, Meyerson A, Mishra N. Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng, 2003, 15: 515-528 CrossRef Google Scholar

[2] Khalilian M, Mustapha N. Data stream clustering: challenges and issues. In: Proceedings of the 18th International Conference of Engineers and Computer Scientists, Hong Kong, 2010. 546--549. Google Scholar

[3] Silva J A, Faria E R, Barros R C, et al. Data stream clustering: a survey. ACM Comput Surv, 2014, 46: 125--134. Google Scholar

[4] Guha S, Mishra N. Data Stream Management. Berlin: Springer, 2016. 359--366. Google Scholar

[5] Xu R, WunschII D. Survey of clustering algorithms.. IEEE Trans Neural Netw, 2005, 16: 645-678 CrossRef PubMed Google Scholar

[6] Zhang T, Ramakrishnan R, Livny M. BIRCH. SIGMOD Rec, 1996, 25: 103-114 CrossRef Google Scholar

[7] Park N H, Lee W S. Statistical grid-based clustering over data streams. ACM SIGMOD Rec, 2004, 33: 32--37. Google Scholar

[8] Ordonez C. Clustering binary data streams with K-means. In: Proceedings of the 8th ACM Sigmod Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, 2003. 12--19. Google Scholar

[9] Rodrigues P P, Gama J, Pedroso J P. Hierarchical clustering of time-series data streams. IEEE Trans Knowl & Data Eng, 2008, 20: 615--627. Google Scholar

[10] Aggarwal C G, Han J, Wang J Y, et al. A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, 2003. 81--92. Google Scholar

[11] Zhou A, Cao F, Qian W. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst, 2008, 15: 181-214 CrossRef Google Scholar

[12] Cao F, Ester M, Qian W, et al. Density-based clustering over an evolving data stream with noise. In: Proceedings of the 6th SIAM International Conference on Data Mining, Bethesda, 2006. 328--339. Google Scholar

[13] Kranen P, Assent I, Baldauf C. The ClusTree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst, 2011, 29: 249-272 CrossRef Google Scholar

[14] Zhang X, Furtlehner C, Perez J, et al. Toward autonomic grids: analyzing the job flow with affinity streaming. In: Proceedings of the ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, Paris, 2009. 987--996. Google Scholar

[15] Zhang X, Furtlehner C, Germain-Renaud C. Data Stream Clustering With Affinity Propagation. IEEE Trans Knowl Data Eng, 2014, 26: 1644-1656 CrossRef Google Scholar

[16] Hinkley D V. Inference about the change-point from cumulative sum tests. Biometrika, 1971, 58: 509-523 CrossRef Google Scholar

[17] Goncalves Jr. P M, de Carvalho Santos S G T, Barros R S M. A comparative study on concept drift detectors. Expert Syst Appl, 2014, 41: 8144-8156 CrossRef Google Scholar

[18] Faria E R, Gon?alves I J C R, de Carvalho A C P L F. Novelty detection in data streams. Artif Intell Rev, 2016, 45: 235-269 CrossRef Google Scholar

[19] Chen K, Liu L. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. VLDB J, 2009, 18: 1241-1260 CrossRef Google Scholar

[20] Cao F, Liang J, Bai L. A Framework for Clustering Categorical Time-Evolving Data. IEEE Trans Fuzzy Syst, 2010, 18: 872-882 CrossRef Google Scholar

[21] Lu N, Zhang G, Lu J. Concept drift detection via competence models. Artificial Intelligence, 2014, 209: 11-28 CrossRef Google Scholar

[22] da Costa F G, Rios R A, de Mello R F. Using dynamical systems tools to detect concept drift in data streams. Expert Syst Appl, 2016, 60: 39-50 CrossRef Google Scholar

[23] Liu D, Wu Y X, Jiang H. FP-ELM: An online sequential learning algorithm for dealing with concept drift. Neurocomputing, 2016, 207: 322-334 CrossRef Google Scholar

[24] Frigui H. Advances in Fuzzy Clustering and its Applications. Hoboken: John Wiley & Sons, 2007. 112--130. Google Scholar

[25] Pal N R, Bezdek J C. On cluster validity for the fuzzy c-means model. IEEE Trans Fuzzy Syst, 1995, 3: 370-379 CrossRef Google Scholar

[26] Li R P, Mukaidono M. A maximum entropy approach to fuzzy clustering. In: Proceedings of the 4th IEEE International Conference on Fuzzy Systems, Yokohama, 1995. 2227--2232. Google Scholar

[27] Ren S J, Wang Y D. A proof of the convergence theorem of maximum-entropy clustering algorithm. Sci China Inf Sci, 2010, 53: 1151-1158 CrossRef Google Scholar

[28] Zhang B, Qin S, Wang W, et al. Data stream clustering based on fuzzy c-mean algorithm and entropy theory. Signal Process, 2015, 126: 111--116. Google Scholar

[29] Yang Y. An evaluation of statistical approaches to text categorization. Inf Retrieval, 1999, 1: 69-90 CrossRef Google Scholar

[30] Chen H, Chen M, Lin S. Catching the trend: a framework for clustering concept-drifting categorical data. IEEE Trans Knowl Data Eng, 2008, 21: 652--665. Google Scholar

  • Figure 1

    The distribution of the synthetic data stream

  • Figure 2

    Proposed algorithm on synthetic dataset: the 4th sliding window. (a) ${{\boldsymbol~S}^1}\cup~{{\boldsymbol~S}^2}\cup~\cdots~\cup~{{\boldsymbol~S}^4}$; (b) clustering in ${{\boldsymbol~S}^4}$

  • Figure 3

    Proposed algorithm on synthetic dataset: the 8th sliding window. (a) ${{\boldsymbol~S}^1}\cup~{{\boldsymbol~S}^2}\cup~\cdots~\cup~{{\boldsymbol~S}^8}$; (b) clustering in ${{\boldsymbol~S}^8}$

  • Figure 4

    Proposed algorithm on synthetic dataset: the 25th sliding window. (a) ${{\boldsymbol~S}^1}\cup~{{\boldsymbol~S}^2}\cup~\cdots~\cup~{{\boldsymbol~S}^{25}}$; (b) clusteringprotectłinebreak in ${{\boldsymbol~S}^{25}}$

  • Figure 5

    Proposed algorithm on synthetic dataset: the 40th sliding window. (a) clustering in ${{\boldsymbol~S}^{40}}$; (b) the overall clustering results

  • Figure 6

    Clustering accuracy on each sliding window for the synthetic data set

  • Figure 7

    Execution times of different clustering algorithms. (a) Execution times of clustering algorithms with different data stream sizes; (b) execution times of clustering algorithms with different cluster numbers

  • Figure 8

    Memory usages of different clustering algorithms. (a) Memory usages of clustering algorithms with different data stream sizes; (b) memory usages of clustering algorithms with different cluster numbers

  • Table 1   The results of different clustering algorithms on the synthetic data set
    Index CluStream StreamKM DenStream ClusTree StrAP Entropy based FCM Our algorithm
    AC 0.8874 0.9085 0.9281 0.9182 0.9122 0.9285 0.9441
    PE 0.8946 0.8924 0.9332 0.9116 0.9053 0.9306 0.9406
    RE 0.9045 0.8938 0.9147 0.9269 0.9164 0.9298 0.9342
  •   

    Algorithm 1 数据流演化聚类算法

    Require:${\boldsymbol~X},~\alpha>0,~\beta>0,~\theta>0$;

    Output:聚类模型$\{~{(~{{W^1},{C^1}}~),( {{W^2},{C^2}}~),~\ldots~,(~{{W^T},{C^T}}~)} \}$, 漂移检测结果$\{~{{M^1},{M^2},~\ldots~,{M^T}}~\}$;

    初始化: 估计数据子集${{\boldsymbol~S}^1}$的初始簇数量 ${k^1}$, 利用式(2)和(3)计算聚类模型 $(~{{W^1},{C^1}}~)$;

    for $p~=~2:T$

    ${k^p}~=~{k^{p{\rm{~-~}}1}}$;通过式(13)和(14)计算$({{W^p},{C^p}})$;

    if $M_{\rm~opt}/{|{\boldsymbol~S}^p|}>~\theta$ then

    重新计算数据子集${{\boldsymbol~S}^p}$的簇数量${k^p}$;通过式(2)和(3)重新计算聚类模型$(~{{W^p},{C^p}}~)$;

    end if

    end for

  • Table 2   The construction of each subset sampled from real-world data
    Category label Subset1 Subset2 Subset3 Subset4 Subset5 Subset6 Subset7 Subset8 Subset9 Subset10
    Class1 200 100 100 400 300 300 500 200 300 300
    Class2 200 100 200 200 300 100 100 100 400 200
    Class3 200 200 300 100 200 300 200 400 100 200
    Class4 200 300 100 100 100 200 100 200 100 100
    Class5 200 300 300 200 100 100 100 100 100 200
  • Table 3   The clustering results of different algorithms on the KDD-CUP99 data set
    The last window Index CluStream StreamKM DenStream ClusTree StrAP Entropy based FCM Our algorithm
    AC 0.8525 0.8461 0.8733 0.8645 0.8856 0.8671 0.9133
    Subset1 PE 0.8328 0.8547 0.8664 0.8629 0.8748 0.8563 0.9241
    RE 0.8263 0.8484 0.8653 0.8454 0.8734 0.8514 0.9177
    AC 0.8247 0.8236 0.8714 0.8475 0.8503 0.8570 0.9033
    Subset2 PE 0.8016 0.8277 0.8553 0.8338 0.8480 0.8322 0.8947
    RE 0.8172 0.8339 0.8628 0.8504 0.8331 0.8438 0.9035
    AC 0.7277 0.7174 0.8326 0.8220 0.8464 0.8346 0.9066
    Subset3 PE 0.7164 0.7452 0.8406 0.8347 0.8206 0.8421 0.9127
    RE 0.7253 0.7283 0.8343 0.8392 0.8311 0.8384 0.8936
    AC 0.8065 0.8159 0.8182 0.8166 0.8330 0.7092 0.9076
    Subset4 PE 0.8272 0.8146 0.8253 0.8243 0.8265 0.6957 0.8918
    RE 0.7943 0.8060 0.8207 0.8307 0.8223 0.7144 0.8960
    AC 0.8166 0.8023 0.8346 0.8217 0.8356 0.8296 0.9012
    Subset5 PE 0.8227 0.8275 0.8105 0.8255 0.8344 0.8376 0.9044
    RE 0.8181 0.8282 0.8331 0.8274 0.8295 0.8339 0.9057
    AC 0.6514 0.6918 0.8013 0.8244 0.8692 0.8135 0.9125
    Subset6 PE 0.7022 0.6742 0.8252 0.8657 0.8506 0.8324 0.9073
    RE 0.6758 0.7065 0.8546 0.8114 0.8342 0.8316 0.9088
    AC 0.8123 0.7973 0.8103 0.8562 0.8801 0.7059 0.8942
    Subset7 PE 0.8246 0.8055 0.8220 0.8666 0.8533 0.6993 0.8975
    RE 0.8060 0.8076 0.8048 0.8445 0.8595 0.6985 0.9011
    AC 0.7024 0.7262 0.8365 0.8571 0.8340 0.8146 0.9023
    Subset8 PE 0.7156 0.7004 0.8254 0.8055 0.8172 0.8243 0.9102
    RE 0.7007 0.6995 0.8332 0.8229 0.8108 0.8212 0.9088
    AC 0.8317 0.8248 0.8766 0.8298 0.8007 0.7135 0.8964
    Subset9 PE 0.8484 0.8127 0.8578 0.8554 0.8284 0.7267 0.9034
    RE 0.8206 0.8557 0.8446 0.8204 0.8216 0.7091 0.8992
    AC 0.8247 0.8236 0.8786 0.8714 0.9035 0.8442 0.9130
    Subset10 PE 0.8016 0.8277 0.8895 0.8662 0.8976 0.8476 0.9226
    RE 0.8172 0.8339 0.8842 0.8685 0.8866 0.8336 0.9114
  • Table 4   The clustering results of different algorithms on the Forest-CoverType data set
    The last window Index CluStream StreamKM DenStream ClusTree StrAP Entropy based FCM Our algorithm
    AC 0.8642 0.8556 0.8703 0.8664 0.8684 0.8572 0.9025
    Subset1 PE 0.8836 0.8541 0.8552 0.8358 0.8729 0.8660 0.9148
    RE 0.8855 0.8632 0.8686 0.8451 0.8653 0.8534 0.9112
    AC 0.8306 0.8381 0.8463 0.8774 0.8805 0.8348 0.8974
    Subset2 PE 0.8577 0.8106 0.8331 0.8426 0.8430 0.8556 0.9025
    RE 0.8742 0.8227 0.8276 0.8514 0.8442 0.8283 0.9011
    AC 0.8246 0.8252 0.8264 0.8106 0.8546 0.8152 0.8940
    Subset3 PE 0.7859 08411 0.8361 0.8368 0.8353 0.8364 0.8962
    RE 0.8114 0.8243 0.8221 0.8005 0.8471 0.8377 0.9077
    AC 0.7556 0.7431 0.8033 0.8332 0.8341 0.7621 0.9034
    Subset4 PE 0.7187 0.7457 0.8216 0.8140 0.8218 0.7559 0.8973
    RE 0.7228 0.7264 0.8113 0.8225 0.8115 0.7415 0.8864
    AC 0.8341 0.8351 0.8678 0.8557 0.8641 0.8253 0.9112
    Subset5 PE 0.8205 0.8146 0.8842 0.8622 0.8850 0.8411 0.8978
    RE 0.8332 0.8334 0.8635 0.8787 0.8746 0.8127 0.8741
    AC 0.8210 0.8314 0.8712 0.8746 0.8841 0.8304 0.9072
    Subset6 PE 0.8452 0.8082 0.8736 0.8663 0.8734 0.8220 0.9115
    RE 0.7966 0.8140 0.8559 0.8702 0.8833 0.7857 0.8996
    AC 0.7159 0.6742 0.7936 0.8354 0.7884 0.7026 0.8831
    Subset7 PE 0.6687 0.7018 0.8014 0.7944 0.8021 0.6775 0.9004
    RE 0.6934 0.7229 0.7993 0.8326 0.7935 0.6812 0.8918
    AC 0.8475 0.8417 0.8438 0.8559 0.8414 0.8361 0.9155
    Subset8 PE 0.8352 0.8345 0.8330 0.8673 0.8378 0.8544 0.9037
    RE 0.8068 0.8441 0.8541 0.8509 0.8558 0.8272 0.8977
    AC 0.7463 0.7229 0.8245 0.7936 0.8267 0.7145 0.9011
    Subset9 PE 0.7170 0.7552 0.8309 0.8177 0.8335 0.7433 0.9008
    RE 0.6524 0.7018 0.8262 0.8006 0.8064 0.7706 0.8944
    AC 0.8046 0.8213 0.8664 0.8265 0.8768 0.8391 0.9140
    Subset10 PE 0.7559 0.8176 0.8413 0.8445 0.8712 0.8464 0.9039
    RE 0.8211 0.8224 0.8476 0.8406 0.8553 0.8017 0.8988
  • Table 5   The clustering results of different algorithms on the data stream sampled from real-world data
    Data set Subset size Index CluStream StreamKM DenStream ClusTree StrAP Entropy based FCM Our algorithm
    AC 0.8133 0.7956 0.8542 0.8401 0.8548 0.8174 0.8846
    1000 PE 0.8250 0.8116 0.8557 0.8336 0.8631 0.8028 0.8933
    RE 0.7143 0.7652 0.8361 0.8475 0.8442 0.7734 0.8641
    AC 0.8362 0.8268 0.8630 0.8362 0.8652 0.8257 0.9013
    KDD-CUP99 3000 PE 0.8445 0.8413 0.8441 0.8573 0.8546 0.8556 0.9120
    RE 0.8233 0.8030 0.8572 0.8442 0.8471 0.8393 0.9181
    AC 0.7819 0.7647 0.8064 0.8774 0.8637 0.8186 0.8974
    5000 PE 0.8176 0.8207 0.8144 0.8746 0.8822 0.8335 0.9063
    RE 0.7484 0.7903 0.8132 0.8821 0.8506 0.7840 0.8848
    AC 0.8362 0.8485 0.8342 0.8475 0.8646 0.8369 0.9042
    1000 PE 0.8501 0.8339 0.8557 0.8395 0.8638 0.8552 0.9117
    RE 0.7214 0.7864 0.8476 0.8477 0.8584 0.8004 0.8985
    AC 0.8551 0.8530 0.8872 0.8662 0.8763 0.8464 0.9133
    Forest-Cover 3000 PE 0.8468 0.8671 0.8805 0.8743 0.8671 0.8516 0.9253
    Type RE 0.7704 0.8269 0.8623 0.8777 0.8796 0.8284 0.9070
    AC 0.8436 0.8224 0.8401 0.8456 0.8862 0.8260 0.9176
    5000 PE 0.8145 0.8345 0.8352 0.8379 0.8817 0.8411 0.9144
    RE 0.7551 0.7988 0.8445 0.8443 0.8635 0.8164 0.9092
  • Table 6   The results of different algorithms for drifting detection on the KDD-CUP99 data stream
    Number of Window size Index CluStream StreamKM DenStream ClusTree StrAP Entropy based Our
    clusters FCM algorithm
    PD 20/62 19/53 25/45 24/50 29/48 21/47 31/44
    1000 RD 20/38 19/38 25/38 24/38 29/38 21/38 31/38
    ED 7.7460 7.2801 5.7446 6.3246 5.2915 6.5574 4.4721
    PD 17/50 15/36 20/41 19/40 20/38 18/33 25/31
    $k$=2 3000 RD 17/25 15/25 20/25 19/25 20/25 18/25 25/25
    ED 6.4031 5.5678 5.0990 5.1962 4.7958 4.6904 2.4495
    PD 12/35 13/33 14/28 14/29 14/32 13/28 18/26
    5000 RD 12/18 13/18 14/18 13/18 14/18 13/18 18/18
    ED 5.3852 5.0000 4.2426 4.4721 4.6904 4.4721 2.6458
    PD 18/55 16/50 26/51 20/48 19/45 18/44 30/39
    1000 RD 18/38 16/38 26/38 20/38 19/38 18/38 30/38
    ED 7.5498 7.4833 6.0828 6.7823 6.7082 6.7823 4.1231
    PD 16/45 15/38 17/32 16/40 20/39 16/30 23/29
    $k$=5 3000 RD 16/25 15/25 17/25 16/25 20/25 16/25 23/25
    ED 6.1644 5.7446 4.7958 4.7958 4.8990 4.7958 3.0000
    PD 11/33 12/32 11/24 13/24 12/27 13/26 18/20
    5000 RD 11/18 12/18 11/18 13/18 12/18 13/18 18/18
    ED 5.3852 5.0990 4.4721 4.0000 4.5826 4.2426 1.4142
    PD 18/52 15/48 22/46 20/50 18/47 20/43 30/37
    1000 RD 18/38 15/38 22/38 20/38 18/38 20/38 30/38
    ED 7.3485 7.4833 6.2450 6.9282 7.0000 6.4031 2.8284
    PD 16/42 13/35 18/37 16/35 20/32 15/30 21/29
    $k$=10 3000 RD 16/25 13/25 18/25 16/25 20/25 15/25 21/25
    ED 5.9161 4.9000 5.0990 5.2915 4.1231 5.0000 3.4641
    PD 11/27 11/26 12/25 12/26 14/27 11/25 17/20
    5000 RD 11/18 11/18 12/18 12/18 14/18 11/18 17/18
    ED 4.7958 4.6904 4.3589 4.4721 4.1231 4.5826. 2.0000
    PD 16/53 15/43 18/47 16/44 20/46 18/36 29/35
    1000 RD 16/38 15/38 18/38 16/38 20/38 18/38 29/38
    ED 7.6811 7.1414 7.0000 7.0711 6.6332 6.1644 3.8730
    PD 12/48 14/44 15/36 17/35 18/38 16/33 23/27
    $k$=20 3000 RD 12/25 14/25 15/25 17/25 18/25 16/25 23/25
    ED 7.0000 6.4031 5.5678 5.0990 5.1962 5.0990 2.4495
    PD 13/45 13/38 14/26 13/30 14/24 14/27 16/20
    5000 RD 13/18 13/18 14/18 13/18 14/18 14/18 16/18
    ED 6.0828 5.4772 4.0000 4.6904 3.7417 4.1231 2.4495

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备18024590号-1