logo

SCIENTIA SINICA Informationis, Volume 49, Issue 12: 1572-1585(2019) https://doi.org/10.1360/SSI-2019-0117

Maximum average entropy-rate based correlation clustering for big data

More info
  • ReceivedJun 4, 2019
  • AcceptedJul 10, 2019
  • PublishedDec 16, 2019

Abstract

Clustering is one of fundamental tasks of data mining and machine learning. Due to the limitation of cluster assumption, lots of clustering algorithms perform poorly on some datasets against their assumptions, especially high-dimensional big data. This paper presents a maximum average entropy-rate based correlation clustering algorithm which is a kind of a graph-based correlation clustering. The objective function of original correlation clustering is decomposed into several single cluster optimizations and the limitation of big data in correlation clustering is removed by the neighboring connected graph. In algorithm implementation, the optimization of proper neighbor searching and correlation clustering are performed by heuristic neighbor searching and cluster generating respectively, and there is also an efficient graph-iterated implementation on distributed computation platform. Compared with other clustering algorithms, the proposed clustering algorithm is more flexible in cluster assumption, when accelerating the clustering process. In an experimental study we demonstrate the performance of the proposed algorithms on several datasets. The proposed clustering algorithm performed better than the other six clustering algorithms on the highest f1-measure and purity values, while its running time on high-dimensional big data is much lower than other clustering algorithms.


Funded by

国家自然科学基金(61772410,61802298,11690011,U1811461)

国家重点研发计划(2017YFB1010004)


References

[1] Kolosnjaji B, Zarras A, Webster G, et al. Deep learning for classification of malware system call sequences. In: Proceedings of Australasian Joint Conference on Artificial Intelligence. Berlin: Springer, 2016. 137--149. Google Scholar

[2] Bühlmann P, Rütimann P, van de Geer S. Correlated variables in regression: Clustering and sparse estimation. J Statistical Planning Inference, 2013, 143: 1835-1858 CrossRef Google Scholar

[3] He Z, Liu H, Wang Y W. Generative Adversarial Networks-Based Semi-Supervised Learning for Hyperspectral Image Classification. Remote Sens, 2017, 9: 1042 CrossRef ADS Google Scholar

[4] Kanungo T, Mount D M, Netanyahu N S. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 881-892 CrossRef Google Scholar

[5] Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 603-619 CrossRef Google Scholar

[6] Estrada F, Jepson A, Chennubhotla C, et al. Spectral embedding and min cut for image segmentation. In: Proceedings of BMVC, Kingston, 2004. 1--10. Google Scholar

[7] Shi J B, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Machine Intell, 2000, 22: 888-905 CrossRef Google Scholar

[8] Bansal N, Blum A, Chawla S. Correlation Clustering. Machine Learning, 2004, 56: 89-113 CrossRef Google Scholar

[9] Felzenszwalb P F, Huttenlocher D P. Efficient Graph-Based Image Segmentation. Int J Comput Vision, 2004, 59: 167-181 CrossRef Google Scholar

[10] Mohebi A, Aghabozorgi S, Wah T Y. Iterative big data clustering algorithms: a review. Softw Pract Exper, 2016, 46: 107-129 CrossRef Google Scholar

[11] Zhao W Z, Ma H F, He Q. Parallel k-means clustering based on mapreduce. In: Proceedings of IEEE International Conference on Cloud Computing. Berlin: Springer, 2009. 674--679. Google Scholar

[12] Bu Y Y, Howe B, Balazinska M. HaLoop. Proc VLDB Endow, 2010, 3: 285-296 CrossRef Google Scholar

[13] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of NSDI, San Jose, 2012. 2. Google Scholar

[14] He Y B, Tan H Y, Luo W M, et al. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In: Proceedings of ICPADS, Tainan, 2011. 473--480. Google Scholar

[15] Chen W Y, Song Y Q, Bai H J. Parallel spectral clustering in distributed systems.. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 568-586 CrossRef PubMed Google Scholar

[16] Chopra S, Rao M R. The partition problem. Math Programming, 1993, 59: 87-115 CrossRef Google Scholar

[17] Nowozin S, Jegelka S. Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. In: Proceedings of ICML, Montreal, 2009. 769--776. Google Scholar

[18] Vitaladevuni S N, Basri R. Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction. In: Proceedings of CVPR, San Francisco, 2010. 2203--2210. Google Scholar

[19] Tsochantaridis I, Joachims T, Hofmann T, et al. Large margin methods for structured and interdependent output variables. J Mach Learn Res, 2005, 6: 1453--1484. Google Scholar

[20] Bagon S, Galun M. Large scale correlation clustering optimization. 2011,. arXiv Google Scholar

[21] Chierichetti F, Dalvi N, Kumar R. Correlation clustering in mapreduce. In: Proceedings of KDD, New York, 2014. 641--650. Google Scholar

[22] Blondel V D, Guillaume J L, Lambiotte R. Fast unfolding of communities in large networks. J Stat Mech, 2008, 2008(10): P10008 CrossRef ADS arXiv Google Scholar

[23] Liu M Y, Tuzel O, Ramalingam S. Entropy-rate clustering: cluster analysis via maximizing a submodular function subject to a matroid constraint.. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 99-112 CrossRef PubMed Google Scholar

[24] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344: 1492-1496 CrossRef PubMed ADS Google Scholar

[25] Frey B J, Dueck D. Clustering by Passing Messages Between Data Points. Science, 2007, 315: 972-976 CrossRef PubMed ADS Google Scholar

[26] Ng A Y, Jordan M I, Weiss Y. On spectral clustering: analysis and an algorithm. In: Proceedings of NIPS, Vancouver, 2002. 849--856. Google Scholar

[27] Bouguettaya A, Yu Q, Liu X M. Efficient agglomerative hierarchical clustering. Expert Syst Appl, 2015, 42: 2785-2797 CrossRef Google Scholar

[28] Franti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell, 2018, 48: 4743-4759 CrossRef Google Scholar

[29] King D E. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 2009, 10(Jul): 1755--1758 DOI: 10.1145/1577069.1755843. Google Scholar

[30] Deng J K, Guo J, Xue N N, et al. Arcface: additive angular margin loss for deep face recognition. In: Proceedings of CVPR, Long Beach, 2019. 4690--4699. Google Scholar

  • Figure 1

    (Color online) Illustration of correlation clustering

  • Figure 2

    (Color online) Comparison of neighbors by original $\sigma_{0}$ and optimized $\hat{\sigma}$. Taking points in Aggregation for instance, neighbors by original $\sigma_{0}=10$ are shown in (a) and neighbors by optimized $\hat{\sigma}=0.4$ are shown in (b).

  • Figure 3

    (Color online) Clustering results comparison on standard 2D datasets

  • Figure 4

    (Color online) Comparison on running time of ECC and $k$-means: running time is the average of 3 times running; values of $k$ for $k$-means are $1003,~2119,~3141,~4207,~5181,~6209,~7232,~8252$ and 9306; parameters for ECC are $\sigma_{0}=0.5,{\rm~iter}_{\rm~max}=10,l=1$; parameters for $k$-NN+Louvain are $l=0,k=10$.

  •   

    Algorithm 1 Neighboring by greedy algorithm-${{\rm~nbr}_{\bar{\mathcal{H}}}}(\cdot,\cdot)$

    Require:Node $~i~$, initial edge set of $i$ $\boldsymbol{E}_{i}~=~\{e_{ij}\in\boldsymbol{E}\}$;

    Output:Edge set based on maximum average entropy rate $\mathcal{N}_{i}$;

    $\mathcal{N}_{i}~\Leftarrow~\emptyset$;

    $\boldsymbol{E}'_{i}~\Leftarrow~$Order edges in $\boldsymbol{E}_{i}$ by weight in descending order;

    $\bar{\mathcal{H}}_{\rm~max}~\Leftarrow~0$; % Variable to store the value of maximum average entropy rate.

    for Edge $e_{ij}$ in $\boldsymbol{E}'_{i}$

    $\bar{\mathcal{H}}_{\rm~temp}~\Leftarrow\bar{\mathcal{H}}(\mathcal{N}_{i})$; % Average entropy rate of current edge set.

    if $\bar{\mathcal{H}}_{\rm~temp}\ge\bar{\mathcal{H}}_{\rm~max}$ then

    Add $e_{ij}$ to $\mathcal{N}_{i}$;

    $\bar{\mathcal{H}}_{\rm~max}~\Leftarrow~\bar{\mathcal{H}}_{\rm~temp}$;

    else

    Return $\mathcal{N}_{i}$;

    end if

    end for

  • Table 1   Clustering metrics comparison on standard 2D datasets, including f1-measure and purity
    Datasets Metrics ECC $k$-means[4] AP[25] Mean-shift[5] Spectral[26] HAC[27] Ncut[7]
    2*Flame $F1$ 1.0000 0.8406 0.8243 0.8528 0.9832 0.7732 0.9917
    $P$ 1.0000 0.8647 0.8488 0.8586 0.8787 0.8595 0.9917
    2*Spiral $F1$ 1.0000 0.3502 0.3225 0.3562 1.0000 0.3917 0.5823
    $P$ 1.0000 0.3447 0.3566 0.3475 1.0000 0.3337 0.5992
    2*Aggregation $F1$ 1.0000 0.8504 0.8618 0.9595 0.7491 0.9097 0.9925
    $P$ 1.0000 0.9453 0.9415 0.9685 0.8242 0.9646 0.9930
    2*Face $F1$ 0.9669 0.8637 0.8867 0.6883 0.3478 0.9605 0.1644
    $P$ 0.9727 0.7134 0.7583 0.5445 0.1738 0.8335 0.1952
  •   

    Algorithm 2 Correlation clustering on single cluster-${{\rm~Cc_{s}}}(\cdot,~\cdot)$

    Require:Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in~\boldsymbol{V},~v.{\text~{cluster~=~`unset'}},~e\in\boldsymbol{E},~e.{\rm~label}=-$, maximum iteration ${\rm~iter}_{\rm~max}$;

    Output:Node set of cluster $\boldsymbol{c}$;

    $\boldsymbol{V}_{\rm~temp}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = `unset'; % Unclustered node set.

    $i~\Leftarrow~$ node from $\boldsymbol{V}_{\rm~temp}$ with maximum degree;

    Set edge in $\boldsymbol{E}_{i}$ .label = +;

    ${\rm~iter}\Leftarrow0$;

    while ${\rm~iter<~iter}_{\rm~max}$ do

    for each $v$ in $\boldsymbol{V}$

    if ${\boldsymbol~f}^{+}(v)\ge{\boldsymbol~f}^{-}(v)$ then

    Set $v.{\rm~cluster}=i$, set edge label in $\boldsymbol{E}_{v}$ to $+$; % $v$ is contained by cluster ${\boldsymbol~c}$.

    else

    Set $v.{\rm~cluster=`unset}$', set edge label in $\boldsymbol{E}_{v}$ to $-$; % $v$ is not contained by cluster ${\boldsymbol~c}$.

    end if

    end for

    ${\rm~iter}\Leftarrow~{\rm~iter}+1$;

    end while

    $\boldsymbol{c}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = $i$;

    Return $\boldsymbol{c}$.

  •   

    Algorithm 3 Correlation clustering on multiple clusters-${\rm~Cc_{m}}(\cdot,~\cdot,~\cdot)$

    Require:Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm~iter}_{\rm~max}$;

    Output:Clustering result $\boldsymbol{C}$;

    $\boldsymbol{E}\Leftarrow~\emptyset$;

    for each $x$ in $\boldsymbol{X}$

    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$; % $\mathcal{N}_{\sigma_{0}}(\cdot)$ return initial neighbor with parameter $\sigma_{0}$.

    $\mathcal{N}_{x}\Leftarrow{{\rm~nbr}_{\bar{\mathcal{H}}}}(x,\boldsymbol{E}_{x})$;

    Add $\mathcal{N}_{x}$ to ${\boldsymbol~E}$;

    end for

    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;

    Initialize $\boldsymbol{G}$: $v\in~\boldsymbol{V},~v.{\text{cluster~=~`unset&apos;}},~e\in\boldsymbol{E},~e.{\rm~label}=-$;

    $\boldsymbol{C}\Leftarrow\{\}$;

    while $\{v\in~\boldsymbol{V},~v.{\text{cluster=`unset&apos;}}\}\ne~\emptyset$ do

    $\boldsymbol{c}\Leftarrow{\rm~Cc_{s}}(\boldsymbol{G},~{\rm~iter}_{\rm~max})$;

    Add $\boldsymbol{c}$ to $\boldsymbol{C}$;

    Reset $\boldsymbol{G}$: $e\in\boldsymbol{E},~e.{\rm~label}=-$;

    end while

    Return $\boldsymbol{C}$.

  •   

    Algorithm 4 Start points initialization-${\rm~Initspts}(\cdot)$

    Require:Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in~\boldsymbol{V},~{\text{v.cluster~=~`unset&apos;}},~e\in\boldsymbol{E},~e.{\text{label~=~`unset&apos;}}$;

    Output:Initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}&apos;)$;

    $\boldsymbol{E}&apos;\Leftarrow\{\}$;

    for each $e_{ij}$ in $\boldsymbol{E}$

    if $e_{ij}.{\text{label~==~`unset&apos;}}$ then

    $e_{ij}.{\text{label}}\Leftarrow$ node with maximum degree between node $i,j$;

    else

    $e_{ij}.{\rm~label}\Leftarrow$ node with maximum degree among node $i,j$ and $e_{ij}.{\rm~label}$;

    end if

    Add $e_{ij}$ to $\boldsymbol{E}&apos;$;

    end for

    Return $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}&apos;)$.

  •   

    Algorithm 5 Clusters generation-${\rm~Gencluster}(\cdot,\cdot,\cdot)$

    Require:Edge initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$, overlapping parameter $l$, maximum iteration ${\rm~iter}_{\rm~max}$;

    Output:Node set of clusters $\boldsymbol{C}$;

    Initialize $\boldsymbol{V}$ of $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V},~v.{\rm~cluster}\leftarrow~{\text{empty~list~with}}~~l~~{\rm~legnth}$;

    while ${\rm~iter}<{\rm~iter}_{\rm~max}$ do

    for each $v$ in $\boldsymbol{V}$

    $\boldsymbol{K}\Leftarrow$ collection of .label value of edge in $\boldsymbol{E}_{v}$;

    ${\rm~templist}\Leftarrow$ empty list;

    for each $k$ in $\boldsymbol{K}$

    if ${\boldsymbol~f}_{k}(v)\ge\sum\nolimits_{i\ne~k}{\boldsymbol~f}_{i}(v)$ then

    Add $k$ to templist;

    end if

    end for

    $v.{\rm~cluster}\Leftarrow$ Keep the top $l$ items in templist ordered by its force ${\boldsymbol~f}_{k}(v)$;

    end for

    ${\rm~iter}\Leftarrow~{\rm~iter}+1$;

    end while

    $\boldsymbol{C}\Leftarrow$ aggregate nodes by their label;

    Return $\boldsymbol{C}$.

  •   

    Algorithm 6 Correlation clustering on big data-${\rm~Cc_{b}}(\cdot,\cdot,\cdot,\cdot)$

    Require:Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm~iter}_{\rm~max}$, overlapping parameter $l$;

    Output:Node set of clusters $\boldsymbol{C}$;

    $\boldsymbol{E}\Leftarrow\emptyset$;

    for each $x$ in $\boldsymbol{X}$

    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$;

    $\mathcal{N}_{x}\Leftarrow{{\rm~nbr}_{\bar{\mathcal{H}}}}(x,\boldsymbol{E}_{x})$;

    Add $\mathcal{N}_{x}$ to ${\boldsymbol~E}$;

    end for

    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;

    ${\rm~Initespts}(\boldsymbol{G})$;

    $\boldsymbol{C}\Leftarrow{\rm~Gencluster}(\boldsymbol{G},l,{\rm~iter}_{\rm~max})$;

    Return $\boldsymbol{C}$.

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备18024590号-1