
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 824-844 (2020) https://doi.org/10.1360/SSI-2020-0009

Association mining method based on neighborhood perspective

  • Received: Jan 10, 2020
  • Accepted: Apr 22, 2020
  • Published: Jun 8, 2020

Abstract

Two important tasks in big-data association mining are identifying potentially complex associations among massive numbers of variables and quantifying the strength of different forms of association. However, uncertain data distributions and diverse association types make it difficult to guarantee the applicability and accuracy of both measures built on distribution assumptions and data-driven non-parametric measures. An effective association measure that is unbiased with respect to relationship type is therefore urgently needed. In this article, starting from the requirement of fairly ranking potential relationships in big data, we review the existing axiomatic conditions for association metrics, propose several properties that association measures for big data should satisfy, discuss the limitations of two families of neighborhood-based association methods, and propose a new association measure based on $k$-NN granules, which we call the maximum neighborhood coefficient (MNC). Experiments on synthetic and real datasets verify the effectiveness and superiority of the proposed method from different perspectives. Finally, we highlight interesting phenomena observed in the experiments and open theoretical questions that we hope will motivate deeper thinking and research in this field.
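The full text defines the maximum neighborhood coefficient precisely; as a purely illustrative sketch (not the authors' formula), a neighborhood-overlap score conveys the $k$-NN-granule intuition: when $X$ and $Y$ are strongly associated, each sample's $k$ nearest neighbors along $X$ largely coincide with its $k$ nearest neighbors along $Y$, while under independence the two neighborhoods agree only by chance.

```python
# Illustrative k-NN neighborhood-overlap association score.
# NOTE: this is a hypothetical sketch of the k-NN-granule idea,
# NOT the paper's maximum neighborhood coefficient (MNC).
import math
import random

def knn_indices(values, k):
    """For each 1-D sample, the index set of its k nearest neighbors."""
    n = len(values)
    neigh = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(values[j] - values[i]))
        neigh.append(set(order[:k]))
    return neigh

def neighborhood_overlap(x, y, k=5):
    """Mean fraction of shared k-NN between the X-view and the Y-view."""
    nx, ny = knn_indices(x, k), knn_indices(y, k)
    return sum(len(a & b) / k for a, b in zip(nx, ny)) / len(x)

random.seed(0)
x = [random.random() for _ in range(200)]
y_lin = [2.0 * v for v in x]                     # exact scaling: same neighbor order
y_sin = [math.sin(4 * math.pi * v) for v in x]   # non-monotone functional relationship
y_ind = [random.random() for _ in range(200)]    # independent noise

print(neighborhood_overlap(x, y_lin))  # 1.0: neighborhoods identical
print(neighborhood_overlap(x, y_sin))  # moderate: same-branch neighbors still agree
print(neighborhood_overlap(x, y_ind))  # near k/(n-1): agreement only by chance
```

Under a scaling map the two neighbor orderings coincide exactly, so the score is 1; under independence the expected overlap is roughly $k/(n-1)$, which is the baseline a practical measure would need to correct for.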


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61672332, 61872226)

Key R&D Program of Shanxi Province (International Science and Technology Cooperation) (201903D421003)

Research Project of Shanxi Province for Returned Overseas Scholars (2017023)

Natural Science Foundation of Shanxi Province (201701D121052)

Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (201802013)



  • Figure 1

    (Color online) Empirical performance of different parameters on independent data. (a) $d_X=1$, $d_Y=1$, $\alpha=0.5$; (b) $d_X=5$, $d_Y=3$, $\alpha=0.5$; (c) $d_X=5$, $d_Y=5$, $\alpha=0.5$; (d) $d_X=5$, $d_Y=10$, $\alpha=0.5$; (e) $d_X=1$, $d_Y=1$, $\alpha=0.8$; (f) $d_X=5$, $d_Y=3$, $\alpha=0.8$; (g) $d_X=5$, $d_Y=5$, $\alpha=0.8$; (h) $d_X=5$, $d_Y=10$, $\alpha=0.8$

  • Figure 2

    Functions $f$ used to analyze the equitability of MNC, colored in descending order of monotonicity

  • Figure 3

    Performance of all comparison measures on $R^{2}$-equitability. (a)–(d) MI$_{\rm KSG}$ performance; (e)–(h) MI$_{\rm LNC}$ performance; (i)–(l) MI$_{\rm GNN}$ performance; (m)–(p) NS performance; (q)–(s) dCor, MIC and MNC performance, respectively; (t) function marks and their monotonicity values

  • Figure 4

    Performance of all comparison measures on self-equitability. (a)–(d) MI$_{\rm KSG}$ performance; (e)–(h) MI$_{\rm LNC}$ performance; (i)–(l) MI$_{\rm GNN}$ performance; (m)–(p) NS performance; (q)–(s) dCor, MIC and MNC performance, respectively; (t) function marks and their monotonicity values

  • Figure 5

    (Color online) Empirical performance of MNC and dCor with respect to three different relationship types as the dimension of variables associated with $Y$ in ${\boldsymbol X}$ increases. (a) Linear relationship; (b) mixed relationship; (c) nonlinear relationship

  • Figure 6

    (Color online) Five redundant relationship types in ${\boldsymbol X}$ and empirical performance of MNC and dCor in each case. (a) ${\boldsymbol X}=(X_{1},X_{2})$; (b) ${\boldsymbol X}=(X_{1},X_{2},X_{1}^{2})$; (c) ${\boldsymbol X}=(X_{1},X_{2},X_{1}^{2},X_{2}^{2})$; (d) ${\boldsymbol X}=(X_{1},X_{2},X_{1}^{2},X_{2}^{3})$; (e) ${\boldsymbol X}=(X_{1},X_{2},X_{1},X_{2})$; (f) experimental result

  • Figure 7

    (Color online) Empirical performance of MNC and dCor with respect to three relationship types as the number of variables in ${\boldsymbol X}$ that are independent of $Y$ increases. (a) Linear relationship; (b) mixed relationship; (c) nonlinear relationship

  • Figure 8

    (Color online) Demonstration of some representative associations in the ENB dataset. (a) ($X_{2},Y_{1}$), (b) ($X_{7},Y_{1}$), (c) (${\boldsymbol X}=(X_{2},X_{7}),Y_{1}$) by MNC; (d) ($X_{5},Y_{1}$), (e) ($X_{4},Y_{1}$), (f) (${\boldsymbol X}=(X_{5},X_{4}),Y_{1}$) by dCor

  • Table 1   Performance of all methods on noiseless functional relationships
    Relationship type Spearman Pearson MIC ${\rm MI}_{\rm KSG}$ ${\rm MI}_{\rm LNC}$ ${\rm MI}_{\rm GNN}$ NS MNC
    Random 0.03 0.03 0.17 0.13 0.13 0.26 0.00 0.21
    Linear 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
    Exponential 1.00 0.87 1.00 1.00 1.00 1.00 0.99 1.00
    Cubic 0.78 0.66 1.00 1.00 1.00 0.99 1.00 1.00
    Linear Periodic 0.31 0.33 1.00 0.74 0.74 0.93 1.00 1.00
    Sin (Fourier frequency) 0.14 $-0.09$ 1.00 0.05 0.05 0.93 0.99 1.00
    Sin (Varying frequency) $-0.11$ $-0.11$ 1.00 0.04 0.04 0.98 0.99 1.00
    Parabolic $-0.00$ 0.00 1.00 1.00 1.00 1.00 1.00 1.00
    Sin (non-Fourier frequency) 0.00 0.00 1.00 0.38 0.38 0.97 0.80 1.00
  • Table 2   Association strength of pairwise variables computed by different measures
    Xvar Yvar MNC Spearman dCor MIC MNC-$\rho^2$ MIC-$\rho^2$ $\rho$
    $X_{1}$ $X_{2}$ 1 $-$1 1 1 0.02 0.02 $-$0.99
    $X_{1}$ $X_{3}$ 1 $-$0.26 0.45 1 0.96 0.95 $-$0.20
    $X_{1}$ $X_{4}$ 1 $-$0.87 0.88 1 0.25 0.25 $-$0.87
    $X_{1}$ $X_{5}$ 1 0.87 0.86 1 0.31 0.31 0.83
    $X_{2}$ $X_{3}$ 1 0.26 0.45 0.99 0.96 0.95 0.20
    $X_{2}$ $X_{4}$ 1 0.87 0.89 1 0.22 0.22 0.88
    $X_{2}$ $X_{5}$ 1 $-$0.87 0.89 1 0.26 0.26 $-$0.86
    $X_{4}$ $X_{5}$ 1 $-$0.94 0.99 1 0.05 0.05 $-$0.97
    $X_{5}$ $X_{6}$ 0.79 0 0 0 0.79 0 0
    $X_{3}$ $X_{5}$ 0.78 0.22 0.31 0.37 0.71 0.30 0.28
    $X_{3}$ $X_{4}$ 0.72 $-$0.19 0.34 0.39 0.63 0.30 $-$0.29
    $X_{4}$ $X_{6}$ 0.66 0 0 0 0.66 0 0
    $X_{1}$ $X_{6}$ 0.58 0 0 0 0.58 0 0
    $X_{2}$ $X_{6}$ 0.58 0 0 0 0.58 0 0
    $X_{3}$ $X_{6}$ 0.56 0 0 0 0.56 0 0
    $X_{5}$ $X_{8}$ 0.5 0 0 0 0.5 0 0
    $X_{3}$ $X_{8}$ 0.42 0 0 0 0.42 0 0
    $X_{6}$ $X_{8}$ 0.40 0 0 0 0.40 0 0
    $X_{7}$ $X_{8}$ 0.38 0.19 0.21 0.34 0.33 0.29 0.21
    $X_{3}$ $X_{7}$ 0.36 0 0 0 0.36 0 0
    $X_{5}$ $X_{7}$ 0.34 0 0 0 0.34 0 0
    $X_{4}$ $X_{8}$ 0.25 0 0 0 0.25 0 0
    $X_{4}$ $X_{7}$ 0.25 0 0 0 0.25 0 0
    $X_{1}$ $X_{8}$ 0.25 0 0 0 0.25 0 0
    $X_{2}$ $X_{8}$ 0.25 0 0 0 0.25 0 0
    $X_{6}$ $X_{7}$ 0.25 0 0 0 0.25 0 0
    $X_{1}$ $X_{7}$ 0.23 0 0 0 0.23 0 0
    $X_{2}$ $X_{7}$ 0.23 0 0 0 0.23 0 0
  • Table 3   Associations of 8 variables against heating load
    Xvar Yvar MNC MIC Spearman dCor MIC-$\rho^2$ MNC-$\rho^2$
    $X_{1}$ $Y_{1}$ 0.81 1 0.62 0.76 0.61 0.43
    $X_{2}$ $Y_{1}$ 0.81 1 $-$0.62 0.78 0.57 0.38
    $X_{3}$ $Y_{1}$ 0.72 0.67 0.47 0.43 0.46 0.51
    $X_{4}$ $Y_{1}$ 0.66 1 $-$0.80 0.91 0.26 $-$0.09
    $X_{7}$ $Y_{1}$ 0.65 0.68 0.32 0.25 0.60 0.57
    $X_{5}$ $Y_{1}$ 0.51 1 0.86 0.92 0.21 $-$0.28
    $X_{8}$ $Y_{1}$ 0.45 0.26 0.07 0.09 0.25 0.44
    $X_{6}$ $Y_{1}$ 0.39 0.14 0 0.01 0.14 0.39
  • Table 4   Top 5 associations ranked by MNC and dCor on different combined variables
    X Y MNC X Y dCor
    $(X_{2},X_{7})$ $Y_{1}$ 0.94 $(X_{4},X_{5})$ $Y_{1}$ 0.92
    $(X_{1},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{5})$ $Y_{1}$ 0.91
    $(X_{4},X_{7})$ $Y_{1}$ 0.86 $(X_{2},X_{5})$ $Y_{1}$ 0.91
    $(X_{5},X_{7})$ $Y_{1}$ 0.84 $(X_{3},X_{5})$ $Y_{1}$ 0.91
    $(X_{3},X_{4})$ $Y_{1}$ 0.84 $(X_{2},X_{4})$ $Y_{1}$ 0.89
    $(X_{2},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{3},X_{4},X_{5})$ $Y_{1}$ 0.92
    $(X_{1},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{4},X_{5})$ $Y_{1}$ 0.91
    $(X_{2},X_{4},X_{7})$ $Y_{1}$ 0.94 $(X_{2},X_{4},X_{5})$ $Y_{1}$ 0.91
    $(X_{1},X_{4},X_{7})$ $Y_{1}$ 0.94 $(X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{3},X_{4},X_{7})$ $Y_{1}$ 0.94 $(X_{4},X_{5},X_{8})$ $Y_{1}$ 0.90
    $(X_{2},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{2},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{1},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{2},X_{3},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{2},X_{3},X_{4},X_{5})$ $Y_{1}$ 0.91
    $(X_{1},X_{3},X_{5},X_{7}$) $Y_{1}$ 0.94 $(X_{1},X_{3},X_{4},X_{5})$ $Y_{1}$ 0.91
    $(X_{2},X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{2},X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{1},X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.91
    $(X_{1},X_{2},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.90
    $(X_{1},X_{2},X_{3},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{2},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.90
    $(X_{1},X_{2},X_{3},X_{4},X_{7})$ $Y_{1}$ 0.94 $(X_{2},X_{3},X_{4},X_{5},X_{8})$ $Y_{1}$ 0.90
    $(X_{1},X_{2}$,$X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.94 $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{7})$ $Y_{1}$ 0.90
    $(X_{2},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.88 $(X_{2},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.90
    $(X_{1},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.88 $(X_{1},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.90
    $(X_{1},X_{2},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.88 $(X_{1},X_{2},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.89
    $(X_{1},X_{2},X_{3},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.88 $(X_{1},X_{2},X_{3},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.89
    $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.88 $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{7},X_{8})$ $Y_{1}$ 0.89
    $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7})$ $Y_{1}$ 0.84 $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7})$ $Y_{1}$ 0.88
    $(X_{1},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.74 $(X_{2},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.88
    $(X_{1},X_{2},X_{3},X_{4},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.74 $(X_{1},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.88
    $(X_{2},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.73 $(X_{1},X_{2},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.88
    $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.74 $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8})$ $Y_{1}$ 0.88
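The dCor columns in Tables 2–4 use distance correlation, which is fully specified by its standard sample formula: double-center the pairwise distance matrices of $X$ and $Y$, take the mean elementwise product as the squared distance covariance, and normalize by the distance variances. A minimal from-scratch sketch for one-dimensional variables (plain Python, $O(n^{2})$; not the authors' implementation) follows.

```python
# Sample distance correlation (dCor) for 1-D variables, from scratch.
import math
import random

def _centered_dist(v):
    """Double-centered pairwise distance matrix A_ij = d_ij - m_i - m_j + m."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]          # row means (= column means: symmetric)
    grand = sum(row) / n                   # grand mean
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def dcor(x, y):
    """Sample distance correlation in [0, 1]."""
    n = len(x)
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dcov2 = max(dcov2, 0.0)                # guard against floating-point dust
    dvx = sum(a * a for r in A for a in r) / n**2
    dvy = sum(b * b for r in B for b in r) / n**2
    if dvx * dvy == 0:
        return 0.0
    return math.sqrt(dcov2 / math.sqrt(dvx * dvy))

random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(150)]
y_quad = [v * v for v in x]                            # nonlinear, Pearson-invisible
y_rand = [random.uniform(-1.0, 1.0) for _ in range(150)]

print(round(dcor(x, y_quad), 2))   # substantial: detects the quadratic tie
print(round(dcor(x, y_rand), 2))   # small: near-independence
```

This reproduces the qualitative behavior seen above: dCor is 1 for linear relationships, clearly positive for $y=x^{2}$ (where Pearson's $\rho\approx 0$), and small but, like most sample measures, positively biased under independence.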

Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.