
SCIENTIA SINICA Informationis, Volume 49, Issue 9: 1159-1174(2019) https://doi.org/10.1360/N112018-00331

Protein function prediction based on zero-one matrix factorization

  • Received: Dec 12, 2018
  • Accepted: Feb 28, 2019
  • Published: Sep 4, 2019

Abstract

Accurately annotating the functions of proteins is one of the key tasks of functional genomics. A large portion of the functional annotations of proteins is missing, and the functional label space is large. Label compression methods have been proposed and applied to protein function prediction; however, the compressed labels are hard to interpret, and these methods inherit the problem of thresholding predicted labels in multi-label learning. To address these problems, this paper proposes a protein function prediction method based on zero-one matrix factorization (ZOMF). ZOMF first factorizes the protein-function association matrix into two low-rank zero-one matrices to explore the latent relationships between proteins and labels. It then defines two smoothness terms on these low-rank matrices, one based on protein-protein interactions and one based on the structural relationships between labels, to guide their optimization. Finally, it predicts protein functions by reconstructing the association matrix from the two optimized low-rank matrices. Experimental results on four model species (yeast, Arabidopsis, mouse, and human) show that ZOMF predicts protein functions more accurately than the compared algorithms. Moreover, ZOMF does not need to threshold the reconstructed matrix, and the compressed zero-one labels have a more intuitive interpretation.
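As a minimal sketch of the idea described in the abstract (not the paper's implementation), the Python snippet below reconstructs an association matrix from two small zero-one factors and evaluates a graph-smoothness penalty on one of them. The factor values, the PPI weights `W`, the use of a Boolean product for reconstruction, and the Laplacian form of the smoothness term are all illustrative assumptions; the abstract does not give the exact formulas.

```python
import numpy as np

# Hypothetical toy factors: 4 proteins, 2 latent groups, 5 GO labels.
A = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]], dtype=bool)            # protein-by-group memberships
B = np.array([[1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0]], dtype=bool)   # group-by-label memberships

# Assumed Boolean product: Y_star[i, s] is True if some group j has
# both A[i, j] and B[j, s] set. The result is already zero-one.
Y_star = np.any(A[:, :, None] & B[None, :, :], axis=1)

# Assumed smoothness term on A: interacting proteins should have similar rows.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)     # toy symmetric PPI weights
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian D - W
A_num = A.astype(float)
smooth_A = np.trace(A_num.T @ L @ A_num)      # sum over edges of W_ij * ||a_i - a_j||^2

print(Y_star.astype(int))
print(smooth_A)
```

Under these assumptions the reconstructed matrix is binary without any threshold, and each column of `A` can be read as a group of proteins that shares the labels marked in the matching row of `B`.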


Funded by

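National Natural Science Foundation of China (61872300, 61741217, 61873214, 61871020)

National Key Research and Development Program of China (2016YFC0901902)

Fundamental Research Funds for the Central Universities (XDJK2019B024)

Chongqing Basic and Frontier Research Program (cstc2018jcyjAX0228, cstc2016jcyjA0351)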


  • Figure 1

    An example of zero-one matrix factorization

  • Figure 2

    (Color online) Sensitivity analysis of low-rank parameter $k$ on Yeast. (a) CC (Fmax); (b) MF (Fmax); (c) BP (Fmax); (d) CC (Smin); (e) MF (Smin); (f) BP (Smin)

  • Figure 3

    (Color online) Sensitivity analysis of weight parameters $\alpha$ and $\beta$. (a) Yeast CC (Fmax); (b) Arabidopsis CC (Fmax); (c) Mouse CC (Fmax); (d) Yeast CC (Smin); (e) Arabidopsis CC (Smin); (f) Mouse CC (Smin)

  • Table 1   Statistics of functional annotations of proteins
    Species      Proteins ($n$)  Branch  2016 (Avg $\pm$ std)         2017 (Avg $\pm$ std)         Labels ($c$)
    Yeast        6017            BP      55849 (9.28 $\pm$ 12.57)     56971 (9.47 $\pm$ 13.14)     2036
    Yeast        6017            MF      15783 (2.62 $\pm$ 3.56)      15899 (2.64 $\pm$ 3.58)      777
    Yeast        6017            CC      17872 (2.97 $\pm$ 4.30)      19765 (3.28 $\pm$ 4.58)      543
    Arabidopsis  9228            BP      41486 (4.50 $\pm$ 10.12)     47159 (5.11 $\pm$ 11.53)     1649
    Arabidopsis  9228            MF      11517 (1.25 $\pm$ 3.18)      14634 (1.59 $\pm$ 3.84)      600
    Arabidopsis  9228            CC      13009 (1.41 $\pm$ 3.86)      14012 (1.52 $\pm$ 4.07)      321
    Mouse        5585            BP      125100 (22.40 $\pm$ 35.41)   148721 (26.63 $\pm$ 42.03)   5077
    Mouse        5585            MF      23014 (4.12 $\pm$ 5.68)      28746 (5.15 $\pm$ 6.84)      1098
    Mouse        5585            CC      20842 (3.73 $\pm$ 5.36)      28118 (5.03 $\pm$ 7.04)      731
    Human        16073           BP      153772 (9.57 $\pm$ 18.72)    170727 (10.62 $\pm$ 20.57)   5408
    Human        16073           MF      35524 (2.21 $\pm$ 3.42)      39028 (2.43 $\pm$ 3.63)      1626
    Human        16073           CC      23228 (1.45 $\pm$ 3.01)      27305 (1.70 $\pm$ 3.28)      769
  • Algorithm 1   ZOMF algorithm

    Require: Protein-function association matrix $\textit{Y}$, protein-protein interaction matrix $\textit{W}$, GO adjacency matrix $\textit{G}$, low-rank parameter $k$, weight parameters $\alpha$ and $\beta$;

    Output: Predicted protein-function association matrix $\textit{Y}^*$;

    Initialize $\lambda=10^{-16}$, $\varepsilon=0.01$;

    Randomly initialize matrices $\textit{A}$ and $\textit{B}$ in the range $(0,1)$;

    Normalize matrices $\textit{A}$ and $\textit{B}$ according to Eq. (12);

    DO

        Update matrix $\textit{A}$ according to Eq. (10);

        Update matrix $\textit{B}$ according to Eq. (11);

        $\lambda=10\lambda$;

        Normalize matrices $\textit{A}$ and $\textit{B}$ according to Eq. (12);

    WHILE $(\textit{A}_{ik}^2-\textit{A}_{ik})^2+(\textit{B}_{ks}^2-\textit{B}_{ks})^2 \geq \varepsilon$

    Predict protein function using Eq. (7) and return matrix $\textit{Y}^*$.
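The control flow of Algorithm 1 can be sketched in Python as follows. Equations (7) and (10)-(12) are not reproduced in this section, so the update and normalization steps are passed in as callables, and the demo uses a toy stand-in (`toward_binary`, a hypothetical helper) that simply moves each entry halfway toward its nearest binary value; it is not the paper's update rule. Aggregating the stopping test as a sum over all entries and the `max_iter` guard are also assumptions.

```python
import numpy as np

def binarity_gap(A, B):
    """Sum of (x^2 - x)^2 over all entries of A and B; zero iff both are 0/1."""
    return float(np.sum((A**2 - A) ** 2) + np.sum((B**2 - B) ** 2))

def zomf_loop(A, B, update_A, update_B, normalize,
              lam=1e-16, eps=0.01, max_iter=100):
    """Do-while skeleton: update, grow the penalty weight, normalize, test."""
    A, B = normalize(A, B)
    for n_iter in range(1, max_iter + 1):
        A = update_A(A, B, lam)          # stands in for Eq. (10)
        B = update_B(A, B, lam)          # stands in for Eq. (11)
        lam *= 10                        # penalty weight grows tenfold per pass
        A, B = normalize(A, B)           # stands in for Eq. (12)
        if binarity_gap(A, B) < eps:     # stop once entries are nearly zero-one
            break
    return A, B, lam, n_iter

def toward_binary(M):
    """Toy stand-in update: move every entry halfway to its nearest of {0, 1}."""
    return M + 0.5 * (np.round(M) - M)

rng = np.random.default_rng(0)
A0 = rng.uniform(0.05, 0.95, size=(6, 2))   # random start inside (0, 1)
B0 = rng.uniform(0.05, 0.95, size=(2, 4))

A, B, lam, n_iter = zomf_loop(
    A0, B0,
    update_A=lambda A, B, lam: toward_binary(A),
    update_B=lambda A, B, lam: toward_binary(B),
    normalize=lambda A, B: (A, B),           # identity placeholder
)
print(n_iter, binarity_gap(A, B))
```

With real update rules in place of the stand-ins, the same skeleton would return factors whose entries are close enough to zero or one to be rounded before the reconstruction step.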

  • Table 2   The results on Yeast
    Metric             Branch  MV      ClusDCA  NewGOA  HPhash  ZOMF(Y)  ZOMF(GO)  ZOMF(PPI)  ZOMF
    MicroF1            BP      0.9368  0.9475   0.9455  0.9401  0.9351   0.9351    0.9491     0.9510
    MicroF1            MF      0.9378  0.9470   0.9491  0.9397  0.9363   0.9376    0.9502     0.9554
    MicroF1            CC      0.8911  0.8995   0.8965  0.8731  0.9129   0.9138    0.9193     0.9200
    MacroF1            BP      0.9352  0.9397   0.9154  0.9252  0.9342   0.9353    0.9435     0.9435
    MacroF1            MF      0.9347  0.9464   0.9275  0.9236  0.9387   0.9391    0.9474     0.9474
    MacroF1            CC      0.9192  0.9252   0.8952  0.8956  0.9366   0.9376    0.9449     0.9452
    Fmax               BP      0.8861  0.9508   0.9552  0.9716  0.9458   0.9497    0.9652     0.9756
    Fmax               MF      0.8229  0.8706   0.8647  0.8814  0.8753   0.8759    0.8852     0.8872
    Fmax               CC      0.7250  0.7684   0.7765  0.8162  0.8070   0.8070    0.8185     0.8196
    Smin $\downarrow$  BP      1.5707  0.5481   0.3948  0.3986  0.4673   0.4677    0.3689     0.3603
    Smin $\downarrow$  MF      0.4110  0.2011   0.2012  0.1740  0.1945   0.1980    0.1545     0.1543
    Smin $\downarrow$  CC      0.3677  0.1625   0.1675  0.1357  0.1317   0.1232    0.1093     0.1093
  • Table 3   The results on Arabidopsis
    Metric             Branch  MV      ClusDCA  NewGOA  HPhash  ZOMF(Y)  ZOMF(GO)  ZOMF(PPI)  ZOMF
    MicroF1            BP      0.7977  0.8511   0.8479  0.8325  0.8818   0.8822    0.8850     0.8851
    MicroF1            MF      0.7344  0.7724   0.7709  0.7452  0.8250   0.8260    0.8224     0.8259
    MicroF1            CC      0.8551  0.8863   0.8877  0.8651  0.9099   0.9078    0.9170     0.9171
    MacroF1            BP      0.8162  0.8593   0.8016  0.8337  0.8855   0.8856    0.8868     0.8869
    MacroF1            MF      0.7955  0.8044   0.7372  0.7771  0.8424   0.8432    0.8472     0.8508
    MacroF1            CC      0.8184  0.8370   0.8096  0.7893  0.8556   0.8610    0.8561     0.8639
    Fmax               BP      0.8337  0.8928   0.9039  0.9146  0.9054   0.9057    0.9068     0.9068
    Fmax               MF      0.7319  0.7643   0.7605  0.8093  0.8087   0.8087    0.7910     0.7910
    Fmax               CC      0.6341  0.5882   0.6039  0.7144  0.7069   0.7101    0.7057     0.7144
    Smin $\downarrow$  BP      2.1709  1.0860   1.0391  1.0097  1.0065   1.0056    0.9968     0.9964
    Smin $\downarrow$  MF      0.9126  0.7410   0.7707  0.6449  0.6130   0.5930    0.6005     0.6003
    Smin $\downarrow$  CC      0.4977  0.5761   0.5077  0.2576  0.2540   0.2546    0.2600     0.2476
  • Table 4   The results on Mouse
    Metric             Branch  MV      ClusDCA  NewGOA  HPhash  ZOMF(Y)  ZOMF(GO)  ZOMF(PPI)  ZOMF
    MicroF1            BP      0.7646  0.8229   0.8211  0.8131  0.8527   0.8538    0.8682     0.8703
    MicroF1            MF      0.7482  0.7962   0.7942  0.7827  0.8575   0.8575    0.8580     0.8581
    MicroF1            CC      0.7061  0.7541   0.7542  0.7263  0.8138   0.8137    0.8197     0.8202
    MacroF1            BP      0.7689  0.8284   0.7558  0.8015  0.8569   0.8570    0.8572     0.8576
    MacroF1            MF      0.7651  0.8098   0.7371  0.7833  0.8283   0.8342    0.8352     0.8390
    MacroF1            CC      0.7464  0.7706   0.7077  0.7498  0.8108   0.8137    0.8195     0.8196
    Fmax               BP      0.7890  0.8582   0.8537  0.8916  0.8775   0.8776    0.8806     0.8806
    Fmax               MF      0.7091  0.7862   0.7432  0.7983  0.7997   0.7997    0.8001     0.8001
    Fmax               CC      0.6334  0.6697   0.6207  0.7062  0.7038   0.7037    0.7092     0.7093
    Smin $\downarrow$  BP      7.2180  6.1819   5.2861  5.4010  2.5646   2.5648    2.5440     2.5345
    Smin $\downarrow$  MF      1.1973  0.7469   0.8482  0.8490  0.6953   0.6953    0.6881     0.6852
    Smin $\downarrow$  CC      0.9895  0.7845   0.9694  0.7990  0.6225   0.6126    0.6093     0.6071
  • Table 5   The results on Human
    Metric             Branch  MV      ClusDCA  NewGOA  HPhash  ZOMF(Y)  ZOMF(GO)  ZOMF(PPI)  ZOMF
    MicroF1            BP      0.8538  0.8862   0.8876  0.8819  0.9051   0.9051    0.9131     0.9139
    MicroF1            MF      0.8638  0.8942   0.8993  0.8883  0.9130   0.9134    0.9219     0.9228
    MicroF1            CC      0.8356  0.8623   0.8608  0.8431  0.8752   0.8751    0.8854     0.8951
    MacroF1            BP      0.8699  0.9015   0.8480  0.8865  0.9120   0.9121    0.9139     0.9148
    MacroF1            MF      0.8792  0.9153   0.8759  0.8932  0.9201   0.9202    0.9225     0.9225
    MacroF1            CC      0.8478  0.8776   0.8301  0.8520  0.8833   0.8834    0.8906     0.8935
    Fmax               BP      0.7538  0.8637   0.8428  0.8959  0.8812   0.8812    0.8862     0.8863
    Fmax               MF      0.6493  0.7408   0.6902  0.7587  0.7494   0.7499    0.7527     0.7559
    Fmax               CC      0.4598  0.5502   0.4692  0.5643  0.5524   0.5623    0.5649     0.5688
    Smin $\downarrow$  BP      3.1853  1.6946   1.4567  1.3245  0.7708   0.7701    0.7504     0.7510
    Smin $\downarrow$  MF      0.5476  0.2589   0.3674  0.2541  0.2067   0.2060    0.1995     0.1975
    Smin $\downarrow$  CC      0.4465  0.2037   0.3725  0.2298  0.1697   0.1698    0.1573     0.1474
  • Table 6   Runtime costs of the five compared methods on different datasets
    Species      Branch  MV     ClusDCA  NewGOA   HPhash     ZOMF
    Yeast        BP      0.71   91.83    605.14   2548.95    88.56
    Yeast        MF      1.28   67.61    224.92   267.09     34.86
    Yeast        CC      0.92   64.04    89.00    131.76     40.23
    Arabidopsis  BP      0.59   54.47    537.48   1909.87    46.78
    Arabidopsis  MF      0.33   32.27    207.91   163.81     26.52
    Arabidopsis  CC      0.21   22.35    83.88    58.36      12.13
    Mouse        BP      1.76   208.15   725.90   55146.48   228.33
    Mouse        MF      1.18   85.92    266.70   690.48     51.41
    Mouse        CC      1.32   85.18    114.63   300.29     36.31
    Human        BP      9.17   540.57   1292.68  64863.61   968.58
    Human        MF      10.31  526.34   411.06   1308.74    470.72
    Human        CC      10.26  591.77   183.00   257.22     310.38
    Total                38.04  2370.50  4742.30  127646.66  2314.81

Copyright 2019 Science China Press Co., Ltd. All rights reserved.
