SCIENTIA SINICA Informationis, Volume 47, Issue 10: 1349-1368(2017) https://doi.org/10.1360/N112017-00009

Predicting irrelevant functions of proteins based on dimensionality reduction

More info
  • Received: Jan 10, 2017
  • Accepted: Mar 16, 2017
  • Published: Aug 31, 2017

Abstract

Proteins are the foundation of many life processes, and accurately annotating their biological functions can significantly advance the life sciences. Current function prediction models focus on the knowledge that proteins perform specific functions (positive examples), but ignore the knowledge that some functions are irrelevant to a protein (negative examples). Recent research indicates that incorporating negative examples can reduce the complexity and improve the accuracy of protein function prediction. In this paper, we propose an approach for predicting irrelevant functions of proteins based on dimensionality reduction (IFDR). First, IFDR performs random walks on a protein-protein interaction (PPI) network and on the corresponding protein-function association matrices, in order to explore the underlying relationships between proteins and to model their missing functional annotations. Next, IFDR uses singular value decomposition to project these matrices into low-dimensional numerical matrices. Finally, IFDR uses semi-supervised regression to predict negative examples of proteins. Experiments on S. cerevisiae, H. sapiens, and A. thaliana data demonstrate that IFDR predicts negative examples more accurately than related methods, and that dimensionality reduction in both the network space and the label space improves the accuracy of negative example prediction.
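The three stages named in the abstract (random walk, singular value decomposition, regression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the restart probability, the ridge penalty, and the use of plain ridge regression in place of the paper's semi-supervised regression are all assumptions made for illustration. `W` is assumed to be a symmetric PPI adjacency matrix and `Y` a binary protein-function matrix.

```python
import numpy as np

def random_walk_with_restart(W, restart=0.5, n_iter=50):
    """Diffuse from every node of adjacency matrix W (assumed n x n)."""
    W = np.asarray(W, dtype=float)
    col_sums = W.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    P = W / col_sums                       # column-stochastic transition matrix
    n = W.shape[0]
    S = np.eye(n)
    for _ in range(n_iter):
        S = (1.0 - restart) * (P @ S) + restart * np.eye(n)
    return S                               # column j: visiting distribution from node j

def svd_embed(M, k):
    """Project the rows of M onto its top-k singular directions."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * np.sqrt(s[:k])

def predict_scores(W, Y, d=2, c=2, ridge=1e-2):
    """Return an n x m score matrix; low scores suggest irrelevant functions."""
    S = random_walk_with_restart(W)
    X = svd_embed(S, d)                    # low-dimensional network features (n x d)
    Y_smooth = S @ np.asarray(Y, float)    # assumption: diffusion models missing labels
    U, s, Vt = np.linalg.svd(Y_smooth, full_matrices=False)
    Z = U[:, :c] * s[:c]                   # compressed label space (n x c)
    decoder = Vt[:c]                       # maps compressed labels back to m functions
    # Ridge regression stands in for the semi-supervised regression of the paper.
    A = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Z)
    return X @ A @ decoder

def select_negatives(scores, Y, l):
    """Pick the l lowest-scoring (protein, function) pairs that are not positives."""
    masked = np.where(np.asarray(Y) > 0, np.inf, scores)
    flat = np.argsort(masked, axis=None)[:l]
    return [tuple(int(v) for v in np.unravel_index(i, masked.shape)) for i in flat]
```

Here `d` and `c` play the role of the target dimensionalities of the network space and the label space. In this sketch, `l` should not exceed the number of unannotated pairs, since known positives are masked out before ranking.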


Funded by

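National Natural Science Foundation of China (61402378, 61571163, 61532014, 61671189)

Chongqing Graduate Student Research Innovation Project (CYS16070)

Chongqing Basic and Frontier Research Program (cstc2014jcyjA40031, cstc2016jcyjA0351)

Fundamental Research Funds for the Central Universities (2362015X- K07, XDJK2016B009, XDJK2017D061)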


  • Figure 1

    (Color online) Influence of the target dimensionality on H. sapiens. $d$ and $c$ denote the target dimensionality of the PPI network and of the function label space, respectively. (a) BP; (b) CC; (c) MF

  • Figure 2

    (Color online) Runtime cost of the four compared methods on different datasets. (a) H. sapiens; (b) S. cerevisiae; (c) A. thaliana

  • Table 1   Dataset statistics. Avg$\pm$Std is the average number of annotations per protein and its standard deviation
    | Species | Proteins ($n$) | Branch | Functions ($m$) | Positives (Negatives) | Avg$\pm$Std |
    |---|---|---|---|---|---|
    | H. sapiens | 16082 | BP | 15373 | 790787 (16324) | 49.17$\pm$63.14 |
    | H. sapiens | 16082 | CC | 2931 | 307635 (26963) | 19.13$\pm$34.49 |
    | H. sapiens | 16082 | MF | 5990 | 158369 (12042) | 9.84$\pm$18.55 |
    | S. cerevisiae | 6017 | BP | 5256 | 222754 (1374) | 37.02$\pm$31.65 |
    | S. cerevisiae | 6017 | CC | 2566 | 120392 (5456) | 20.00$\pm$23.85 |
    | S. cerevisiae | 6017 | MF | 2501 | 47558 (799) | 7.90$\pm$6.89 |
    | A. thaliana | 9289 | BP | 5948 | 229193 (3132) | 24.67$\pm$28.01 |
    | A. thaliana | 9289 | CC | 2397 | 179944 (45523) | 19.37$\pm$31.44 |
    | A. thaliana | 9289 | MF | 2553 | 67695 (1846) | 7.29$\pm$9.29 |
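  • Table 2   FNs on H. sapiens for different numbers ($l$) of predicted negative examples. Negative examples are predicted from the annotations available in 2015 and validated against the updated annotations in 2016
    | Data set | Method | $l$=10k | 20k | 30k | 40k | 50k | 60k | 70k | 80k |
    |---|---|---|---|---|---|---|---|---|---|
    | BP | IFDR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | BP | IFDR-DCA | 5 | 23 | 23 | 24 | 24 | 26 | 26 | 26 |
    | BP | clusDCA | 44 | 75 | 99 | 121 | 139 | 157 | 174 | 189 |
    | BP | ProPN | 3 | 24 | 33 | 35 | 43 | 44 | 51 | 51 |
    | BP | NegGOA | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
    | BP | SNOB | 4 | 4 | 12 | 17 | 17 | 18 | 20 | 24 |
    | BP | Random | 3.94 | 7.93 | 12.53 | 16.42 | 20.02 | 24.95 | 29.56 | 33.14 |
    | CC | IFDR | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 |
    | CC | IFDR-DCA | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 3 |
    | CC | clusDCA | 81 | 159 | 227 | 276 | 332 | 373 | 423 | 455 |
    | CC | ProPN | 1 | 2 | 5 | 6 | 14 | 15 | 15 | 19 |
    | CC | NegGOA | 0 | 0 | 0 | 3 | 4 | 4 | 4 | 6 |
    | CC | SNOB | 18 | 18 | 18 | 18 | 18 | 18 | 22 | 22 |
    | CC | Random | 5.91 | 12.02 | 17.21 | 23.82 | 30.01 | 35.11 | 40.86 | 47.23 |
    | MF | IFDR | 0 | 3 | 4 | 5 | 6 | 8 | 8 | 8 |
    | MF | IFDR-DCA | 1 | 1 | 2 | 2 | 2 | 4 | 5 | 6 |
    | MF | clusDCA | 12 | 15 | 22 | 29 | 35 | 40 | 44 | 51 |
    | MF | ProPN | 11 | 46 | 53 | 69 | 70 | 74 | 76 | 76 |
    | MF | NegGOA | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 8 |
    | MF | SNOB | 38 | 38 | 38 | 38 | 38 | 38 | 39 | 41 |
    | MF | Random | 1.82 | 4.27 | 6.48 | 8.35 | 10.98 | 12.45 | 13.65 | 17.04 |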
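
The validation scheme described in the Table 2 caption can be sketched as follows. The sketch assumes that an FN is a predicted negative (protein, function) pair that appears as a positive annotation in the later release. The function name, the pair representation, and the toy identifiers are hypothetical and only illustrate the counting, not the authors' evaluation code.

```python
def count_false_negatives(ranked_negatives, later_positives, cutoffs):
    """Count predicted negatives that a later annotation release marks as positive.

    ranked_negatives: (protein, function) pairs, most confident negative first.
    later_positives:  pairs annotated as positive in the later release.
    cutoffs:          values of l, the number of top predictions to inspect.
    """
    later = set(later_positives)
    return {l: sum(pair in later for pair in ranked_negatives[:l]) for l in cutoffs}

# Toy data with hypothetical identifiers.
ranked = [("P1", "GO:a"), ("P2", "GO:b"), ("P3", "GO:c"), ("P1", "GO:d")]
later = [("P2", "GO:b"), ("P1", "GO:d")]
fn_counts = count_false_negatives(ranked, later, cutoffs=[1, 2, 4])
```

A lower count at a given `l` means fewer of the predicted negatives were contradicted by the updated annotations.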
  • Table 3   Results of protein function prediction on multiple networks of Yeast and Human datasets
    | Metric | Branch | Yeast: IFDR | Yeast: ProPN | Yeast: NegGOA | Yeast: SNOB | Yeast: IFDR-DCA | Human: IFDR | Human: ProPN | Human: NegGOA | Human: SNOB | Human: IFDR-DCA |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | MacroF1 | BP | 0.8518 | 0.8435 | 0.7622 | 0.7626 | 0.7723 | 0.8212 | 0.8182 | 0.8182 | 0.8099 | 0.7766 |
    | MacroF1 | CC | 0.7459 | 0.6406 | 0.5381 | 0.5087 | 0.5452 | 0.8221 | 0.8124 | 0.6892 | 0.6473 | 0.6341 |
    | MacroF1 | MF | 0.9118 | 0.9104 | 0.8252 | 0.7928 | 0.8323 | 0.8566 | 0.8512 | 0.8408 | 0.8345 | 0.7940 |
    | RAccuracy | BP | 0.2292 | 0.2118 | 0.2231 | 0.2219 | 0.2211 | 0.2960 | 0.2905 | 0.2905 | 0.2895 | 0.2922 |
    | RAccuracy | CC | 0.4217 | 0.4026 | 0.4192 | 0.3800 | 0.4131 | 0.4094 | 0.4068 | 0.4082 | 0.4009 | 0.4041 |
    | RAccuracy | MF | 0.2777 | 0.2500 | 0.2706 | 0.2691 | 0.2694 | 0.4722 | 0.4616 | 0.4676 | 0.4408 | 0.4267 |
    | AUC | BP | 0.9593 | 0.9616 | 0.9653 | 0.9697 | 0.9116 | 0.9203 | 0.9305 | 0.9381 | 0.9329 | 0.8845 |
    | AUC | CC | 0.9789 | 0.9782 | 0.9797 | 0.9808 | 0.7728 | 0.9360 | 0.9436 | 0.9468 | 0.9442 | 0.8203 |
    | AUC | MF | 0.9817 | 0.9801 | 0.9809 | 0.9817 | 0.9505 | 0.9422 | 0.9489 | 0.9528 | 0.9496 | 0.9032 |
    | Fmax | BP | 0.6902 | 0.7758 | 0.7763 | 0.6388 | 0.7036 | 0.8227 | 0.7915 | 0.8111 | 0.6547 | 0.7710 |
    | Fmax | CC | 0.7602 | 0.7870 | 0.8038 | 0.7113 | 0.7717 | 0.7888 | 0.7779 | 0.7878 | 0.7191 | 0.7880 |
    | Fmax | MF | 0.8145 | 0.7953 | 0.8018 | 0.7154 | 0.8028 | 0.8497 | 0.8271 | 0.8318 | 0.7840 | 0.8388 |
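  • Table 4   Results of protein function prediction by INGA without and with the negative examples predicted by IFDR
    | Metric | Branch | S. cerevisiae: INGA | S. cerevisiae: INGA-Neg | H. sapiens: INGA | H. sapiens: INGA-Neg | A. thaliana: INGA | A. thaliana: INGA-Neg |
    |---|---|---|---|---|---|---|---|
    | AUC | BP | 0.1509 | 0.1507 | 0.3819 | 0.3815 | 0.2210 | 0.2212 |
    | AUC | CC | 0.2050 | 0.2051 | 0.5030 | 0.5053 | 0.2754 | 0.2711 |
    | AUC | MF | 0.1858 | 0.1851 | 0.6634 | 0.6624 | 0.2360 | 0.2393 |
    | Fmax | BP | 0.5383 | 0.5385 | 0.4558 | 0.4558 | 0.4706 | 0.4703 |
    | Fmax | CC | 0.7020 | 0.7086 | 0.5519 | 0.5558 | 0.7529 | 0.7551 |
    | Fmax | MF | 0.6462 | 0.6470 | 0.5702 | 0.5729 | 0.5840 | 0.5827 |
    | RankingLoss$^{\rm a)}$ | BP | 0.1163 | 0.1127 | 0.0867 | 0.0818 | 0.0842 | 0.0766 |
    | RankingLoss$^{\rm a)}$ | CC | 0.0949 | 0.0902 | 0.1108 | 0.0912 | 0.0333 | 0.0280 |
    | RankingLoss$^{\rm a)}$ | MF | 0.0547 | 0.0512 | 0.0937 | 0.0857 | 0.1763 | 0.1512 |
    | Smin$^{\rm a)}$ | BP | 12.8237 | 12.8212 | 26.4586 | 26.4523 | 14.043 | 14.0204 |
    | Smin$^{\rm a)}$ | CC | 5.1867 | 5.1677 | 6.8148 | 6.8088 | 5.0567 | 4.9987 |
    | Smin$^{\rm a)}$ | MF | 4.2951 | 4.2387 | 8.1087 | 8.0944 | 5.2668 | 5.1093 |
    a) Lower values indicate better performance.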
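  • Table 5   Results of negative example prediction on H. sapiens by different variants of IFDR, for different numbers ($l$) of predicted negative examples
    | Data set | Method | $l$=10k | 20k | 30k | 40k | 50k | 60k | 70k | 80k |
    |---|---|---|---|---|---|---|---|---|---|
    | BP | IFDR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | BP | IFDR-F | 1 | 1 | 3 | 5 | 13 | 13 | 17 | 18 |
    | BP | IFDR-P | 30 | 63 | 101 | 134 | 160 | 192 | 228 | 264 |
    | BP | IFDR-N | 28 | 52 | 120 | 174 | 211 | 259 | 300 | 329 |
    | BP | IFDR-FSVD | 7 | 7 | 12 | 14 | 15 | 15 | 20 | 21 |
    | CC | IFDR | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 |
    | CC | IFDR-F | 0 | 1 | 2 | 2 | 5 | 6 | 7 | 7 |
    | CC | IFDR-P | 34 | 88 | 118 | 158 | 228 | 273 | 349 | 440 |
    | CC | IFDR-N | 23 | 83 | 124 | 173 | 229 | 293 | 363 | 449 |
    | CC | IFDR-FSVD | 0 | 3 | 8 | 12 | 16 | 16 | 18 | 18 |
    | MF | IFDR | 0 | 3 | 4 | 5 | 6 | 8 | 8 | 8 |
    | MF | IFDR-F | 7 | 11 | 15 | 15 | 17 | 20 | 21 | 22 |
    | MF | IFDR-P | 10 | 23 | 49 | 66 | 83 | 105 | 124 | 132 |
    | MF | IFDR-N | 12 | 26 | 30 | 36 | 46 | 53 | 58 | 63 |
    | MF | IFDR-FSVD | 3 | 3 | 3 | 4 | 4 | 5 | 8 | 8 |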
