SCIENTIA SINICA Informationis, Volume 47, Issue 10: 1349-1368 (2017) https://doi.org/10.1360/N112017-00009

Predicting irrelevant functions of proteins based on dimensionality reduction

  • Received Jan 10, 2017
  • Accepted Mar 16, 2017
  • Published Aug 31, 2017


Proteins underpin many life processes, and accurately annotating their biological functions can substantially advance the life sciences. Current function prediction models focus on the knowledge that a protein performs specific functions (positive examples) but ignore the knowledge that some functions are irrelevant to a protein (negative examples). Recent research indicates that incorporating negative examples can reduce the complexity and improve the accuracy of protein function prediction. In this paper, we propose an approach for predicting irrelevant functions of proteins based on dimensionality reduction (IFDR). First, IFDR performs random walks on the protein-protein interaction (PPI) network and on the corresponding protein-function association matrices, in order to explore the underlying relationships between proteins and to model their missing functional annotations. Next, IFDR uses singular value decomposition to project the resulting matrices into low-dimensional numerical matrices. Finally, IFDR uses semi-supervised regression to predict negative examples of proteins. Experiments on S. cerevisiae, H. sapiens, and A. thaliana data show that IFDR predicts negative examples more accurately than related methods, and that dimensionality reduction in both the network space and the label space improves the accuracy of negative example prediction.
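The first two steps of the pipeline described above (random walks on the network, then singular value decomposition) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the closed-form random walk with restart, the restart probability of 0.5, the square-root scaling of singular values, and the function names `random_walk_with_restart` and `reduce_dimension` are all assumptions made for illustration. The semi-supervised regression step is not sketched here.

```python
import numpy as np


def random_walk_with_restart(adj, restart=0.5):
    """Diffusion matrix of a random walk with restart (assumed variant).

    Row i holds the stationary distribution of a walk that returns to
    node i with probability `restart` at every step.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    row_sums = adj.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    transition = adj / row_sums            # row-stochastic transition matrix
    # Closed form: S = r * (I - (1 - r) * P)^{-1}
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * transition)


def reduce_dimension(mat, k):
    """Project the rows of `mat` onto k dimensions with a truncated SVD."""
    u, s, _ = np.linalg.svd(mat, full_matrices=False)
    # Scaling by sqrt(s) is one common convention, assumed here.
    return u[:, :k] * np.sqrt(s[:k])


# Toy data (hypothetical): a 4-protein PPI network and 3 function labels.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
labels = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)

diffusion = random_walk_with_restart(adj, restart=0.5)
propagated_labels = diffusion @ labels      # assumed way to fill in missing annotations
network_embedding = reduce_dimension(diffusion, k=2)        # target dimensionality d
label_embedding = reduce_dimension(propagated_labels, k=2)  # target dimensionality c
```

In this sketch the two target dimensionalities play the roles of $d$ (network space) and $c$ (label space) named in the Figure 1 caption; the values used here are arbitrary.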

Funded by

Fundamental Research Funds for the Central Universities (2362015X- K07)

  • Figure 1

    (Color online) Influence of the target dimensionality on H. sapiens. $d$ and $c$ denote the target dimensionality of the PPI network and of the function label space, respectively. (a) BP; (b) CC; (c) MF

  • Figure 2

    (Color online) Runtime cost of the four compared methods on different datasets. (a) H. sapiens; (b) S. cerevisiae; (c) A. thaliana
