SCIENTIA SINICA Informationis, Volume 47 , Issue 10 : 1334-1348(2017) https://doi.org/10.1360/N112017-00012

Computing exact permutation ${\boldsymbol~p}$-values for phosphorylation motifs

More info
  • ReceivedJan 12, 2017
  • AcceptedFeb 9, 2017
  • PublishedAug 30, 2017


Protein phosphorylation motifs refer to position-specific amino acid patterns near phosphorylation sites. Mining phosphorylation motifs is an important task in the field of bioinformatics and several efficient methods have been proposed to uncover phosphorylation motifs. However, a large percentage of the phosphorylation motifs discovered by these algorithms are false positives. Using such motifs to perform further research will lead to inaccurate conclusions. Generally, statistical significance testing is an effective technique to filter out meaningless phosphorylation motifs. Among statistical significance testing methods, permutation testing is a commonly used method. Its usability and popularity can be attributed to its non-parametric nature. However, in permutation testing, several drawbacks narrow its range of usability. In this paper, we provide an analysis of these disadvantages and propose an algorithm called exact permutation $p$-values for phosphorylation motifs (EPPM) for generating an exact null distribution, from which the exact $p$-values of tested phosphorylation motifs can be calculated. Experiments on several datasets demonstrate that EPPM can successfully alleviate the aforementioned disadvantages and outperform the direct permutation-based method for several performance measures. To the best of our knowledge, there are still no methods in the literature that can compute exact permutation $p$-values for assessing phosphorylation motifs.

Funded by



[1] Manning G, Plowman G D, Hunter T. Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 2002, 27: 514-520 CrossRef Google Scholar

[2] Turk B E. Understanding and exploiting substrate recognition by protein kinases.. Curr Opin Chem Biol, 2008, 12: 4-10 CrossRef PubMed Google Scholar

[3] Yu Y, Yoon S O, Poulogiannis G. Phosphoproteomic Analysis Identifies Grb10 as an mTORC1 Substrate That Negatively Regulates Insulin Signaling. Science, 2011, 332: 1322-1326 CrossRef PubMed ADS Google Scholar

[4] Huang Z Y, Yu Y L, Fang C Y, et al. Progress in identification of protein phosphorylation by mass spectrometry. J Chinese Mass Spectrome Soc, 2003, 24: 495--500. Google Scholar

[5] Schwartz D, Gygi S P. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets.. Nat Biotechnol, 2005, 23: 1391-1398 CrossRef PubMed Google Scholar

[6] Ritz A, Shakhnarovich G, Salomon A R. Discovery of phosphorylation motif mixtures in phosphoproteomics data.. Bioinformatics, 2009, 25: 14-21 CrossRef PubMed Google Scholar

[7] Liu X, Wu J, Gu F. Discriminative pattern mining and its applications in bioinformatics.. Brief Bioinform, 2015, 16: 884-900 CrossRef PubMed Google Scholar

[8] Discovery of Protein Phosphorylation Motifs through Exploratory Data Analysis. PLoS ONE, 2011, 6: e20025 CrossRef PubMed ADS Google Scholar

[9] Wang T, Kettenbach A N, Gerber S A. MMFPh: a maximal motif finder for phosphoproteomics datasets.. Bioinformatics, 2012, 28: 1562-1570 CrossRef PubMed Google Scholar

[10] He Z Y, Yang C, Guo G Y, et al. Motif-all: discovering all phosphorylation motifs. BMC Bioinform, 2011, 12: S22. Google Scholar

[11] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases. Santiago de Chile: Morgan Kaufmann, 1994. 487--499. Google Scholar

[12] Xiaoqing Liu , Jun Wu , Haipeng Gong . Mining Conditional Phosphorylation Motifs.. IEEE/ACM Trans Comput Biol Bioinf, 2014, 11: 915-927 CrossRef PubMed Google Scholar

[13] Gong H, He Z. Permutation methods for testing the significance of phosphorylation motifs. Stat Its Interface, 2012, 5: 61-73 CrossRef Google Scholar

[14] Han J, Cheng H, Xin D. Frequent pattern mining: current status and future directions. Data Min Knowl Disc, 2007, 15: 55-86 CrossRef Google Scholar

[15] Song W, Li J H, Xu Z Y, et al. Research on a new concise representation of frequent itemset and its mining algorithm. J Comput Res Dev, 2010, 47: 277--285. Google Scholar

[16] Han J W, Pei J, Yin Y W. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas: ACM, 2000. 1--12. Google Scholar

[17] Neyman J, Pearson E S. On the Problem of the Most Efficient Tests of Statistical Hypotheses. New York: Springer, 1992. 23--33. Google Scholar

[18] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc, 1995, 57: 25--133. Google Scholar

[19] Wu J, He Z Y, Gu F Y, et al. Computing exact permutation p-values for association rules. Inf Sci, 2016, 346: 146--162. Google Scholar

[20] Farriol-Mathis N, Garavelli J S, Boeckmann B. Annotation of post-translational modifications in the Swiss-Prot knowledge base.. PROTEOMICS, 2004, 4: 1537-1550 CrossRef PubMed Google Scholar

[21] Dinkel H, Chica C, Via A, et al. Phospho.ELM: a database of phosphorylation sites-update 2011. Nucleic Acids Res, 2011, 39: 261--267. Google Scholar

[22] Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.. J Mol Biol, 1999, 294: 1351-1362 CrossRef PubMed Google Scholar

[23] Dang T H, Van Leemput K, Verschoren A. Prediction of kinase-specific phosphorylation sites using conditional random fields.. Bioinformatics, 2008, 24: 2857-2864 CrossRef PubMed Google Scholar

  • Figure 1

    (Color online) (a) The original data set; (b) the permuted data set

  • Figure 3

    (Color online) The number of phosphorylation motifs whose $p$-values are zeros and non-zeros reported by DSP. (a) Non-kinase-specific; (b) CDK-specific; (c) PKA-specific

  • Figure 4

    (Color online) The distribution of the test statistic values

  • Figure 5

    (Color online) (a) The $p$-values of a randomly selected phosphorylation motifs with 100 different runs on CDK-specific data set reported by DSP; (b) the number of significant phosphorylation motifs returned from DSP and EPPM in 100 different runs on CDK-specific data set

  • Figure 6

    (Color online) The running time of DSP with different number of permutations and EPPM on PKA-specific data set

  • Figure 7

    (Color online) The running time of DSP and EPPM under different min$\_$sups on each data set. (a) Non-kinase-specific; (b) CDK-specific; (c) PKA-specific


    Algorithm 1 EPPM ($P$, $N$, min$\_$sup, $\alpha$)

    Require:前景集合$P$; 背景集合$N$; 最小支持度阈值min$\_$sup; 置信水平$\alpha$.


    Result $\leftarrow~\emptyset$, $T~\leftarrow~\emptyset$, sum$\_$ts $\leftarrow~0$;

    $M_1~\leftarrow$ FP-Growth ($P$, min$\_$sup);

    $M_2~\leftarrow$ FP-Growth ($D$, min$\_$sup);

    for each $m_1~\in~{M_1}$


    end for

    for each $m_2~\in~{M_2}$

    for $s~\leftarrow~L(m_2)$ to $U(m_2)$




    sum$\_$ts $\leftarrow$ sum$\_$ts+${\rm~fc}$;

    end for

    end for



    for each $~m_1~\in~{M_1}$



    end for


    return Result.

  • Table 1   The distributions of the $p$-values returned from DSP with the different number of permutations and EPPM on non-kinase-specific data set
    $p$-value 250 500 1000 2000 4000 EPPM
    $(0.1,~1]$ 3290 3290 3290 3270 3270 3270
    $(0.01,~0.1]$ 1700 1660 1660 1680 1680 1680
    $(0.001,~0.01]$ 905 900 900 889 889 889
    $(10^{-4},~0.001]$ 571 566 550 559 554 554
    $(10^{-5},~10^{-4}]$ 273 316 318 344 342 340
    $(10^{-6},~10^{-5}]$ 135 197 216 218 225 225
    $(10^{-7},~10^{-6}]$ 33 69 134 121 123 124
    $(0,~10^{-7}]$ 0 0 0 45 87 484
    $0$ 659 568 498 440 396 0

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备17057255号       京公网安备11010102003388号