SCIENTIA SINICA Informationis, Volume 47, Issue 11: 1538-1550(2017) https://doi.org/10.1360/N112017-00090

Protein function prediction through multi-instance multi-label transfer learning

  • Received May 14, 2017
  • Accepted Jun 30, 2017
  • Published Nov 13, 2017

Abstract

With the completion of various genome sequencing projects, the genomic sequences of many species have recently become available, and annotating the protein functions of these species is essential. Because few of their proteins have known functions, it is important to exploit related species that have a large number of functionally annotated proteins to assist in predicting the protein functions of these species. In this paper, we treat this task as a multi-instance multi-label transfer learning problem and propose the first multi-instance multi-label transfer learning framework to perform it. Experiments on two newly sequenced species demonstrate that transfer learning contributes to protein function prediction. Moreover, the closer the phylogenetic relationship between the source domain species and the target domain species, the better the performance of transfer learning.


Funded by

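National Natural Science Foundation of China (61571233, 61271082)

National Basic Research Program of China (973 Program) (2011CB302903)

Major Project of Natural Science Research of Jiangsu Higher Education Institutions (14KJA510003)

Key Research and Development Program of Jiangsu Province (BE2015700)

PAPD and CICAEET of Nanjing University of Information Science and Technology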



  • Figure 1

    Multi-instance multi-label transfer learning framework (TR-MIML), comprising a stage that re-weights data samples from the source domain and a classification model construction stage

  • Figure 2

    (Color online) Effect of transfer learning on protein function prediction for Rattus norvegicus, using five species with different phylogenetic relationships

  • Algorithm 1   Pseudo code of the TR-MIML learning framework

    $\widehat{h} = \text{TR-MIML}(D^T, D^S)$

    Input: $D^T$: target domain dataset; $D^S$: source domain dataset.

    Output: $\widehat{h}$: classifier.

    Steps:

    1. for $X_i^S$ in $D^S$ do
    2.     $f_i^S = \mathrm{miFV}(X_i^S)$;
    3. end for
    4. for $X_i^T$ in $D^T$ do
    5.     $f_i^T = \mathrm{miFV}(X_i^T)$;
    6. end for
    7. Compute $\beta$ by solving (2);
    8. Learn the classifier $\widehat{h}$ by solving (5).
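
The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustrative skeleton under strong assumptions: equations (2) and (5) are not reproduced in this excerpt, so `embed_bag` (mean pooling) stands in for miFV, `source_weights` (a kernel-similarity heuristic) stands in for solving (2), and the weighted kernel vote inside `tr_miml` stands in for solving (5). All function names, the `gamma` parameter, and the toy data are hypothetical, and each dataset is assumed to be a list of (bag, label vector) pairs.

```python
import math


def embed_bag(bag):
    """Stand-in for miFV: mean-pool a bag of instance vectors into one vector."""
    dim = len(bag[0])
    return [sum(inst[d] for inst in bag) / len(bag) for d in range(dim)]


def rbf(u, v, gamma=1.0):
    """Gaussian kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def source_weights(f_src, f_tgt, gamma=1.0):
    """Stand-in for solving (2): weight each source vector by its mean kernel
    similarity to the target vectors, rescaled so the weights average to 1."""
    raw = [sum(rbf(fs, ft, gamma) for ft in f_tgt) / len(f_tgt) for fs in f_src]
    mean_raw = (sum(raw) / len(raw)) or 1.0  # guard against all-zero similarities
    return [r / mean_raw for r in raw]


def tr_miml(d_target, d_source, gamma=1.0):
    """Follow the control flow of Algorithm 1; return a scorer and the weights."""
    f_src = [embed_bag(bag) for bag, _ in d_source]  # steps 1-3
    f_tgt = [embed_bag(bag) for bag, _ in d_target]  # steps 4-6
    beta = source_weights(f_src, f_tgt, gamma)       # step 7 (stand-in)
    # Step 8 (stand-in): pool target examples (weight 1) with re-weighted
    # source examples (weight beta_i) into one weighted training set.
    train = [(f, y, 1.0) for f, (_, y) in zip(f_tgt, d_target)]
    train += [(f, y, b) for f, (_, y), b in zip(f_src, d_source, beta)]
    n_labels = len(train[0][1])

    def h_hat(bag):
        """Score each label by a weighted kernel vote over the training set."""
        x = embed_bag(bag)
        votes = [w * rbf(x, f, gamma) for f, _, w in train]
        total = sum(votes) or 1.0  # guard against a bag far from all training data
        return [sum(v * y[l] for v, (_, y, _) in zip(votes, train)) / total
                for l in range(n_labels)]

    return h_hat, beta


# Toy data (hypothetical): two source bags, one labelled target bag, two labels.
d_source = [([[0.0, 0.0], [0.2, 0.0]], [1, 0]),
            ([[5.0, 5.0]], [0, 1])]
d_target = [([[0.1, 0.1]], [1, 0])]
h_hat, beta = tr_miml(d_target, d_source)
scores = h_hat([[0.0, 0.1]])
print(beta, scores)
```

In this toy run the source bag that lies close to the target bag receives the larger weight, which is the qualitative behaviour a re-weighting stage is meant to produce; the actual weights and classifier in the paper come from the optimisation problems (2) and (5).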

  • Table 1   Experimental dataset statistics

    | Species | Proteins | GO terms | Domains per protein (Mean$\pm$std.) | GO terms per protein (Mean$\pm$std.) |
    |---|---|---|---|---|
    | Geobacter sulfurreducens | 379 | 320 | 3.20$\pm$1.21 | 3.14$\pm$3.33 |
    | Azotobacter vinelandii | 407 | 340 | 3.07$\pm$1.16 | 4.00$\pm$6.97 |
    | Mus musculus | 11676 | 3065 | 2.76$\pm$1.84 | 44.64$\pm$50.27 |
    | Rattus norvegicus | 5991 | 2600 | 2.53$\pm$1.71 | 39.51$\pm$44.78 |
    | Homo sapiens | 13773 | 3311 | 2.98$\pm$4.30 | 55.81$\pm$126.63 |
    | Arabidopsis thaliana | 8986 | 1811 | 2.02$\pm$1.46 | 27.68$\pm$70.37 |
    | Saccharomyces cerevisiae | 3509 | 1566 | 1.86$\pm$1.36 | 15.89$\pm$11.52 |

  • Table 2   Performance comparison on Geobacter sulfurreducens (target domain) by multiple multi-instance multi-label learning methods with or without transfer learning. Results demonstrate that transfer learning can improve the performance of protein function prediction on this task

    | Source domain | Method | AP ($\uparrow$) | CV ($\downarrow$) | HL ($\downarrow$) | OE ($\downarrow$) | RL ($\downarrow$) |
    |---|---|---|---|---|---|---|
    | Mus musculus | TR-MIMLfast | 0.58$\pm$0.02 | 4.04$\pm$0.07 | 0.15$\pm$0.00 | 0.55$\pm$0.04 | 0.32$\pm$0.01 |
    | Mus musculus | MIMLfast | 0.44$\pm$0.02 | 4.76$\pm$0.03 | 0.20$\pm$0.01 | 0.71$\pm$0.03 | 0.43$\pm$0.01 |
    | Mus musculus | TR-MIMLNN | 0.56$\pm$0.01 | 4.28$\pm$0.20 | 0.21$\pm$0.01 | 0.58$\pm$0.02 | 0.36$\pm$0.03 |
    | Mus musculus | MIMLNN | 0.54$\pm$0.01 | 4.51$\pm$0.19 | 0.24$\pm$0.01 | 0.58$\pm$0.01 | 0.37$\pm$0.01 |
    | Mus musculus | TR-MIMLSVM | 0.53$\pm$0.02 | 4.16$\pm$0.02 | 0.18$\pm$0.0 | 0.61$\pm$0.01 | 0.37$\pm$0.09 |
    | Mus musculus | MIMLSVM | 0.44$\pm$0.01 | 4.62$\pm$0.05 | 0.19$\pm$0.02 | 0.67$\pm$0.01 | 0.40$\pm$0.01 |
    | Rattus norvegicus | TR-MIMLfast | 0.56$\pm$0.02 | 4.02$\pm$0.17 | 0.16$\pm$0.00 | 0.58$\pm$0.03 | 0.33$\pm$0.02 |
    | Rattus norvegicus | MIMLfast | 0.43$\pm$0.05 | 5.22$\pm$0.07 | 0.23$\pm$0.03 | 0.75$\pm$0.08 | 0.42$\pm$0.06 |
    | Rattus norvegicus | TR-MIMLNN | 0.53$\pm$0.01 | 4.11$\pm$0.06 | 0.16$\pm$0.02 | 0.58$\pm$0.01 | 0.38$\pm$0.01 |
    | Rattus norvegicus | MIMLNN | 0.48$\pm$0.01 | 4.74$\pm$0.09 | 0.19$\pm$0.00 | 0.66$\pm$0.03 | 0.41$\pm$0.01 |
    | Rattus norvegicus | TR-MIMLSVM | 0.53$\pm$0.03 | 4.29$\pm$0.03 | 0.17$\pm$0.01 | 0.60$\pm$0.03 | 0.39$\pm$0.01 |
    | Rattus norvegicus | MIMLSVM | 0.52$\pm$0.01 | 5.10$\pm$0.08 | 0.17$\pm$0.01 | 0.66$\pm$0.01 | 0.35$\pm$0.02 |
    | Saccharomyces cerevisiae | TR-MIMLfast | 0.53$\pm$0.04 | 4.42$\pm$0.14 | 0.17$\pm$0.00 | 0.62$\pm$0.05 | 0.35$\pm$0.01 |
    | Saccharomyces cerevisiae | MIMLfast | 0.50$\pm$0.02 | 4.50$\pm$0.14 | 0.22$\pm$0.02 | 0.66$\pm$0.04 | 0.35$\pm$0.02 |
    | Saccharomyces cerevisiae | TR-MIMLNN | 0.53$\pm$0.01 | 4.58$\pm$0.09 | 0.14$\pm$0.04 | 0.61$\pm$0.09 | 0.37$\pm$0.01 |
    | Saccharomyces cerevisiae | MIMLNN | 0.52$\pm$0.02 | 4.72$\pm$0.11 | 0.16$\pm$0.01 | 0.62$\pm$0.02 | 0.41$\pm$0.02 |
    | Saccharomyces cerevisiae | TR-MIMLSVM | 0.54$\pm$0.02 | 4.61$\pm$0.07 | 0.17$\pm$0.01 | 0.57$\pm$0.01 | 0.40$\pm$0.03 |
    | Saccharomyces cerevisiae | MIMLSVM | 0.53$\pm$0.01 | 4.72$\pm$0.04 | 0.18$\pm$0.01 | 0.60$\pm$0.01 | 0.43$\pm$0.01 |

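    a) Bold indicates that the result with transfer learning is significantly better than the result without transfer learning (paired-sample t-test at the 95% confidence level).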

  • Table 3   Performance comparison on Azotobacter vinelandii (target domain) by multiple multi-instance multi-label learning methods with or without transfer learning. Results demonstrate that transfer learning can improve the performance of protein function prediction on this task

    | Source domain | Method | AP ($\uparrow$) | CV ($\downarrow$) | HL ($\downarrow$) | OE ($\downarrow$) | RL ($\downarrow$) |
    |---|---|---|---|---|---|---|
    | Mus musculus | TR-MIMLfast | 0.55$\pm$0.00 | 4.30$\pm$0.40 | 0.15$\pm$0.00 | 0.58$\pm$0.01 | 0.34$\pm$0.02 |
    | Mus musculus | MIMLfast | 0.49$\pm$0.03 | 4.55$\pm$0.15 | 0.21$\pm$0.02 | 0.69$\pm$0.05 | 0.38$\pm$0.02 |
    | Mus musculus | TR-MIMLNN | 0.52$\pm$0.00 | 4.35$\pm$0.0 | 0.22$\pm$0.03 | 0.68$\pm$0.02 | 0.39$\pm$0.01 |
    | Mus musculus | MIMLNN | 0.48$\pm$0.01 | 4.79$\pm$0.03 | 0.27$\pm$0.01 | 0.65$\pm$0.00 | 0.41$\pm$0.00 |
    | Mus musculus | TR-MIMLSVM | 0.50$\pm$0.0 | 4.55$\pm$0.12 | 0.22$\pm$0.04 | 0.63$\pm$0.01 | 0.41$\pm$0.02 |
    | Mus musculus | MIMLSVM | 0.47$\pm$0.01 | 4.67$\pm$0.12 | 0.28$\pm$0.01 | 0.64$\pm$0.02 | 0.44$\pm$0.02 |
    | Rattus norvegicus | TR-MIMLfast | 0.54$\pm$0.02 | 4.64$\pm$0.23 | 0.17$\pm$0.00 | 0.64$\pm$0.02 | 0.39$\pm$0.04 |
    | Rattus norvegicus | MIMLfast | 0.49$\pm$0.02 | 5.03$\pm$0.25 | 0.25$\pm$0.00 | 0.67$\pm$0.03 | 0.40$\pm$0.02 |
    | Rattus norvegicus | TR-MIMLNN | 0.50$\pm$0.03 | 4.93$\pm$0.20 | 0.20$\pm$0.01 | 0.66$\pm$0.02 | 0.40$\pm$0.01 |
    | Rattus norvegicus | MIMLNN | 0.46$\pm$0.01 | 5.36$\pm$0.13 | 0.23$\pm$0.01 | 0.71$\pm$0.02 | 0.46$\pm$0.02 |
    | Rattus norvegicus | TR-MIMLSVM | 0.52$\pm$0.01 | 4.73$\pm$0.05 | 0.21$\pm$0.02 | 0.65$\pm$0.01 | 0.42$\pm$0.01 |
    | Rattus norvegicus | MIMLSVM | 0.49$\pm$0.00 | 4.82$\pm$0.03 | 0.25$\pm$0.01 | 0.70$\pm$0.01 | 0.46$\pm$0.01 |
    | Saccharomyces cerevisiae | TR-MIMLfast | 0.62$\pm$0.02 | 4.52$\pm$0.10 | 0.18$\pm$0.00 | 0.53$\pm$0.02 | 0.37$\pm$0.02 |
    | Saccharomyces cerevisiae | MIMLfast | 0.55$\pm$0.03 | 4.87$\pm$0.28 | 0.24$\pm$0.02 | 0.60$\pm$0.03 | 0.35$\pm$0.02 |
    | Saccharomyces cerevisiae | TR-MIMLNN | 0.59$\pm$0.01 | 5.02$\pm$0.10 | 0.19$\pm$0.04 | 0.60$\pm$0.01 | 0.43$\pm$0.02 |
    | Saccharomyces cerevisiae | MIMLNN | 0.51$\pm$0.00 | 5.49$\pm$0.10 | 0.19$\pm$0.01 | 0.64$\pm$0.00 | 0.46$\pm$0.00 |
    | Saccharomyces cerevisiae | TR-MIMLSVM | 0.52$\pm$0.01 | 4.77$\pm$0.03 | 0.19$\pm$0.01 | 0.62$\pm$0.03 | 0.40$\pm$0.01 |
    | Saccharomyces cerevisiae | MIMLSVM | 0.49$\pm$0.01 | 4.75$\pm$0.00 | 0.20$\pm$0.01 | 0.66$\pm$0.01 | 0.46$\pm$0.02 |

    a) The best result on each evaluation metric is shown in bold.

  • Table 4   Performance comparison on Geobacter sulfurreducens (target domain) by AdaBoost and Logistic Regression (LR) learning methods with or without transfer learning. Results demonstrate that transfer learning can improve the performance of protein function prediction on this task

    | Source domain | Method | AP ($\uparrow$) | CV ($\downarrow$) | HL ($\downarrow$) | OE ($\downarrow$) | RL ($\downarrow$) |
    |---|---|---|---|---|---|---|
    | Mus musculus | TrAdaBoost | 0.58$\pm$0.01 | 4.10$\pm$0.03 | 0.23$\pm$0.01 | 0.61$\pm$0.01 | 0.33$\pm$0.01 |
    | Mus musculus | AdaBoost | 0.47$\pm$0.01 | 4.51$\pm$0.04 | 0.28$\pm$0.01 | 0.69$\pm$0.01 | 0.41$\pm$0.01 |
    | Mus musculus | DALR | 0.44$\pm$0.02 | 4.24$\pm$0.02 | 0.26$\pm$0.00 | 0.66$\pm$0.00 | 0.34$\pm$0.02 |
    | Mus musculus | LR | 0.35$\pm$0.03 | 4.32$\pm$0.05 | 0.27$\pm$0.02 | 0.82$\pm$0.01 | 0.45$\pm$0.02 |
    | Rattus norvegicus | TrAdaBoost | 0.45$\pm$0.03 | 4.23$\pm$0.02 | 0.27$\pm$0.01 | 0.65$\pm$0.02 | 0.33$\pm$0.01 |
    | Rattus norvegicus | AdaBoost | 0.36$\pm$0.02 | 4.31$\pm$0.03 | 0.28$\pm$0.03 | 0.83$\pm$0.03 | 0.45$\pm$0.02 |
    | Rattus norvegicus | DALR | 0.45$\pm$0.01 | 4.11$\pm$0.03 | 0.9$\pm$0.01 | 0.78$\pm$0.01 | 0.46$\pm$0.04 |
    | Rattus norvegicus | LR | 0.32$\pm$0.01 | 4.41$\pm$0.05 | 0.30$\pm$0.01 | 0.88$\pm$0.00 | 0.54$\pm$0.01 |
    | Saccharomyces cerevisiae | TrAdaBoost | 0.44$\pm$0.01 | 4.11$\pm$0.02 | 0.28$\pm$0.01 | 0.77$\pm$0.01 | 0.46$\pm$0.04 |
    | Saccharomyces cerevisiae | AdaBoost | 0.32$\pm$0.01 | 4.41$\pm$0.05 | 0.30$\pm$0.01 | 0.88$\pm$0.01 | 0.54$\pm$0.01 |
    | Saccharomyces cerevisiae | DALR | 0.56$\pm$0.08 | 4.35$\pm$0.01 | 0.28$\pm$0.02 | 0.61$\pm$0.00 | 0.32$\pm$0.01 |
    | Saccharomyces cerevisiae | LR | 0.47$\pm$0.02 | 4.48$\pm$0.00 | 0.33$\pm$0.01 | 0.67$\pm$0.01 | 0.34$\pm$0.00 |

    a) The best result on each evaluation metric is shown in bold.

  • Table 5   Performance comparison on Azotobacter vinelandii (target domain) by AdaBoost and Logistic Regression (LR) learning methods with or without transfer learning. Results demonstrate that transfer learning can improve the performance of protein function prediction on this task

    | Source domain | Method | AP ($\uparrow$) | CV ($\downarrow$) | HL ($\downarrow$) | OE ($\downarrow$) | RL ($\downarrow$) |
    |---|---|---|---|---|---|---|
    | Mus musculus | TrAdaBoost | 0.52$\pm$0.00 | 4.07$\pm$0.01 | 0.28$\pm$0.00 | 0.55$\pm$0.00 | 0.40$\pm$0.00 |
    | Mus musculus | AdaBoost | 0.50$\pm$0.00 | 4.77$\pm$0.03 | 0.36$\pm$0.00 | 0.69$\pm$0.00 | 0.42$\pm$0.00 |
    | Mus musculus | DALR | 0.53$\pm$0.00 | 4.47$\pm$0.02 | 0.26$\pm$0.01 | 0.55$\pm$0.00 | 0.38$\pm$0.00 |
    | Mus musculus | LR | 0.48$\pm$0.01 | 4.79$\pm$0.05 | 0.31$\pm$0.02 | 0.67$\pm$0.03 | 0.41$\pm$0.02 |
    | Rattus norvegicus | TrAdaBoost | 0.57$\pm$0.03 | 4.25$\pm$0.00 | 0.27$\pm$0.01 | 0.62$\pm$0.01 | 0.33$\pm$0.01 |
    | Rattus norvegicus | AdaBoost | 0.46$\pm$0.02 | 4.49$\pm$0.01 | 0.34$\pm$0.02 | 0.66$\pm$0.02 | 0.35$\pm$0.02 |
    | Rattus norvegicus | DALR | 0.53$\pm$0.01 | 4.15$\pm$0.01 | 0.17$\pm$0.00 | 0.60$\pm$0.01 | 0.36$\pm$0.09 |
    | Rattus norvegicus | LR | 0.44$\pm$0.00 | 4.61$\pm$0.05 | 0.19$\pm$0.01 | 0.67$\pm$0.00 | 0.39$\pm$0.00 |
    | Saccharomyces cerevisiae | TrAdaBoost | 0.49$\pm$0.01 | 4.43$\pm$0.01 | 0.30$\pm$0.00 | 0.60$\pm$0.01 | 0.41$\pm$0.04 |
    | Saccharomyces cerevisiae | AdaBoost | 0.45$\pm$0.01 | 4.69$\pm$0.09 | 0.37$\pm$0.01 | 0.66$\pm$0.01 | 0.48$\pm$0.01 |
    | Saccharomyces cerevisiae | DALR | 0.44$\pm$0.02 | 4.24$\pm$0.02 | 0.26$\pm$0.00 | 0.66$\pm$0.00 | 0.34$\pm$0.01 |
    | Saccharomyces cerevisiae | LR | 0.35$\pm$0.01 | 4.33$\pm$0.07 | 0.28$\pm$0.00 | 0.84$\pm$0.01 | 0.44$\pm$0.01 |

    a) The best result on each evaluation metric is shown in bold.
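
Tables 2 to 5 report five criteria that are common in the multi-label literature. The abbreviations are read here as average precision (AP), coverage (CV), Hamming loss (HL), one-error (OE), and ranking loss (RL), with the arrows giving the direction of improvement. The sketch below computes them under their usual textbook definitions, which may differ in detail from the evaluation code behind the tables. The function names, the 0.5 threshold used for Hamming loss, and the toy scores are assumptions for illustration only and are not taken from the paper.

```python
def ranks(scores):
    """Rank labels by descending score; the top-scoring label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    rank = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        rank[label] = position
    return rank


def evaluate(score_rows, label_rows, threshold=0.5):
    """Return AP, CV, HL, OE and RL averaged over examples.

    Assumes every example has at least one relevant and one irrelevant label.
    """
    n = len(score_rows)
    ap = cv = hl = oe = rl = 0.0
    for scores, labels in zip(score_rows, label_rows):
        rank = ranks(scores)
        rel = [l for l, y in enumerate(labels) if y == 1]
        irr = [l for l, y in enumerate(labels) if y == 0]
        # Average precision: precision measured at the rank of each relevant label.
        ap += sum(sum(1 for k in rel if rank[k] <= rank[l]) / rank[l]
                  for l in rel) / len(rel)
        # Coverage: steps down the ranking needed to reach every relevant label.
        cv += max(rank[l] for l in rel) - 1
        # Hamming loss: fraction of labels whose thresholded prediction is wrong.
        hl += sum((s >= threshold) != (y == 1)
                  for s, y in zip(scores, labels)) / len(labels)
        # One-error: 1 when the top-ranked label is not a relevant label.
        oe += 0 if labels[rank.index(1)] == 1 else 1
        # Ranking loss: fraction of (relevant, irrelevant) pairs ordered wrongly.
        rl += sum(1 for a in rel for b in irr
                  if scores[a] <= scores[b]) / (len(rel) * len(irr))
    return {"AP": ap / n, "CV": cv / n, "HL": hl / n, "OE": oe / n, "RL": rl / n}


# Toy example (hypothetical scores for two proteins over four labels).
toy_scores = [[0.9, 0.2, 0.6, 0.1],
              [0.3, 0.8, 0.4, 0.1]]
toy_labels = [[1, 0, 0, 1],
              [0, 1, 0, 0]]
metrics = evaluate(toy_scores, toy_labels)
print(metrics)
```

Only AP rewards larger values; the other four are losses, which is why a method can lead on AP while trailing on CV or RL in the same row of a table.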
