SCIENCE CHINA Life Sciences, Volume 62, Issue 4: 579-593 (2019) https://doi.org/10.1007/s11427-019-9482-0

Origination and evolution of orphan genes and de novo genes in the genome of Caenorhabditis elegans

  • Received Jan 11, 2019
  • Accepted Jan 22, 2019
  • Published Mar 21, 2019


Orphan genes, which lack detectable homologues in other lineages, could contribute to a variety of biological functions. However, how they originate and function remains largely unknown. Here, using a comprehensive and systematic computational pipeline, we identified 893 orphan genes in the C. elegans lineage, of which only a small fraction (0.9%) were derived from transposable elements. We also identified six new protein-coding genes that originated de novo from non-coding DNA sequences in the C. elegans genome. The authenticity and functionality of these orphan genes and de novo genes are supported by three lines of evidence: transcriptional data, in silico proteomic data, and fixation status in wild populations. Orphan genes and de novo genes have simple gene structures, with short protein length and fewer exons, and are frequently X-linked. RNA-seq analysis showed that expression of the orphan genes is enriched in embryonic development and in the gonad, and gene ontology enrichment analysis further supported a potential function in early development. The de novo genes were significantly expressed in the gonad, and functional enrichment analysis of their co-expressed genes suggested that they may be involved in signal transduction pathways and metabolic processes. Our results present the first systematic evidence on the evolution of orphan genes and the de novo origin of genes in nematodes, and on their impact on functional and phenotypic evolution, and could thus shed new light on the importance of these new genes.

Funded by

National Natural Science Foundation of China (31600670, to W.Y. Zhang)

EEgrid cluster of the University of Chicago (computing support)

National Science Foundation (NSF1051826)


We are grateful to Li Zhang for helpful suggestions on the de novo gene identification analysis. Computing was supported by the EEgrid cluster of the University of Chicago. This work was supported by the National Natural Science Foundation of China (31600670 to W. Zhang, 31670851 to B. Shen).

Interest statement

The author(s) declare that they have no conflict of interest.



Figure S1  Distribution of protein-coding potential assessment scores computed with CPAT. The red dashed line marks the score threshold of 0.4 used to distinguish protein-coding genes from non-coding genes, as indicated in the original methodology reference.

Table S1  Detailed information about the identified orphan genes and de novo genes

Table S2  Geographic distribution of 40 C. elegans wild strains

Table S3  Summary of the genetic variation in wild strains for both orphan genes and de novo genes

Table S4  Training dataset used in the protein-coding potential assessment procedure with CPAT
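The thresholding described in the Figure S1 caption can be illustrated with a minimal Python sketch. Only the 0.4 cutoff comes from the caption; the function name, the sequence IDs, the example scores, and the choice to treat a score exactly at the threshold as coding are assumptions made for illustration, and the sketch does not call CPAT itself.

```python
# Cutoff stated in the Figure S1 caption.
CODING_THRESHOLD = 0.4


def classify_by_coding_probability(scores, threshold=CODING_THRESHOLD):
    """Label each sequence ID as 'coding' or 'non-coding'.

    scores: dict mapping a sequence ID to a coding-probability score in [0, 1].
    Assumption: a score at or above the threshold counts as coding.
    """
    return {
        seq_id: "coding" if score >= threshold else "non-coding"
        for seq_id, score in scores.items()
    }


# Made-up scores for three hypothetical sequences.
example_scores = {"seqA": 0.92, "seqB": 0.40, "seqC": 0.05}
labels = classify_by_coding_probability(example_scores)
print(labels)
```

In practice the scores would be read from the tool's output table rather than typed in by hand.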

The supporting information is available online at http://life.scichina.com and https://link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.



  • Figure 1

    Computational pipeline to identify orphan genes and de novo protein-coding genes in the genome of C. elegans. BLASTP searches of C. elegans proteins against sibling Caenorhabditis species and other distantly related species identified 893 orphan genes (proteins) in the C. elegans genome. After genes containing transposable elements were eliminated, the coding sequences of the remaining genes were used in BLAT searches to find orthologous sequences in closely related species that are non-translatable and share a common “disabler” in the sibling species; genes meeting these criteria were treated as putative de novo genes.

  • Figure 2

    Sequence comparisons of six potential de novo genes in C. elegans and their homologous non-coding DNA sequences from closely related species. In each panel, the upper layer shows part of the conserved syntenic block flanking the focal gene in the genomes of C. elegans, one in-group species and one out-group species. Only protein-coding genes are shown, and orthologous genes are indicated with double-headed arrows. Genes are indicated by solid rectangular boxes and non-coding regions by dashed rectangular boxes. The lower layer shows the alignment of the coding region of each de novo gene with its orthologous regions in the two related species. “*” marks sites conserved in all three species, and “.” marks sites conserved in two species. The common “disabler” that disrupts protein translatability in the two other species is marked with a red rectangular box. Splicing regions are marked with green dashed lines.

  • Figure 3

    Sequence features of C. elegans orphan genes and de novo genes. The figure compares the sequence features of orphan genes and de novo protein-coding genes with those of all genes in the C. elegans genome: protein length (A), fraction of low-complexity protein regions (B), number of exons (C), and average exon length (D). For genes with alternative isoforms, only the isoform with the longest protein was considered. All statistical analyses were done with the Wilcoxon signed-rank test.

  • Figure 4

    Origination patterns of C. elegans orphan genes and de novo genes. A, Comparison of the percentages of genes with alternative splicing (AS) isoforms for orphan genes, de novo genes and total genes. B, Distribution of de novo genes originating from intergenic regions and intronic regions. C, Comparison of the percentages of genes on each chromosome for de novo genes, orphan genes, and total genes. D, Comparison of the fractions of genes on autosomes and on the sex chromosome for de novo genes, orphan genes, and total genes.

  • Figure 5

    Gene expression patterns of C. elegans orphan genes. A, Fraction of orphan genes expressed at different developmental stages. B, Average normalized expression levels of orphan genes at different developmental stages. C, Fraction of orphan genes expressed in different tissues at the adult stage. D, Average normalized expression levels of orphan genes in different tissues at the adult stage.

  • Figure 6

    Gene expression patterns of C. elegans de novo genes. A, 3D plot of the expression levels of de novo genes at different developmental stages. B, 3D plot of the expression levels of de novo genes in different tissues at the adult stage.
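The first step of the pipeline in Figure 1, keeping proteins with no detectable homologue in other species, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the tuple layout of the hits, the gene IDs, and in particular the E-value cutoff of 1e-5 are hypothetical and are not taken from the paper.

```python
def find_orphan_candidates(all_proteins, blast_hits, max_evalue=1e-5):
    """Return protein IDs with no significant hit in any other species.

    all_proteins: iterable of query protein IDs.
    blast_hits: iterable of (query_id, subject_species, evalue) tuples,
        for example parsed from tabular BLASTP output.
    max_evalue: hypothetical significance cutoff (an assumption, not the
        value used in the paper).
    """
    # A query counts as having a homologue if any hit passes the cutoff.
    with_homologue = {query for query, _species, evalue in blast_hits
                      if evalue <= max_evalue}
    return sorted(set(all_proteins) - with_homologue)


# Toy input: geneA has a strong hit, geneB only a weak one, geneC none.
proteins = ["geneA", "geneB", "geneC"]
hits = [
    ("geneA", "C. briggsae", 1e-30),
    ("geneB", "C. remanei", 0.5),
]
orphans = find_orphan_candidates(proteins, hits)
print(orphans)
```

The later steps described in the caption (removing genes that contain transposable elements and searching for non-translatable orthologous sequences with BLAT) would operate on the list this step returns.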

  • Table 1   Enriched GO terms for orphan genes

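    GO category | GO term | P value | P value after correction (Benjamini)
    Biological process (BP) | Defense response to fungus | |
    Biological process (BP) | Response to fungus | |
    Biological process (BP) | Embryo development | |
    Molecular function (MF) | DNA binding | |
    Molecular function (MF) | Nucleic acid binding | |
    Molecular function (MF) | Heterocyclic compound binding | |
    Molecular function (MF) | Organic cyclic compound binding | |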



  • Table 2   Enriched GO terms for the co-expressed genes of de novo genes

    Number of co-expressed genes

    Top 3 enriched GO MF (molecular function) terms for co-expressed genes

    Carbohydrate binding (P=1.4×10^-18)
    G-protein coupled olfactory receptor activity (P=3.5×10^-15)
    G-protein coupled peptide receptor activity (P=3.1×10^-7)

    Carbohydrate binding (P=1.1×10^-12)
    Olfactory receptor activity (P=9.9×10^-7)
    G-protein coupled olfactory receptor activity (P=9.7×10^-6)

    Transmembrane receptor activity (P=1.2×10^-34)
    Transmembrane signaling receptor activity (P=3.3×10^-34)
    Receptor activity (P=4.6×10^-33)

    Transmembrane signaling receptor activity (P=6.0×10^-75)
    Signaling receptor activity (P=4.4×10^-74)
    Transmembrane receptor activity (P=6.5×10^-74)

    UDP-glycosyltransferase activity (P=1.7×10^-4)
    Transferase activity, transferring hexosyl groups (P=2.7×10^-4)
    Galactosylgalactosylxylosylprotein 3-beta-glucuronosyltransferase activity (P=3.2×10^-4)

    ATP binding (P=4.4×10^-2)
    Adenyl ribonucleotide binding (P=4.4×10^-2)
    Adenyl nucleotide binding (P=4.4×10^-2)
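Table 1 reports P values "after correction (Benjamini)". Assuming this refers to the Benjamini-Hochberg false discovery rate adjustment, the correction can be sketched in Python as below. The function name and the input values are made up for illustration; they are not the P values from the paper, and the enrichment tool the authors used may differ in details.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by n / rank (rank 1 = smallest P value), and the
    adjusted values are forced to be non-decreasing with rank and capped at 1.
    """
    n = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw P values for four hypothetical GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

A term would then be called enriched when its adjusted value falls below the chosen significance level.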
