SCIENCE CHINA Life Sciences, https://doi.org/10.1007/s11427-019-9551-7

Building a sequence map of the pig pan-genome from multiple de novo assemblies and Hi-C data

  • Received: Jan 7, 2019
  • Accepted: Apr 3, 2019
  • Published: Jul 8, 2019

Abstract

Pigs were domesticated independently in the Near East and China, indicating that a single reference genome from one individual cannot represent the full spectrum of divergent sequences in pigs worldwide. Therefore, 12 de novo pig assemblies from Eurasia were compared in this study to identify sequences missing from the reference genome. As a result, 72.5 Mb of non-redundant sequences (~3% of the genome) were found to be absent from the reference genome (Sscrofa11.1) and were defined as pan-sequences. Of the pan-sequences, 9.0 Mb were dominant in Chinese pigs but occurred at low frequency in European pigs. One sequence dominant in Chinese pigs contained the complete genic region of tazarotene-induced gene 3 (TIG3), which is involved in fatty acid metabolism. Using flanking-sequence-based and Hi-C-based methods, 27.7% of the sequences could be anchored to the reference genome. Supplementing the reference with these sequences could contribute to the accurate interpretation of the 3D chromatin structure. A web-based pan-genome database was further provided to serve as a primary resource for exploring genetic diversity and to promote pig breeding and biomedical research.
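The contrast between pan-sequences that are dominant in Chinese pigs and rare in European pigs can be illustrated with a small presence/absence calculation. The following is a minimal sketch, not the authors' pipeline: the sample names, the presence/absence calls, and the 0.7/0.3 frequency cutoffs are all hypothetical, since the abstract does not state how dominance was defined.

```python
from typing import Dict, List


def group_frequency(calls: Dict[str, int], samples: List[str]) -> float:
    """Fraction of the given samples that carry the pan-sequence (1 = present)."""
    return sum(calls[s] for s in samples) / len(samples)


def is_population_dominant(
    calls: Dict[str, int],
    dominant_group: List[str],
    other_group: List[str],
    high: float = 0.7,  # hypothetical cutoff, not taken from the paper
    low: float = 0.3,   # hypothetical cutoff, not taken from the paper
) -> bool:
    """True if the sequence is frequent in one group and rare in the other."""
    return (
        group_frequency(calls, dominant_group) >= high
        and group_frequency(calls, other_group) <= low
    )


# Toy data: hypothetical sample identifiers and calls for one pan-sequence.
chinese = ["CN1", "CN2", "CN3", "CN4", "CN5"]
european = ["EU1", "EU2", "EU3", "EU4", "EU5"]
calls = {
    "CN1": 1, "CN2": 1, "CN3": 1, "CN4": 1, "CN5": 0,
    "EU1": 0, "EU2": 0, "EU3": 1, "EU4": 0, "EU5": 0,
}

print(group_frequency(calls, chinese))
print(group_frequency(calls, european))
print(is_population_dominant(calls, chinese, european))
```

In a real analysis the calls would come from resequencing read depth over each pan-sequence, and the cutoffs would need to be justified against the data.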


Funded by

The National Natural Science Foundation of China (31822052, 31572381)


Acknowledgment

This work was supported by the National Natural Science Foundation of China (31822052 and 31572381) to Y.J. and the Science & Technology Support Program of Sichuan (2016NYZ0042 and 2017NZDZX0002) to M.Z.L. We thank the High Performance Computing platform of Northwest A&F University for assistance with computing.


Interest statement

The author(s) declare that they have no conflict of interest.


Supplement

SUPPORTING INFORMATION

Figure S1 Comparison of contig N50 among pig, human and other animal reference genomes.

Figure S2 Geographic distributions of the original pig breeds collected in this study.

Figure S3 The expression of TIG3 in subcutaneous adipose tissue (light red background) and other tissues (light blue background) of pigs harboring this gene.

Figure S4 Protein alignment of TIG3 in mammals (pig, dog, panda and human), chicken, alligator and zebrafish.

Figure S5 Selection test for TIG3 in pig and other species.

Figure S6 One pan-sequence covers partial genic regions of ZNF622, representing a new splicing event.

Figure S7 SNP density in TAD boundary (blue) and TAD internal (red) region in five samples digested by MboI enzyme.

Figure S8 Schematic diagram showing our strategy for identifying putative enhancers in pan-sequences.

Figure S9 An IGV view of Illumina reads at two example regions in chr2 and chr1.

Figure S10 Comparison of RNA-seq read mapping quality using the pan-genome versus Sscrofa11.1.

Figure S11 Comparison of RNA-seq read mapping rate using the pan-genome versus Sscrofa11.1.

Figure S12 Transcriptional potential of the pan-sequences.

Figure S13 Correlation coefficient of Hi-C data from different samples.

Table S1 Sample information of Hi-C data

Table S2 Detailed statistics of assemblies used in pig pan-genome construction

Table S3 Enriched KEGG functional classes among genes annotated in pan-sequences

Table S4 Summary statistics of whole genome resequencing data in this study

Table S5 The presence and absence of pan-sequences in 87 resequencing samples

Table S6 The frequency distribution of population-specific pan-sequences in Chinese pigs and European pigs

Table S7 Summary statistics of the Hi-C data used in our experiment

Table S8 The anchored locations of pan-sequences on Sscrofa11.1 by flanking-sequence-based and Hi-C-based methods

Table S9 List of pan-sequences that showed interaction with a known promoter, identified using Hi-C analysis

Table S10 Enriched KEGG functional classes among genes that might be regulated by the putative enhancers of pan-sequences

Table S11 Summary of adjusted SNPs after addition of pan-sequences

Supplementary Dataset 1 The sequences of pig pan-genome

Supplementary Dataset 2 The male-specific pan-sequences

Supplementary Dataset 3 The annotation of pan-sequences

Supplementary Dataset 4 The copy number variation dataset of pig pan-genome

The supporting information is available online at http://life.scichina.com and https://link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.



  • Figure 1

    Construction of the pig pan-genome and the characterization of pan-sequences. A, Maximum likelihood phylogenetic tree, sequence length, GC content and repeat composition of missing sequences identified in each individual assembly of 11 breeds (left to right). B, The total sequence length and breed-specific sequence length of each breed for non-redundant pan-sequences. C, Length distribution of all pan-sequences. (The Wuzhishan pig had the largest number of sequences because this individual is the only male among the 11 assemblies and was sequenced on a different platform from the other samples.)

  • Figure 2

    Validation of pan-sequences and their population-specific patterns. A, Homologue identification of pan-sequences in 10 mammalian genomes. Only the best hit was retained for each pan-sequence. B, An 87×87 matrix in which each cell gives the number of pan-sequences shared by a pair of individuals. See Table S3 in Supporting Information for the classification of each group. C, Genes contained in the pan-sequences. One pan-sequence of 14.3 kb harbours the complete genic region of TIG3. The four tracks at the bottom represent the read mapping of whole-genome resequencing data of two samples (labelled “1” and “2”) and the inferred exons as well as their splicing isoforms based on RNA-seq (labelled “3” and “4”). D, The expression of TIG3 in 92 RNA-seq samples from 10 animals from China (light red background) and Europe (light blue background). The 10 animals correspond to 10 of the 11 assemblies used in this study, excluding the Wuzhishan assembly. E, The normalized read depth (NRD) of male-specific pan-sequences in each male. See Table S3 in Supporting Information for the classification of each group. (Only sequences with a frequency between 0.5 and 0.9 are shown.)

  • Figure 3

    The 3D spatial structure of the pan-genome. A, The distributions of the A/B compartment, TAD and anchored pan-sequences. B, The relative length-proportion of the A compartment over the B compartment in the pig genome (left) and the relative length-proportion of pan-sequences located in the A compartment over those located in the B compartment (right). C, The relative length-proportion of TAD boundary regions over TAD interior regions in Sscrofa11.1 (left) and the relative length-proportion of pan-sequences located in TAD boundary regions over TAD interior regions (right). D, An example of improving a 3D spatial structure after replacing the weakly interacting sequences with the non-reference pan-sequences. The interaction of pan-sequences with flanking sequences was supported by more read contacts than the original interaction of the counterparts in the genome with the flanking sequences.

  • Figure 4

    Improvements of genomic analyses by using the pan-genome. A, Comparison of the mapping ratio of resequencing data using the pan-genome versus Sscrofa11.1. B, Comparison of read-mapping quality using the pan-genome versus Sscrofa11.1. C, Comparison of corrected read-mapping depth using the pan-genome versus Sscrofa11.1. D, Improved read mapping using the pan-genome versus Sscrofa11.1 as viewed with IGV.

  • Figure 5

    The processing pipeline used to construct the PIGPAN database. PIGPAN integrated genomics, transcriptomics and regulatory data. Users can search for a gene symbol or a genomic region to obtain results in the form of an interactive table and graph. A, The system diagram of PIGPAN. B, The 17 tracks released against the pig pan-genome in our local UCSC Genome Browser server. C, One case showing the copy number difference of the KIT gene between European and Chinese pigs by using PIGPAN.
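The pairwise sharing matrix described for Figure 2B (each cell counts the pan-sequences present in both of two individuals) can be derived from a presence/absence table such as the one summarized in Table S5. The snippet below is a minimal sketch under stated assumptions: the three-sample, four-sequence table is made-up toy data, and the function name `shared_matrix` is hypothetical rather than anything from the paper.

```python
from typing import List


def shared_matrix(presence: List[List[int]]) -> List[List[int]]:
    """Count shared pan-sequences for every pair of individuals.

    presence[i][k] is 1 if individual i carries pan-sequence k, else 0.
    The result is a symmetric n x n matrix; the diagonal holds the number
    of pan-sequences present in each individual.
    """
    n = len(presence)
    return [
        [sum(a & b for a, b in zip(presence[i], presence[j])) for j in range(n)]
        for i in range(n)
    ]


# Toy presence/absence table: 3 individuals (rows) x 4 pan-sequences (columns).
presence = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

matrix = shared_matrix(presence)
for row in matrix:
    print(row)
```

With 87 resequenced individuals the same function would yield the 87×87 matrix; for large tables a matrix product of the presence table with its transpose gives the same counts more efficiently.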

Copyright 2019 Science China Press Co., Ltd. All rights reserved.