Dense and accurate whole-chromosome haplotyping of individual genomes

ARTICLE | OPEN | DOI: 10.1038/s41467-017-01389-4

David Porubsky1,8, Shilpa Garg2,3,4, Ashley D. Sanders5,6, Jan O. Korbel5, Victor Guryev1, Peter M. Lansdorp1,6,7 & Tobias Marschall2,3

The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.

1 European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Building 3226, 9713 AV Groningen, The Netherlands.
2 Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123 Saarbrücken, Germany.
3 Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, 66123 Saarbrücken, Germany.
4 Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, 66123 Saarbrücken, Germany.
5 European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany.
6 Terry Fox Laboratory, BC Cancer Agency, 601 West 10th Avenue, Vancouver, BC V5Z 1L3, Canada.
7 Department of Medical Genetics, University of British Columbia, 2350 Health Science Mall, Vancouver, BC V6T 1Z3, Canada.
8 Present address: Max Planck Institute for Informatics, Saarbrücken, Germany.

David Porubsky, Shilpa Garg and Ashley D. Sanders contributed equally to this work. Correspondence and requests for materials should be addressed to T.M.

NATURE COMMUNICATIONS | 8: 1293 | DOI: 10.1038/s41467-017-01389-4 | www.nature.com/naturecommunications

Human genomes are diploid and possess two copies of each chromosome, one paternal and one maternal. At the DNA sequence level, these two homologous copies differ at a number of loci along each chromosome. Such heterozygous variants include single nucleotide variants (SNVs), short indels, as well as larger structural variants such as deletions, duplications, or inversions that change the copy number or orientation of segments of the genome.

Discriminating and phasing alleles to their respective parental homolog is valuable in many areas of human genetics. For instance, resolving haplotype structure is required to track inheritance in human pedigrees and populations1, map regions of meiotic recombination2,3, identify variant-disease associations4, detect instances of compound heterozygosity, and study allele-specific events like DNA methylation or gene expression5. In particular, long-range haplotype information is needed to systematically study epistatic interactions between variants in enhancers and variants in their target genes or their promoters. This is critical as many variants that have been linked to traits in genome-wide association studies reside in (super) enhancers6 and enhancer-specific variants can show epistatic effects among one another7, as well as with their target genes that are beyond the reach of linkage disequilibrium8. To better understand these epistatic interactions, we must move beyond merely locating variant alleles and additionally study their functional relationships over long distances. Constructing genome-wide chromosome-length haplotypes is therefore the clear next step to build a more complete picture of genome architecture and function.

Currently, methods used to chart the unique variation of individual human genomes rely largely on second- and third-generation DNA sequencing and can include specialized experimental protocols9–13. Sequencing technologies sample the human genome in the form of relatively short molecules (reads), and every read that spans at least two heterozygous variants can essentially be considered a "mini haplotype". Such mini haplotypes can be assembled into longer haplotype segments through partially overlapping reads that span the same variable locus4. To this end, haplotype-informative reads need to be partitioned into two disjoint sets that represent the two haplotypes. This process, however, is complicated by errors in sequencing as well as genotyping. For these reasons, assembling haplotypes directly from sequencing data is computationally challenging, and the resulting optimization problems are provably hard14,15. Notwithstanding, a number of computational approaches for read-based phasing have recently been developed16 and, particularly, progress on fixed-parameter tractable algorithms has enabled solving read-based phasing in practice17–19, for instance through the implementations available in the software package WhatsHap20. Beyond phasing reads aligned to a reference genome, various approaches for haplotype-resolved de novo assembly have been explored21–25.

[Figure (panels a–d): schematic legend (homozygosity region, heterozygous alleles, centromere, gene, enhancer, unknown phase); comparison of read-based and experimental phasing by SNV density and cost/labor; % of covered benchmark SNVs, number of phased segments, SNVs in the largest segment and % of switch errors for PacBio only, 10xGen only, Illumina only and Strand-seq only; length of the longest haplotype in a chromosome 1 example: Illumina 15994 bp, PacBio 1711716 bp, 10xGen 8582136 bp, Strand-seq 248671482 bp.]

However, all approaches to reconstruct haplotypes from sequencing reads, be they reference-based or reference-free, come with the intrinsic limitation that the distance between subsequent heterozygous markers can be larger than the read length itself. While long-read sequencing (such as PacBio SMRT26 and Oxford Nanopore MinION27) or linked-read data (such as those provided by 10X Genomics28) help to mitigate this issue, these technologies fail to phase over longer stretches of homozygosity, repeat-rich areas including segmental duplications, and centromeres. Thus, specialized techniques that enable homologous chromosomes to be discriminated are required to physically (...truncated)
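The partitioning of haplotype-informative reads into two disjoint sets, as described in the text, can be illustrated with a toy example. The sketch below assumes the minimum error correction (MEC) formulation of read-based phasing and solves it by brute force over all candidate haplotypes. It is not the fixed-parameter tractable algorithm implemented in WhatsHap, and the function names (`mismatches`, `phase_by_mec`), the read encoding and the example reads are hypothetical choices made for illustration only.

```python
from itertools import product


def mismatches(read, hap):
    """Count variants at which a read's alleles disagree with a haplotype."""
    return sum(1 for pos, allele in read.items() if hap[pos] != allele)


def phase_by_mec(reads, n_variants):
    """Brute-force minimum error correction (MEC) over complementary haplotype pairs.

    reads: list of dicts mapping heterozygous-variant index -> allele (0 or 1).
    Returns (haplotype, cost, assignment); assignment[i] is 0 if read i is placed
    on `haplotype` and 1 if it is placed on its complement.
    Runtime is exponential in n_variants, so this is a toy only.
    """
    best = None
    # Fix the first allele to 0: swapping the two haplotypes gives the same solution.
    for tail in product((0, 1), repeat=n_variants - 1):
        hap = (0,) + tail
        comp = tuple(1 - a for a in hap)  # all sites are heterozygous
        cost = 0
        assignment = []
        for read in reads:
            d0, d1 = mismatches(read, hap), mismatches(read, comp)
            assignment.append(0 if d0 <= d1 else 1)
            cost += min(d0, d1)  # alleles that must be "corrected" for this read
        if best is None or cost < best[1]:
            best = (hap, cost, assignment)
    return best


# Hypothetical reads over four heterozygous variants (indices 0..3).
reads = [
    {0: 0, 1: 0},
    {1: 0, 2: 1},
    {2: 1, 3: 1},
    {0: 1, 1: 1},
    {1: 1, 2: 0, 3: 0},
    {1: 0, 2: 0, 3: 1},  # allele at variant 2 conflicts with the other reads
]
hap, cost, assignment = phase_by_mec(reads, n_variants=4)
print(hap, cost, assignment)  # (0, 0, 1, 1) 1 [0, 0, 0, 1, 1, 0]
```

In this example the overlapping reads chain all four variants into a single haplotype segment, and the one conflicting allele is absorbed as a single correction. If no read connected two neighboring variants, their relative phase would remain undetermined, which is the limitation the text attributes to marker distances exceeding the read length.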