Dense and accurate whole-chromosome haplotyping of individual genomes
ARTICLE
DOI: 10.1038/s41467-017-01389-4
OPEN
David Porubsky1,8, Shilpa Garg2,3,4, Ashley D. Sanders5,6, Jan O. Korbel5, Victor Guryev1,
Peter M. Lansdorp1,6,7 & Tobias Marschall2,3
The diploid nature of the human genome is neglected in many analyses done today, where a
genome is perceived as a set of unphased variants with respect to a reference genome. This
lack of haplotype-level analyses can be explained by a lack of methods that can produce
dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce
an integrative phasing strategy that combines global, but sparse haplotypes obtained from
strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive
guidance on the required sequencing depths and reliably assign more than 95% of alleles
(NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read
sequencing data. We conclude that the combination of Strand-seq with different technologies
represents an attractive solution to chart the genetic variation of diploid genomes.
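The integrative idea can be illustrated with a toy sketch: a sparse, chromosome-wide haplotype (as Strand-seq would provide) is used to orient dense but local phase blocks (as long or linked reads would provide). This is a simplified majority-vote illustration, not the method used in this work; the function name `orient_blocks`, the data layout and the example positions are all hypothetical.

```python
from typing import Dict, List


def orient_blocks(blocks: List[Dict[int, int]],
                  scaffold: Dict[int, int]) -> Dict[int, int]:
    """Merge locally phased blocks into one haplotype using a sparse scaffold.

    blocks   -- each block maps variant position -> allele (0/1) on the
                block's arbitrarily chosen first haplotype (hypothetical layout)
    scaffold -- sparse global haplotype: position -> allele (0/1) on H1
    """
    merged: Dict[int, int] = {}
    for block in blocks:
        # Variants seen both in the local block and in the global scaffold.
        shared = [pos for pos in block if pos in scaffold]
        agree = sum(block[pos] == scaffold[pos] for pos in shared)
        disagree = len(shared) - agree
        if agree == disagree:
            continue  # no overlap or a tie: the block cannot be placed
        flip = disagree > agree  # block is in the opposite orientation
        for pos, allele in block.items():
            merged[pos] = 1 - allele if flip else allele
    return merged


# Toy data (made up for illustration).
blocks = [
    {100: 0, 150: 1, 220: 0},   # already matches the scaffold
    {900: 1, 950: 1, 990: 0},   # opposite orientation, will be flipped
    {2000: 0, 2100: 1},         # no scaffold variant, stays unplaced
]
scaffold = {100: 0, 220: 0, 900: 0, 990: 1}
h1 = orient_blocks(blocks, scaffold)
print(h1)
```

In this toy case the second block is flipped and the third is dropped because no scaffold variant falls inside it, which mirrors why both data types are needed: density from the local blocks, chromosome-length connectivity from the scaffold.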
1 European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Building 3226, 9713 AV Groningen,
The Netherlands. 2 Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123 Saarbrücken, Germany. 3 Max Planck Institute for
Informatics, Saarland Informatics Campus E1.4, 66123 Saarbrücken, Germany. 4 Graduate School of Computer Science, Saarland University, Saarland
Informatics Campus E1.3, 66123 Saarbrücken, Germany. 5 European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstraße 1, 69117
Heidelberg, Germany. 6 Terry Fox Laboratory, BC Cancer Agency, 601 West 10th Avenue, Vancouver, BC V5Z 1L3, Canada. 7 Department of Medical
Genetics, University of British Columbia, 2350 Health Science Mall, Vancouver, BC V6T 1Z3, Canada. 8 Present address: Max Planck Institute for Informatics,
Saarbrücken, Germany. David Porubsky, Shilpa Garg and Ashley D. Sanders contributed equally to this work. Correspondence and requests for materials
should be addressed to T.M. (email: )
NATURE COMMUNICATIONS | 8: 1293 | DOI: 10.1038/s41467-017-01389-4 | www.nature.com/naturecommunications
Human genomes are diploid and possess two copies of each
chromosome: one paternal and one maternal copy. At
the DNA sequence level, these two homologous copies
differ at a number of loci along each chromosome. Such heterozygous variants include single nucleotide variants (SNVs),
short indels, as well as larger structural variants such as deletions,
duplications, or inversions that change the copy number or
orientation of segments of the genome. Discriminating and
phasing alleles to their respective parental homolog is valuable in
many areas of human genetics. For instance, resolving haplotype
structure is required to track inheritance in human pedigrees and
populations1, map regions of meiotic recombination2,3, identify
variant-disease associations4, detect instances of compound heterozygosity, and study allele-specific events like DNA methylation
or gene expression5. In particular, long-range haplotype information is needed to systematically study epistatic interactions
between variants in enhancers and variants in their target genes
or their promoters. This is critical as many variants that have
been linked to traits in genome-wide association studies reside in
(super) enhancers6 and enhancer-specific variants can show
epistatic effects among one another7, as well as with their target
genes that are beyond the reach of linkage disequilibrium8. To
better understand these epistatic interactions, we must move
beyond merely locating variant alleles and additionally study their
functional relationships over long distances. Constructing
genome-wide chromosome-length haplotypes is therefore the
clear next step to build a more complete picture of genome
architecture and function.
Currently, methods used to chart the unique variation of
individual human genomes rely largely on second- and third-generation DNA sequencing and can include specialized
experimental protocols9–13. Sequencing technologies sample the
human genome in the form of relatively short molecules (reads)
and every read that spans at least two heterozygous variants can
essentially be considered as a “mini haplotype” that can be
assembled into longer haplotype segments by partially overlapping reads spanning the same variable locus4. To this end,
haplotype-informative reads need to be partitioned into two
disjoint sets that represent the two haplotypes. This process,
however, is complicated by errors in sequencing as well as
genotyping. For these reasons, assembling haplotypes directly
from sequencing data is computationally challenging, and the
resulting optimization problems are provably hard14,15.
Notwithstanding, a number of computational approaches for
read-based phasing have recently been developed16 and,
particularly, progress on fixed-parameter tractable algorithms has
enabled solving read-based phasing in practice17–19, for instance
through the implementations available in the software package
WhatsHap20. Beyond phasing reads aligned to a reference
genome, various approaches for haplotype-resolved de novo
assembly have been explored21–25.
However, all approaches to reconstruct haplotypes from
sequencing reads, be they reference-based or reference-free, come
with the intrinsic limitation that the distance between subsequent
heterozygous markers can be larger than the read length itself.
While long-read sequencing (such as PacBio SMRT26 and Oxford
Nanopore MinION27), or linked-read data (such as those
provided by 10X Genomics28) help to mitigate this issue, these
technologies fail to phase over longer stretches of homozygosity,
repeat-rich areas including segmental duplications, and centromeres. Thus, specialized techniques that enable homologous
chromosomes to be discriminated are required to physically
[Figure with panels a-d. Recoverable labels: legend (homozygosity region, heterozygous alleles, centromere, unknown phase, gene, enhancer); SNV density and cost/labor for read-based phasing versus experimental phasing; plots of % of covered benchmark SNVs, SNVs in the largest segment, number of phased segments, and % of switch errors for PacBio only, 10xGen only, Illumina only and StrandS only; chromosome 1 example, length of the longest haplotype: Illumina 15,994 bp, PacBio 1,711,716 bp, 10xGen 8,582,136 bp, Strand-seq 248,671,482 bp.]
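The read-partitioning step described in the introduction, in which haplotype-informative reads are split into two disjoint sets despite sequencing errors, can be made concrete with a small sketch. It assumes the common minimum error correction (MEC) formulation, with all variants heterozygous and biallelic, and it uses exhaustive search, which is exponential in the number of variants and only workable for toy inputs. It is not the algorithm implemented in WhatsHap; the names `mec_cost`, `phase_brute_force` and the toy reads are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Read = Dict[int, int]  # variant index -> observed allele (0 = ref, 1 = alt)


def mec_cost(reads: List[Read], hap: Tuple[int, ...]) -> int:
    """Number of allele flips needed so each read fits hap or its complement."""
    cost = 0
    for read in reads:
        mismatches = sum(allele != hap[i] for i, allele in read.items())
        # Assumption: at heterozygous sites the two haplotypes are
        # complementary, so a read is charged against the closer one.
        cost += min(mismatches, len(read) - mismatches)
    return cost


def phase_brute_force(reads: List[Read], n_variants: int):
    """Try every haplotype; return (haplotype, cost, read partition)."""
    best_hap, best_cost = None, None
    for hap in product((0, 1), repeat=n_variants):
        if hap[0] == 1:
            continue  # a haplotype and its complement are equivalent
        cost = mec_cost(reads, hap)
        if best_cost is None or cost < best_cost:
            best_hap, best_cost = hap, cost
    # Assign each read to the haplotype it matches best (ties go to set 0).
    set0, set1 = [], []
    for idx, read in enumerate(reads):
        mismatches = sum(a != best_hap[i] for i, a in read.items())
        (set0 if mismatches <= len(read) - mismatches else set1).append(idx)
    return best_hap, best_cost, (set0, set1)


# Toy reads over four heterozygous variants (made up for illustration).
reads = [
    {0: 0, 1: 1},
    {1: 1, 2: 0},
    {0: 1, 1: 0},
    {1: 0, 2: 1, 3: 1},
    {2: 0, 3: 0},
    {2: 0, 3: 1},  # conflicts with the previous read: one allele is an error
]
hap, cost, partition = phase_brute_force(reads, n_variants=4)
print(hap, cost, partition)
```

Because the last two reads contradict each other, no partition is error-free; the search settles on the haplotype that requires a single allele correction. Practical tools avoid the exhaustive loop, for example through the fixed-parameter tractable algorithms cited in the text.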
(...truncated)