Chromosome-scale, haplotype-resolved assembly of human genomes
Letters
https://doi.org/10.1038/s41587-020-0711-0
Shilpa Garg 1,2,3 ✉, Arkarachai Fungtammasan4, Andrew Carroll5, Mike Chou1, Anthony Schmitt6,
Xiang Zhou6, Stephen Mac6, Paul Peluso7, Emily Hatas7, Jay Ghurye8, Jared Maguire8,
Medhat Mahmoud 9, Haoyu Cheng2,3, David Heller 10, Justin M. Zook 11, Tobias Moemke12,
Tobias Marschall 12,13, Fritz J. Sedlazeck 9, John Aach1, Chen-Shan Chin 4 ✉, George M. Church 1 ✉
and Heng Li 2,3 ✉
Haplotype-resolved or phased genome assembly provides
a complete picture of genomes and their complex genetic
variations. However, current algorithms for phased assembly
either do not generate chromosome-scale phasing or require
pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses
long, accurate reads and long-range conformation data for
single individuals to generate a chromosome-scale phased
assembly within 1 day. Applied to four public human genomes,
PGP1, HG002, NA12878 and HG00733, DipAsm produced
haplotype-resolved assemblies with minimum contig length
needed to cover 50% of the known genome (NG50) up to
25 Mb and phased ~99.5% of heterozygous sites at 98–99%
accuracy, outperforming other approaches in terms of both
contiguity and phasing completeness. We demonstrate the
importance of chromosome-scale phased assemblies for the
discovery of structural variants (SVs), including thousands
of new transposon insertions, and of highly polymorphic and
medically important regions such as the human leukocyte
antigen (HLA) and killer cell immunoglobulin-like receptor
(KIR) regions. DipAsm will facilitate high-quality precision
medicine and studies of individual haplotype variation and
population diversity.
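The NG50 statistic quoted above can be computed as in the following minimal sketch. The function name `ng50` and the toy contig lengths are illustrative assumptions and are not taken from the paper or its software.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return the NG50 of an assembly.

    Contigs are visited from longest to shortest; the NG50 is the length of
    the contig at which the running total first reaches half of the known
    genome size. Returns None if the assembly never covers half the genome.
    """
    half = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None


# Toy example with made-up contig lengths (not data from the paper).
toy_contigs = [40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=120))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the known genome size, so assemblies of the same genome can be compared directly.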
Humans contain two homologous copies of every chromosome,
and deriving the genome sequence of each copy is essential to correctly understand allele-specific DNA methylation and gene expression, and to analyze evolution, forensics and genetic diseases1.
However, traditional de novo assembly algorithms that reconstruct
genome sequences often represent the sample as a haploid genome.
For a diploid genome such as the human genome, this collapsed
representation loses half of the heterozygous variation
in the genome, may introduce assembly errors in regions that diverge
between haplotypes and may inflate the assembly size for species
with high heterozygosity2. Several algorithms have been proposed
to generate haplotype-resolved assemblies, also known as phased
assemblies. Early efforts such as FALCON-Unzip3, Supernova4
and our previous work5 used relatively short-range sequence data
for phasing and can resolve haplotypes only up to several megabases for human samples. These methods are unable to phase
through centromeres or long repeats. FALCON-Phase6, which
extends FALCON-Unzip, uses Hi-C to connect phased sequence
blocks and can generate longer haplotypes, but it cannot achieve
chromosome-long phasing. Trio binning7,8 is the only published
method that can assemble and phase entire
chromosomes. It uses sequence reads from both parents to partition the offspring’s long reads and then assembles each partition
separately. However, trio binning cannot resolve regions that are heterozygous in all three samples of the trio and leaves such regions
unphased. More importantly, parental samples are not always available, for example, for samples caught in the wild or when parents
are deceased. For Mendelian diseases, de novo mutations in the offspring are absent from both parents and therefore cannot be phased from parental data if there are
no other heterozygous sites nearby. These constraints limit the application of trio
binning. Therefore, we currently lack methods that can accurately
produce a phased assembly for a single individual and keep pace with
sequencing technology innovations.
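The idea behind trio binning described above can be illustrated with a simplified sketch. It assumes a k-mer-based scheme in which a read is assigned to the parent whose parent-specific k-mers it shares most; the function names, the tiny k and the toy sequences are hypothetical, and published implementations differ in detail (for example in how counts are normalized and how sequencing errors are filtered).

```python
from typing import Iterable, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(
    maternal_reads: Iterable[str], paternal_reads: Iterable[str], k: int
) -> Tuple[Set[str], Set[str]]:
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat: Set[str] = set()
    for read in maternal_reads:
        mat |= kmers(read, k)
    pat: Set[str] = set()
    for read in paternal_reads:
        pat |= kmers(read, k)
    return mat - pat, pat - mat


def bin_read(read: str, mat_only: Set[str], pat_only: Set[str], k: int) -> str:
    """Assign an offspring read to a parental haplotype, if possible."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    # No distinguishing k-mers (or a tie): the read stays unassigned.
    return "unassigned"


# Toy sequences, made up for illustration only.
mat_only, pat_only = parent_specific_kmers(["ACGTAC"], ["ACGGAC"], k=3)
for child_read in ["CGTAC", "GGAC", "ACG"]:
    print(child_read, bin_read(child_read, mat_only, pat_only, k=3))
```

The third toy read carries only k-mers shared by both parents and stays unassigned. This shows, in a simplified way, how regions in which the parents cannot be told apart remain unphased.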
To overcome the limitations in existing methods, we combined
recent advances in long-read assembly and Hi-C-based phasing to
develop DipAsm, which accurately reconstructs the two haplotypes
in a diploid individual using only PacBio’s long high-fidelity (HiFi)
reads9 and Hi-C data10, both at ~30-fold coverage, without any pedigree information (Fig. 1). Starting with an unphased Peregrine11
assembly scaffolded by 3D-DNA12 or HiRise13, our pipeline calls
small variants with DeepVariant14, phases them with WhatsHap15
and HapCUT2 (ref. 16), partitions the reads and assembles each
partition independently with Peregrine again (Methods). Grouping
contigs into chromosome-long scaffolds is necessary for phasing of
entire chromosomes by WhatsHap and HapCUT2.
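The read-partitioning step can be illustrated with a minimal sketch. It assumes that an upstream phasing step has already given each read a haplotype label (1, 2 or none). The function name `partition_reads`, the input dictionary and the policy of copying unlabeled reads into both partitions are hypothetical choices made for illustration and are not taken from the DipAsm Methods.

```python
from typing import Dict, List, Optional, Tuple


def partition_reads(
    read_haplotypes: Dict[str, Optional[int]]
) -> Tuple[List[str], List[str]]:
    """Split read names into two haplotype partitions.

    read_haplotypes maps a read name to 1, 2 or None (no haplotype label).
    """
    hap1: List[str] = []
    hap2: List[str] = []
    for name, hap in read_haplotypes.items():
        if hap == 1:
            hap1.append(name)
        elif hap == 2:
            hap2.append(name)
        elif hap is None:
            # Assumption: unlabeled reads (for example from homozygous
            # regions) are placed in both partitions so that neither
            # haplotype assembly loses coverage there.
            hap1.append(name)
            hap2.append(name)
        else:
            raise ValueError(f"unexpected haplotype label {hap!r} for {name}")
    return hap1, hap2


# Toy labels, made up for illustration only.
labels = {"read_a": 1, "read_b": 2, "read_c": None}
hap1_reads, hap2_reads = partition_reads(labels)
print(hap1_reads)
print(hap2_reads)
```

Each of the two resulting read sets would then be assembled independently, which is the final step of the pipeline described above.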
We demonstrate our method on four human genomes: PGP1
from the Personal Genome Project, HG002 and NA12878 from
the Genome in a Bottle dataset17,18 (GIAB) and HG00733 from the
Human Genome Structural Variation Consortium (HGSVC)19. We
produced HiFi data for the PGP1 genome and Hi-C data for HG002
and HG00733, and assembled the samples with DipAsm (Table 1).
1Department of Genetics, Harvard Medical School, Boston, MA, USA. 2Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
3Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. 4DNAnexus, Mountain View, CA, USA. 5Google, Mountain View,
CA, USA. 6Arima Genomics, San Diego, CA, USA. 7Pacific Biosciences, Menlo Park, CA, USA. 8Dovetail Genomics, Scotts Valley, CA, USA. 9Human
Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. 10Max Planck Institute for Molecular Genetics, Berlin, Germany. 11Material
Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA. 12Saarland University, Saarbrücken, Germany. 13Max
Planck Institute for Informatics, Saarbrücken, Germany. ✉e-mail: ; ; ;
Nature Biotechnology | VOL 39 | March 2021 | 309–312 | www.nature.com/naturebiotechnology
[Fig. 1 diagram: PacBio HiFi and Hi-C inputs pass through steps (1) to (6), producing in turn unphased contigs, unphased scaffolds, variant calls, phased variants, phased reads and phased contigs.]
Fig. 1 | Outline of the phased assembly algorithm, DipAsm. Assemble
HiFi reads into unphased contigs using Peregrine (1); group and order
contigs into scaffolds with Hi-C data using HiRise/3D-DNA (3D de novo
assembly) (2); map HiFi reads to scaffolds and call heterozygous SNPs
using DeepVariant (3); phase heterozygous SNP calls with both HiFi and
Hi-C data using WhatsHap plus HapCUT2 (4); partition reads based on
their phase using WhatsHap (5); assemble partitioned reads into phased
contigs using Peregrine (6).
For HG002, we also generated a trio-binning-based assembly
with Peregrine using parental Illumina reads (Trio Peregrine in
Table 1) and obtained a published Trio Canu assembly9 for comparison (Table 1). All HG002 assemblies took the same HiFi data
as input. For HG00733, we downloaded a FALCON-Phase assembly6 and a recent assembly built from HiFi and Strand-seq data20.
The Strand-seq assembly and our asse (...truncated)