Chromosome-scale, haplotype-resolved assembly of human genomes

Letters https://doi.org/10.1038/s41587-020-0711-0

Shilpa Garg1,2,3 ✉, Arkarachai Fungtammasan4, Andrew Carroll5, Mike Chou1, Anthony Schmitt6, Xiang Zhou6, Stephen Mac6, Paul Peluso7, Emily Hatas7, Jay Ghurye8, Jared Maguire8, Medhat Mahmoud9, Haoyu Cheng2,3, David Heller10, Justin M. Zook11, Tobias Moemke12, Tobias Marschall12,13, Fritz J. Sedlazeck9, John Aach1, Chen-Shan Chin4 ✉, George M. Church1 ✉ and Heng Li2,3 ✉

Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with a minimum contig length needed to cover 50% of the known genome (NG50) of up to 25 Mb and phased ~99.5% of heterozygous sites at 98–99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity.
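The NG50 statistic quoted in the abstract can be made concrete with a short sketch. It follows the definition given there (the minimum contig length needed to cover 50% of the known genome); the function name `ng50`, the toy contig lengths and the choice to return 0 when the contigs cover less than half of the genome are illustrative assumptions, not part of DipAsm.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Return the NG50 of an assembly.

    Contigs are taken from longest to shortest; the NG50 is the length of
    the contig at which the running total first reaches half of the known
    genome size. Returns 0 if the contigs never cover half of the genome
    (a convention assumed for this sketch).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")

    target = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0
```

For toy contigs of lengths 40, 30, 20 and 10 and a known genome size of 120, half of the genome (60) is reached after the second contig, so `ng50([40, 30, 20, 10], 120)` returns 30. Unlike N50, the denominator is the known genome size rather than the total assembly length, which makes values comparable across assemblies of the same genome.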
Humans contain two homologous copies of every chromosome, and deriving the genome sequence of each copy is essential to correctly understand allele-specific DNA methylation and gene expression, and to analyze evolution, forensics and genetic diseases1. However, traditional de novo assembly algorithms that reconstruct genome sequences often represent the sample as a haploid genome. For a diploid genome such as the human genome, this collapsed representation results in the loss of half of heterozygous variations in the genome, may introduce assembly errors in regions diverged between haplotypes and may lead to inflated assembly for species with high heterozygosity2.

Several algorithms have been proposed to generate haplotype-resolved assemblies, also known as phased assemblies. Early efforts such as FALCON-Unzip3, Supernova4 and our previous work5 used relatively short-range sequence data for phasing and can resolve haplotypes only up to several megabases for human samples. These methods are unable to phase through centromeres or long repeats. FALCON-Phase6, which extends FALCON-Unzip, uses Hi-C to connect phased sequence blocks and can generate longer haplotypes, but it cannot achieve chromosome-long phasing. Trio binning7,8 is the only published method that can both assemble and phase entire chromosomes. It uses sequence reads from both parents to partition the offspring’s long reads and then assembles each partition separately. However, trio binning is unable to resolve regions heterozygous in all three samples in the trio and will leave such regions unphased. More importantly, parental samples are not always available, for example for samples caught in the wild or when parents are deceased. For Mendelian diseases, de novo mutations in the offspring will not be captured and phased with the parents if there are no other heterozygotes nearby. This limits the application of trio binning.
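The read-partitioning idea behind trio binning can be illustrated with a toy sketch. It assumes, as a simplification, that each parent contributes k-mers absent from the other parent and that an offspring read is assigned to whichever parent shares more of these parent-specific k-mers. The function names, the tiny k and the sequences are hypothetical; reverse complements, sequencing errors and k-mer count thresholds are ignored, so this is not the published trio-binning implementation.

```python
from typing import Iterable, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(
    maternal_reads: Iterable[str], paternal_reads: Iterable[str], k: int
) -> Tuple[Set[str], Set[str]]:
    """Return (maternal-only, paternal-only) k-mer sets."""
    maternal: Set[str] = set()
    for read in maternal_reads:
        maternal |= kmers(read, k)
    paternal: Set[str] = set()
    for read in paternal_reads:
        paternal |= kmers(read, k)
    # Keep only k-mers seen in exactly one parent.
    return maternal - paternal, paternal - maternal


def bin_read(read: str, maternal_only: Set[str], paternal_only: Set[str], k: int) -> str:
    """Assign an offspring read to a parental bin by counting specific k-mers."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    # No informative k-mers, or a tie: the read cannot be binned.
    return "unassigned"


# Toy data: the two parental sequences differ at a single base.
maternal_only, paternal_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
```

In this toy setting a read that overlaps the differing base is binned to one parent, while a read covering only sequence shared by both parents stays unassigned, which mirrors why regions without informative parental differences remain unphased.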
Therefore, we currently lack methods that can accurately produce a phased assembly for a single individual and keep pace with sequencing technology innovations. To overcome the limitations in existing methods, we combined recent advances in long-read assembly and Hi-C-based phasing to develop DipAsm, which accurately reconstructs the two haplotypes in a diploid individual using only PacBio’s long high-fidelity (HiFi) reads9 and Hi-C data10, both at ~30-fold coverage, without any pedigree information (Fig. 1). Starting with an unphased Peregrine11 assembly scaffolded by 3D-DNA12 or HiRise13, our pipeline calls small variants with DeepVariant14, phases them with WhatsHap15 and HapCUT2 (ref. 16), partitions the reads and assembles each partition independently with Peregrine again (Methods). Grouping contigs into chromosome-long scaffolds is necessary for phasing of entire chromosomes by WhatsHap and HapCUT2. We demonstrate our method on four human genomes: PGP1 from the Personal Genome Project, HG002 and NA12878 from the Genome in a Bottle (GIAB) dataset17,18 and HG00733 from the Human Genome Structural Variation Consortium (HGSVC)19. We produced HiFi data for the PGP1 genome and Hi-C data for HG002 and HG00733, and assembled the samples with DipAsm (Table 1).

1Department of Genetics, Harvard Medical School, Boston, MA, USA. 2Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA. 3Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. 4DNAnexus, Mountain View, CA, USA. 5Google, Mountain View, CA, USA. 6Arima Genomics, San Diego, CA, USA. 7Pacific Biosciences, Menlo Park, CA, USA. 8Dovetail Genomics, Scotts Valley, CA, USA. 9Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. 10Max Planck Institute for Molecular Genetics, Berlin, Germany. 11Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA. 12Saarland University, Saarbrücken, Germany.
13Max Planck Institute for Informatics, Saarbrücken, Germany.

Nature Biotechnology | VOL 39 | March 2021 | 309–312 | www.nature.com/naturebiotechnology

Fig. 1 | Outline of the phased assembly algorithm, DipAsm. Assemble HiFi reads into unphased contigs using Peregrine (1); group and order contigs into scaffolds with Hi-C data using HiRise/3D-DNA (3D de novo assembly) (2); map HiFi reads to scaffolds and call heterozygous SNPs using DeepVariant (3); phase heterozygous SNP calls with both HiFi and Hi-C data using WhatsHap plus HapCUT2 (4); partition reads based on their phase using WhatsHap (5); assemble partitioned reads into phased contigs using Peregrine (6).

For HG002, we also generated a trio-binning-based assembly with Peregrine using parental Illumina reads (Trio Peregrine in Table 1) and obtained a published Trio Canu assembly9 for comparison (Table 1). All HG002 assemblies took the same HiFi data as input. For HG00733, we downloaded a FALCON-Phase assembly6 and a recent assembly built from HiFi and Strand-seq data20. The Strand-seq assembly and our asse (...truncated)