Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads

https://www.nature.com/articles/s41467-021-21395-x.pdf

ARTICLE | OPEN | https://doi.org/10.1038/s41467-021-21395-x

Xin Zhou1,4 ✉, Lu Zhang2,5, Ziming Weng2, David L. Dill1 & Arend Sidow2,3 ✉

We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

1 Department of Computer Science, Stanford University, Stanford, CA, USA. 2 Department of Pathology, Stanford University, Stanford, CA, USA. 3 Department of Genetics, Stanford University, Stanford, CA, USA. 4 Present address: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA. 5 Present address: Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong.
NATURE COMMUNICATIONS | (2021) 12:1077 | https://doi.org/10.1038/s41467-021-21395-x | www.nature.com/naturecommunications

Despite recent advances, quantifying the contribution of genetic variation to specific disease risk is a stubborn biomedical problem that remains far from solved. In general, understanding the relationship between genotype and phenotype requires complete ascertainment of genotype, which for humans has yet to be achieved in a scalable fashion. At this stage in technology development, DNA sequencing still faces a vexing tradeoff between cost and completeness so that discovery of variation in larger cohorts is limited to SNPs and small indels. In fact, the relatively low cost of Illumina-based short-fragment whole genome sequencing and the even lower cost of exomes and genotyping arrays has caused considerable ascertainment bias such that the vast majority of genotype–phenotype associations focus on SNPs with small effect, even though the undetected larger variation is known to involve roughly as many bases in our genomes as SNPs and is therefore predicted to have significant phenotypic impact as well1,2. Also generally missing is the phasing of genetic variation, which is similarly important for estimation of phenotypic impact, as the distinction between cis and trans compound heterozygotes in an essential locus can mean the difference between health and disease3 and is likely to modulate risk of multigenic disease as well4.

Single-molecule sequencing approaches, particularly Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), provide potential solutions, as long-range information allows accurate detection of SVs and phasing5–7. Despite recent improvements in base calling, the drawback of ONT is that it still exhibits lower base-pair level accuracy than Illumina.
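The cis/trans distinction raised above can be made concrete with phased genotypes as they appear in a VCF file, where `0|1` and `1|0` place the alternate allele on different haplotypes within one phase set. The following is a minimal illustrative sketch, not part of Aquila; the function names and the simplified inputs (a GT string and a phase-set identifier per variant) are assumptions made for this example.

```python
from typing import Optional, Tuple


def parse_phased_gt(gt: str) -> Optional[Tuple[int, int]]:
    """Return (allele on haplotype 1, allele on haplotype 2) for a phased
    diploid GT such as '0|1'; return None for unphased or missing calls."""
    if "|" not in gt:
        return None  # '0/1' style genotypes carry no phase information
    parts = gt.split("|")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


def compound_het_configuration(gt_a: str, ps_a: int, gt_b: str, ps_b: int) -> str:
    """Classify two variants in the same locus as 'cis', 'trans',
    'not_compound_het', or 'unknown' (hypothetical helper)."""
    a = parse_phased_gt(gt_a)
    b = parse_phased_gt(gt_b)
    # Phase is only comparable inside one phase block (same phase set).
    if a is None or b is None or ps_a != ps_b:
        return "unknown"
    # A heterozygous variant carries the alternate allele on exactly one haplotype.
    a_is_het = (a[0] != 0) != (a[1] != 0)
    b_is_het = (b[0] != 0) != (b[1] != 0)
    if not (a_is_het and b_is_het):
        return "not_compound_het"
    hap_a = 0 if a[0] != 0 else 1
    hap_b = 0 if b[0] != 0 else 1
    return "cis" if hap_a == hap_b else "trans"
```

In the trans case both copies of the locus carry one variant each, which is the configuration that can disrupt an essential gene on both haplotypes; in the cis case one copy remains unaffected.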
A widely applied solution has been to supplement long reads with higher quality short read data, but these ensemble approaches are difficult to scale to larger cohorts due to the complexity of data generation, integration, and analysis, and have therefore been limited to small sample sizes in proof-of-principle studies8,9. A solution to making long reads more accurate is to sequence the same single molecule multiple times to reduce error, as implemented in the PacBio circular consensus sequencing approach, now called HiFi10–12. However, HiFi requires large amounts of input DNA and several-fold oversampling of the same molecule, a currently expensive proposition for anything but small sample sizes.

A relatively recent addition to the DNA-sequencing ecosystem has been pioneered by 10X Genomics, wherein the original large molecules of a gentle DNA preparation are partitioned into microfluidic compartments13,14. Via a series of within-compartment molecular biology and subsequent standard steps of library construction and sequencing, barcoded short reads are produced that retain the long-range information of the long fragments of the initial DNA extract. Due to the combination of high base pair-level sequence accuracy and long-range information, 10X/Illumina data therefore support excellent SNP and small indel detection and phasing13, as well as breakpoint detection of large events in cancer13,15,16. For diploid genome reconstruction, 10X developed the de novo assembler, Supernova, which has been shown to produce whole human genome assemblies from 56-fold coverage 10X/Illumina data17,18.

The application of assembly approaches to human genomes has been limited even though they allow powerful identification of SVs8,19,20. Long-read-based assemblies, such as those from PacBio data performed by FALCON-Unzip21, exhibit respectable contiguity and variant detection but still suffer from high cost9.
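To illustrate how barcoded short reads can retain the long-range information described above, the sketch below groups aligned reads that share a barcode and splits them into inferred source molecules wherever consecutive reads lie far apart. This is a commonly used heuristic shown under stated assumptions, not Aquila's or 10X's actual implementation; the function name, the simplified read tuple, and the 50 kb gap threshold are all hypothetical choices for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str, int]                  # (barcode, chromosome, start position)
Molecule = Tuple[str, str, int, int, int]    # (barcode, chromosome, first, last, n_reads)


def infer_molecules(reads: List[Read], max_gap: int = 50_000) -> List[Molecule]:
    """Group reads by (barcode, chromosome) and start a new molecule whenever
    the distance to the previous read exceeds max_gap (assumed threshold)."""
    positions_by_key: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    molecules: List[Molecule] = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        n_reads = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule and open a new one.
                molecules.append((barcode, chrom, start, prev, n_reads))
                start, n_reads = pos, 0
            prev = pos
            n_reads += 1
        molecules.append((barcode, chrom, start, prev, n_reads))
    return molecules


example_reads = [
    ("BX1", "chr1", 1_000),
    ("BX1", "chr1", 21_000),
    ("BX1", "chr1", 500_000),  # far from the first two reads: a separate molecule
    ("BX2", "chr1", 5_000),
]
example_molecules = infer_molecules(example_reads)
```

Each inferred molecule spans a region much longer than any single short read, which is what allows variants covered by different reads of the same molecule to be assigned to the same haplotype.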
Supernova assemblies based on 10X/Illumina data are less expensive and allow detection of all types of variation but power is limited because a substantial fraction of the genome is not assembled in a diploid state and genotyping error is still high22. Overall, cost-effective assembly-based approaches still suffer from incomplete resolution of the diploid genome and limited power of variant detection in a personal genome. On the other hand, assembly-based approaches have two advantages: detection of variants is greatly simplified to pairwise alignments rather than complicated read-map-based inference, which is particularly challenging for indels; and the detection of sequences not present in the reference. Compared to reference-based approaches23, the competitive disadvantage of de novo assembly methods is that they disregard the high information content of the reference. Depending on genetic background, >99% of anybody’s two haplomes outside of centromeres and telomeres is identical to the reference, which therefore constitutes a highly accurate scaffold for personal genomes. It stands to reason that, in principle, an assembly-based method that incorporates information from the refe (...truncated)