Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads
ARTICLE
https://doi.org/10.1038/s41467-021-21395-x
OPEN
Xin Zhou1,4 ✉, Lu Zhang2,5, Ziming Weng2, David L. Dill1 & Arend Sidow2,3 ✉
We introduce Aquila, a new approach to variant discovery in personal genomes, which is
critical for uncovering the genetic contributions to health and disease. Aquila uses a reference
sequence and linked-read data to generate a high quality diploid genome assembly, from
which it then comprehensively detects and phases personal genetic variation. The contigs of
the assemblies from our libraries cover >95% of the human reference genome, with over
98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide
polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants
(SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that
can approach arm-level length. The final output of Aquila is a diploid and phased personal
genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective
approach that can be applied to cohorts for variation discovery or association studies, or to
single individuals with rare phenotypes that could be caused by SVs or compound
heterozygosity.
1 Department of Computer Science, Stanford University, Stanford, CA, USA. 2 Department of Pathology, Stanford University, Stanford, CA, USA. 3 Department
of Genetics, Stanford University, Stanford, CA, USA. 4Present address: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA.
5Present address: Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong. ✉email: ;
NATURE COMMUNICATIONS | (2021)12:1077 | https://doi.org/10.1038/s41467-021-21395-x | www.nature.com/naturecommunications
Despite recent advances, quantifying the contribution of
genetic variation to specific disease risk is a stubborn
biomedical problem that remains far from solved. In
general, understanding the relationship between genotype and
phenotype requires complete ascertainment of genotype, which
for humans has yet to be achieved in a scalable fashion. At this
stage in technology development, DNA sequencing still faces a
vexing tradeoff between cost and completeness so that discovery
of variation in larger cohorts is limited to SNPs and small indels.
In fact, the relatively low cost of Illumina-based short-fragment
whole genome sequencing and the even lower cost of exomes and
genotyping arrays have caused considerable ascertainment bias
such that the vast majority of genotype–phenotype associations
focus on SNPs with small effect, even though the undetected
larger variation is known to involve roughly as many bases in our
genomes as SNPs and is therefore predicted to have significant
phenotypic impact as well1,2. Also generally missing is the
phasing of genetic variation, which is similarly important for
estimation of phenotypic impact, as the distinction between cis
and trans compound heterozygotes in an essential locus can mean
the difference between health and disease3 and is likely to modulate risk of multigenic disease as well4.
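To make the cis/trans distinction concrete, the following minimal sketch classifies two heterozygous variants from their phased genotype strings. It relies on the VCF convention that `|` separates phased alleles and `/` separates unphased ones. The function name `phase_relationship` is hypothetical, the variants are assumed to be biallelic, and both genotypes are assumed to come from the same phase block; none of this is taken from the Aquila implementation.

```python
def phase_relationship(gt_a: str, gt_b: str) -> str:
    """Classify two biallelic heterozygous genotypes as 'cis' or 'trans'.

    Assumption: both genotypes belong to the same phase block, so the
    left and right alleles refer to the same two haplotypes.
    """
    # Unphased genotypes ("0/1") carry no haplotype assignment.
    if "|" not in gt_a or "|" not in gt_b:
        return "unphased"
    a = gt_a.split("|")
    b = gt_b.split("|")
    # Only biallelic heterozygous calls (one "0" and one "1") are handled.
    if sorted(a) != ["0", "1"] or sorted(b) != ["0", "1"]:
        return "not heterozygous"
    # Index (0 or 1) of the haplotype that carries the alternate allele.
    alt_hap_a = a.index("1")
    alt_hap_b = b.index("1")
    return "cis" if alt_hap_a == alt_hap_b else "trans"


if __name__ == "__main__":
    print(phase_relationship("0|1", "0|1"))  # both alternates on one haplotype
    print(phase_relationship("0|1", "1|0"))  # alternates on opposite haplotypes
```

In the trans case each copy of the locus carries one variant, which is the configuration in which two damaging variants can leave no intact copy.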
Single-molecule sequencing approaches, particularly Pacific
Biosciences (PacBio) and Oxford Nanopore Technologies (ONT),
provide potential solutions, as long-range information allows
accurate detection of SVs and phasing5–7. Despite recent
improvements in base calling, the drawback of ONT is that it still
exhibits lower base-pair level accuracy than Illumina. A widely
applied solution has been to supplement long reads with higher
quality short read data, but these ensemble approaches are difficult to scale to larger cohorts due to the complexity of data
generation, integration, and analysis, and have therefore been
limited to small sample sizes in proof-of-principle studies8,9. A
solution to making long reads more accurate is to sequence the
same single molecule multiple times to reduce error, as implemented in the PacBio circular consensus sequencing approach,
now called HiFi10–12. However, HiFi requires large amounts of
input DNA and several-fold oversampling of the same molecule,
a currently expensive proposition for anything but small
sample sizes.
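The error reduction from repeated passes can be illustrated with an idealized calculation. The sketch below assumes that each pass miscalls a given base independently with probability `p` and that the consensus is a simple majority vote over an odd number of passes. Both are simplifying assumptions made here for illustration; they do not describe the actual circular consensus algorithm, and the function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(p: float, n_passes: int) -> float:
    """Probability that a majority of n_passes independent passes are wrong.

    Idealized model: every pass errs independently with probability p,
    and any two erroneous passes agree with each other (worst case).
    """
    if n_passes < 1 or n_passes % 2 == 0:
        raise ValueError("n_passes must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    majority = n_passes // 2 + 1
    # Binomial tail: at least `majority` of the passes are erroneous.
    return sum(
        comb(n_passes, k) * p**k * (1 - p) ** (n_passes - k)
        for k in range(majority, n_passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, majority_vote_error(0.1, n))
```

Under this model the consensus error falls quickly with the number of passes, but each added pass consumes sequencing throughput on the same molecule, which is the cost tradeoff described above.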
A relatively recent addition to the DNA-sequencing ecosystem
has been pioneered by 10X Genomics, wherein the original large
molecules of a gentle DNA preparation are partitioned into
microfluidic compartments13,14. Via a series of within-compartment molecular biology and subsequent standard steps
of library construction and sequencing, barcoded short reads are
produced that retain the long-range information of the long
fragments of the initial DNA extract. Due to the combination of
high base pair-level sequence accuracy and long-range information, 10X/Illumina data therefore support excellent SNP and
small indel detection and phasing13, as well as breakpoint
detection of large events in cancer13,15,16. For diploid genome
reconstruction, 10X developed the de novo assembler, Supernova,
which has been shown to produce whole human genome
assemblies from 56-fold coverage 10X/Illumina data17,18.
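How barcodes retain long-range information can be sketched as follows: aligned reads that share a barcode and lie close together on the reference are grouped into one inferred original molecule. This is a minimal illustration, not the procedure used by any specific tool. The `Read` and `Molecule` types, the function `infer_molecules`, and the 50 kb `max_gap` threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    barcode: str
    chrom: str
    pos: int  # alignment start on the reference


class Molecule(NamedTuple):
    barcode: str
    chrom: str
    start: int
    end: int
    n_reads: int


def infer_molecules(reads: Iterable[Read], max_gap: int = 50_000) -> List[Molecule]:
    """Group reads by (barcode, chromosome), then split at large gaps.

    max_gap is an assumed threshold: two consecutive reads with the same
    barcode that are further apart are treated as different molecules.
    """
    positions_by_key = defaultdict(list)
    for read in reads:
        positions_by_key[(read.barcode, read.chrom)].append(read.pos)

    molecules = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, open a new one.
                molecules.append(Molecule(barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append(Molecule(barcode, chrom, start, prev, count))
    return molecules


if __name__ == "__main__":
    demo = [
        Read("AAAC", "chr1", 100),
        Read("AAAC", "chr1", 20_000),
        Read("AAAC", "chr1", 45_000),
        Read("AAAC", "chr1", 900_000),
        Read("AAAC", "chr1", 910_000),
        Read("GGTT", "chr1", 30_000),
    ]
    for molecule in infer_molecules(demo):
        print(molecule)
```

Splitting at large gaps matters because a single compartment can contain several unrelated long fragments, so a shared barcode alone does not imply a shared molecule.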
The application of assembly approaches to human genomes
has been limited even though they allow powerful identification
of SVs8,19,20. Long-read-based assemblies, such as those from
PacBio data performed by FALCON-Unzip21, exhibit respectable
contiguity and variant detection but still suffer from high cost9.
Supernova assemblies based on 10X/Illumina data are less
expensive and allow detection of all types of variation but power
is limited because a substantial fraction of the genome is not
assembled in a diploid state and genotyping error is still high22.
Overall, cost-effective assembly-based approaches still suffer from
incomplete resolution of the diploid genome and limited power of
variant detection in a personal genome. On the other hand,
assembly-based approaches have two advantages: detection of
variants is greatly simplified to pairwise alignments rather than
complicated read-map-based inference, which is particularly
challenging for indels; and sequences not present in the reference
can be detected.
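The first advantage can be illustrated with a short sketch: once a contig has been aligned to the reference, variants can be read directly off the alignment columns. The code below assumes a gapped pairwise alignment given as two equal-length strings with `-` marking gaps. The function name `call_variants` and the output tuple layout are hypothetical, and indels are not left-anchored as they would be in a VCF record.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str, str]  # (1-based ref position, type, ref, alt)


def call_variants(ref_aln: str, ctg_aln: str, ref_start: int = 1) -> List[Variant]:
    """Walk a gapped reference/contig alignment and report differences.

    Insertions are reported at the last reference base before the
    inserted sequence; deletions at the first deleted reference base.
    """
    if len(ref_aln) != len(ctg_aln):
        raise ValueError("aligned strings must have equal length")

    variants: List[Variant] = []
    ref_pos = ref_start - 1  # position of the last consumed reference base
    i, n = 0, len(ref_aln)
    while i < n:
        r, c = ref_aln[i], ctg_aln[i]
        if r == "-":
            # Gap in the reference: the contig carries an insertion.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            variants.append((ref_pos, "INS", "", ctg_aln[i:j]))
            i = j
        elif c == "-":
            # Gap in the contig: reference bases are deleted.
            j = i
            while j < n and ctg_aln[j] == "-":
                j += 1
            variants.append((ref_pos + 1, "DEL", ref_aln[i:j], ""))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if r != c:
                variants.append((ref_pos, "SNP", r, c))
            i += 1
    return variants


if __name__ == "__main__":
    for variant in call_variants("ACGT--ACGTAC", "ACCTGGAC--AC"):
        print(variant)
```

Because the inserted or deleted sequence is present explicitly in the alignment, indels need no inference from read depth or split reads, which is the simplification the paragraph above refers to.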
Compared to reference-based approaches23, the competitive
disadvantage of de novo assembly methods is that they disregard
the high information content of the reference. Depending on
genetic background, >99% of anybody’s two haplomes outside of
centromeres and telomeres is identical to the reference, which
therefore constitutes a highly accurate scaffold for personal genomes. It stands to reason that, in principle, an assembly-based
method that incorporates information from the refe (...truncated)