Read length versus Depth of Coverage for Viral Quasispecies Reconstruction
Citation: Zagordi O, Daumer M, Beisel C, Beerenwinkel N (
Osvaldo Zagordi
Martin Däumer
Christian Beisel
Niko Beerenwinkel
Art F. Y. Poon, British Columbia Centre for Excellence in HIV/AIDS, Canada
1 Institute of Medical Virology, University of Zurich, Zurich, Switzerland; 2 Institute of Immunology and Genetics, Kaiserslautern, Germany; 3 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; 4 SIB Swiss Institute of Bioinformatics, Basel, Switzerland
Recent advances in sequencing technology have opened up unprecedented opportunities in many application areas. Virus samples can now be sequenced efficiently with very deep coverage to infer the genetic diversity of the underlying virus populations. Several sequencing platforms with different underlying technologies and performance characteristics are available for viral diversity studies. Here, we investigate how the differences between two common platforms provided by 454/Roche and Illumina affect viral diversity estimation and the reconstruction of viral haplotypes. Using a mixture of ten HIV clones sequenced with both platforms and additional simulation experiments, we assessed the trade-off between sequencing coverage, read length, and error rate. For fixed costs, short Illumina reads can be generated at higher coverage and allow for detecting variants at lower frequencies. They can also be sufficient to assess the diversity of the sample if sequences are dissimilar enough, but, in general, assembly of full-length haplotypes is feasible only with the longer 454/Roche reads. The quantitative comparison highlights the advantages and disadvantages of both platforms and provides guidance for the design of viral diversity studies.
Next-generation sequencing (NGS) is dramatically changing our
ability to analyze virus populations [1,2]. With NGS, many viral
genomes can be analyzed in parallel in a single sequencing
experiment [3], and by using deep coverage, even rare viral
variants can be detected in genetically heterogeneous populations.
Deep sequencing of intra-host virus populations is becoming an
important tool for studying viruses with a growing number of
applications [4], including, for example, drug resistance [5,6,7,8],
immune escape [9,10], and epidemiology [11,12].
Most NGS-based studies assess viral diversity at each sequence
position separately by inferring single-nucleotide variants (SNVs)
from the read data. SNV calling is complicated by errors that can
occur during sample preparation and sequencing, and statistical
tests have been developed to distinguish technical errors from true
biological SNVs [6,13,14,15]. Since all NGS technologies amplify
and read out individual DNA molecules [3], the co-occurrence of
mutations, or phasing, can also be assessed provided that they are
observed on the same read. By considering entire reads, rather
than individual SNVs, error correction can be significantly
improved, and the structure of the virus population, i.e., the set
of all viral haplotype sequences and their frequencies, can be
inferred over genomic regions as long as the average read length
[13,16]. The local haplotype inference problem is solved by
clustering overlapping reads such that each cluster corresponds to
one viral haplotype [17,18,19].
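
To illustrate the idea of separating technical errors from true SNVs, the following is a minimal sketch of a one-sided binomial test at a single position. It is not one of the tests cited above [6,13,14,15]; the function names, the assumed per-base error rate of 0.001, and the significance level of 0.01 are hypothetical choices made for illustration only.

```python
import math

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for deep coverage."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def is_snv(variant_reads, coverage, error_rate=0.001, alpha=0.01):
    """Call an SNV if the variant count is unlikely under errors alone.

    error_rate and alpha are assumed values, not taken from the study.
    """
    return binom_tail(variant_reads, coverage, error_rate) < alpha
```

Under these assumptions, 50 variant reads at a coverage of 10,000 would be called an SNV, whereas 10 variant reads would not, because about 10 erroneous reads are expected at that coverage. A site-wise test of this kind ignores the phasing information that read-based haplotype clustering exploits.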
In highly diverse virus populations, such as RNA or
single-stranded DNA viruses, mutations can be so frequent that they may
be phased even if they are not observed on the same read. This
global haplotype reconstruction problem becomes feasible if SNVs
can be connected by a series of partially overlapping reads. It can
be regarded as a sequence assembly problem from short reads,
with the goal of reconstructing a viral quasispecies, i.e., a set of
related sequences, rather than a single genome. Computational
methods for viral quasispecies assembly include combinatorial
optimization techniques [17,20,21,22,23] and generative
probabilistic models [24,25,26].
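
The condition that SNVs be connected by a series of partially overlapping reads can be sketched as a connectivity check. The example below is a simplification and not one of the cited assembly methods: it assumes error-free reads given as hypothetical inclusive (start, end) intervals, and it links two SNV positions whenever a single read covers both, so that linkage propagates transitively through chains of reads.

```python
def phasable_groups(snv_positions, reads):
    """Group SNV positions linked, directly or transitively, by shared reads.

    snv_positions: list of genomic positions carrying an SNV.
    reads: list of (start, end) inclusive alignment intervals (assumed input).
    """
    parent = {p: p for p in snv_positions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for start, end in reads:
        covered = [p for p in snv_positions if start <= p <= end]
        # All SNVs seen on the same read are phased with each other.
        for p in covered[1:]:
            parent[find(p)] = find(covered[0])

    groups = {}
    for p in snv_positions:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())
```

For example, with SNVs at positions 10, 40, 70, and 200 and reads spanning (0, 50), (30, 80), and (150, 250), the first three SNVs form one group and position 200 remains isolated. In this simplified picture, global reconstruction across the whole region would require a single group.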
SNV calling and local and global haplotype reconstruction
assess viral genetic diversity at different spatial scales, ranging from
single sites to the whole genome. Long-range haplotype
reconstructions are more informative than short-range inference,
because the linkage between mutations often has important
phenotypic consequences. On the other hand, the statistical power
to detect variation is highest for local haplotypes, and the
computational complexity of haplotype assembly increases with
the length of the genomic region. The optimal scale of diversity
estimation also depends on the employed NGS platform and the
read data it generates. Among other factors, NGS technologies
differ in the number of reads they produce per run, the read
length, the error pattern, and the cost per base [3]. However, it is
unknown how sequencing platforms compare across the different
viral diversity estimation tasks.
Here, we address this question and compare the two most
commonly used NGS platforms for viral diversity estimation,
namely 454/Roche pyrosequencing [27] and Illumina Genome
Analyzer [28]. Previously, both platforms have been shown to
exhibit similar mismatch error rates, while 454/Roche had an
increased indel error rate in homopolymeric regions [29]. Instead
of error profiles, we focus here on coverage and read length, two
critical parameters for viral diversity estimation. Whereas 454/
Roche produces longer reads, Illumina reaches higher coverage
per run at lower costs, suggesting more power to detect
low-frequency local variation with Illumina, but more power to
assemble global haplotypes with 454/Roche data. We investigate
this trade-off by analyzing a mixture of patient-derived viral clones
that has been sequenced on both platforms and by simulated
reads. We show how coverage, read length, and error rate jointly
affect the performance of local and global haplotype inference.
Our results provide guidance for the optimal choice of an NGS
platform in viral diversity studies.
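
The relationship between read count, read length, and the power to observe rare variants can be made concrete with a back-of-the-envelope calculation. The sketch below assumes uniformly distributed, error-free reads, which is an idealization; the function names and the default detection probability of 0.95 are hypothetical and not taken from the study.

```python
import math

def mean_coverage(n_reads, read_length, region_length):
    """Average number of reads covering each position of the region."""
    return n_reads * read_length / region_length

def min_coverage_to_observe(frequency, prob=0.95):
    """Smallest coverage c at which a variant of the given frequency appears
    on at least one read with probability >= prob: 1 - (1 - f)^c >= prob."""
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - frequency))
```

Under these assumptions, observing a variant at a frequency of 0.1% at least once with 95% probability requires a coverage of roughly 3,000. Distinguishing such a variant from sequencing errors requires considerably more reads, and the calculation says nothing about read length, which determines whether distant mutations can be phased.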
Experimental setup
Samples consisted of a mixture of PCR products from the gag/
pol region of HIV-1 (positions 2253 to 3497 of the HXB2 reference),
obtained from plasma isolates of 10 infected patients and cloned
into the pCRII-TOPO vector. The isolates were collected as part of
medically indicated HIV drug resistance tests and no additional
samples were drawn for the purpose of this study. After processing,
the PCR products had been routinely archived and were cloned
for the purpose of quality control. All samples had been
anonymized. The statutes of the ethics commission of the state of
Rhineland-Palatinate, Germany, do not include a requirement for
ethics approval of projects carried out as part of quality control.
The ten clones were mixed in different proportions,
with intended relative frequencies between 0.1% and 50%. An
aliquot of this sample was used as template in a PCR reaction, in
order to study the impact of (...truncated)