Read length versus Depth of Coverage for Viral Quasispecies Reconstruction

Article PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0047046&type=printable

Citation: Zagordi O, Däumer M, Beisel C, Beerenwinkel N. Read length versus Depth of Coverage for Viral Quasispecies Reconstruction.

Authors: Osvaldo Zagordi, Martin Däumer, Christian Beisel, Niko Beerenwinkel

Affiliations: 1 Institute of Medical Virology, University of Zurich, Zurich, Switzerland; 2 Institute of Immunology and Genetics, Kaiserslautern, Germany; 3 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; 4 SIB Swiss Institute of Bioinformatics, Basel, Switzerland

Editor: Art F. Y. Poon, British Columbia Centre for Excellence in HIV/AIDS, Canada

Abstract

Recent advancements of sequencing technology have opened up unprecedented opportunities in many application areas. Virus samples can now be sequenced efficiently with very deep coverage to infer the genetic diversity of the underlying virus populations. Several sequencing platforms with different underlying technologies and performance characteristics are available for viral diversity studies. Here, we investigate how the differences between two common platforms provided by 454/Roche and Illumina affect viral diversity estimation and the reconstruction of viral haplotypes. Using a mixture of ten HIV clones sequenced with both platforms and additional simulation experiments, we assessed the trade-off between sequencing coverage, read length, and error rate. For fixed costs, short Illumina reads can be generated at higher coverage and allow for detecting variants at lower frequencies. They can also be sufficient to assess the diversity of the sample if sequences are dissimilar enough, but, in general, assembly of full-length haplotypes is feasible only with the longer 454/Roche reads. The quantitative comparison highlights the advantages and disadvantages of both platforms and provides guidance for the design of viral diversity studies.

Introduction

Next-generation sequencing (NGS) is dramatically changing our ability to analyze virus populations [1,2].
With NGS, many viral genomes can be analyzed in parallel in a single sequencing experiment [3], and by using deep coverage, even rare viral variants can be detected in genetically heterogeneous populations. Deep sequencing of intra-host virus populations is becoming an important tool for studying viruses with a growing number of applications [4], including, for example, drug resistance [5,6,7,8], immune escape [9,10], and epidemiology [11,12].

Most NGS-based studies assess viral diversity at each sequence position separately by inferring single-nucleotide variants (SNVs) from the read data. SNV calling is complicated by errors that can occur during sample preparation and sequencing, and statistical tests have been developed to distinguish technical errors from true biological SNVs [6,13,14,15]. Since all NGS technologies amplify and read out individual DNA molecules [3], the co-occurrence of mutations, or phasing, can also be assessed provided that they are observed on the same read. By considering entire reads, rather than individual SNVs, error correction can be significantly improved, and the structure of the virus population, i.e., the set of all viral haplotype sequences and their frequencies, can be inferred over genomic regions as long as the average read length [13,16]. The local haplotype inference problem is solved by clustering overlapping reads such that each cluster corresponds to one viral haplotype [17,18,19].

In highly diverse virus populations, such as RNA or single-stranded DNA viruses, mutations can be so frequent that they may be phased even if they are not observed on the same read. This global haplotype reconstruction problem becomes feasible if SNVs can be connected by a series of partially overlapping reads. It can be regarded as a sequence assembly problem from short reads, with the goal of reconstructing a viral quasispecies, i.e., a set of related sequences, rather than a single genome.
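The condition that SNVs be "connected by a series of partially overlapping reads" can be illustrated with a small connectivity check. The sketch below is a deliberate simplification and not the method of any cited tool: it assumes that reads are given as half-open intervals on the reference, that two SNVs are directly linked whenever a single read covers both, and that sequencing errors are ignored. The function name `snvs_linked` and the example coordinates are hypothetical.

```python
from typing import List, Tuple


def snvs_linked(reads: List[Tuple[int, int]], snv_positions: List[int]) -> bool:
    """Return True if all SNV positions are connected through reads.

    Each read is a half-open interval [start, end) on the reference.
    Two SNVs are directly linked when one read covers both; linkage is
    then propagated transitively with a union-find structure.
    """
    parent = {pos: pos for pos in snv_positions}

    def find(pos: int) -> int:
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[pos] != pos:
            parent[pos] = parent[parent[pos]]
            pos = parent[pos]
        return pos

    for start, end in reads:
        covered = [p for p in snv_positions if start <= p < end]
        # Merge every SNV on this read into the set of the first one.
        for other in covered[1:]:
            parent[find(other)] = find(covered[0])

    # Linked means that at most one connected component remains.
    return len({find(p) for p in snv_positions}) <= 1


if __name__ == "__main__":
    snvs = [100, 350, 600]                           # hypothetical SNV positions
    long_reads = [(0, 400), (300, 700)]              # each read spans two SNVs
    short_reads = [(50, 150), (300, 400), (550, 650)]  # one SNV per read
    print(snvs_linked(long_reads, snvs))
    print(snvs_linked(short_reads, snvs))
```

In this toy setting the longer reads chain all three SNVs together, whereas the shorter reads each see a single SNV and leave them unphased, which is the intuition behind the read-length dependence of global reconstruction.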
Computational methods for viral quasispecies assembly include combinatorial optimization techniques [17,20,21,22,23] and generative probabilistic models [24,25,26].

SNV calling and local and global haplotype reconstruction assess viral genetic diversity at different spatial scales, ranging from single sites to the whole genome. Long-range haplotype reconstructions are more informative than short-range inference, because the linkage between mutations often has important phenotypic consequences. On the other hand, the statistical power to detect variation is highest for local haplotypes, and the computational complexity of haplotype assembly increases with the length of the genomic region. The optimal scale of diversity estimation also depends on the employed NGS platform and the read data it generates. Among other factors, NGS technologies differ in the number of reads they produce per run, the read length, the error pattern, and the cost per base [3]. However, it is unknown how sequencing platforms compare across the different viral diversity estimation tasks.

Here, we address this question and compare the two most commonly used NGS platforms for viral diversity estimation, namely 454/Roche pyrosequencing [27] and Illumina Genome Analyzer [28]. Previously, both platforms have been shown to exhibit similar mismatch error rates, while 454/Roche had an increased indel error rate in homopolymeric regions [29]. Instead of error profiles, we focus here on coverage and read length, two critical parameters for viral diversity estimation. Whereas 454/Roche produces longer reads, Illumina reaches higher coverage per run at lower costs, suggesting more power to detect low-frequency local variation with Illumina, but more power to assemble global haplotypes with 454/Roche data. We investigate this trade-off by analyzing a mixture of patient-derived viral clones that has been sequenced on both platforms and by simulated reads.
We show how coverage, read length, and error rate jointly affect the performance of local and global haplotype inference. Our results provide guidance for the optimal choice of an NGS platform in viral diversity studies.

Experimental setup

Samples consisted of a mixture of PCR products from the gag/pol region of HIV-1 (positions 2253 to 3497 of the HXB2 reference), obtained from plasma isolates of 10 infected patients and cloned into the pCRII-TOPO vector. The isolates were collected as part of medically indicated HIV drug resistance tests, and no additional samples were drawn for the purpose of this study. After processing, the PCR products had been routinely archived and were cloned for the purpose of quality control. All samples had been anonymized. The statutes of the ethics commission of the state of Rhineland-Palatinate, Germany, do not include a requirement for ethics approval of projects carried out as part of quality control. The ten clones were mixed in different proportions, with intended relative frequencies between 0.1 and 50%. An aliquot of this sample was used as template in a PCR reaction, in order to study the impact of (...truncated)