Strategies for Achieving High Sequencing Accuracy for Low Diversity Samples and Avoiding Sample Bleeding Using Illumina Platform

RESEARCH ARTICLE

Abhishek Mitra1,2, Magdalena Skrzypczak3, Krzysztof Ginalski3*, Maga Rowicka1,2,4*

1 Department of Biochemistry and Molecular Biology, University of Texas Medical Branch at Galveston, 301 University Blvd, Galveston, TX, 77555, USA
2 Institute for Translational Sciences, University of Texas Medical Branch at Galveston, 301 University Blvd, Galveston, TX, 77555, USA
3 Laboratory of Bioinformatics and Systems Biology, Centre of New Technologies, University of Warsaw, Zwirki i Wigury 93, 02-089 Warsaw, Poland
4 Sealy Center for Molecular Medicine, University of Texas Medical Branch at Galveston, 301 University Blvd, Galveston, TX, 77555, USA

* (KG), (MR)

OPEN ACCESS

Citation: Mitra A, Skrzypczak M, Ginalski K, Rowicka M (2015) Strategies for Achieving High Sequencing Accuracy for Low Diversity Samples and Avoiding Sample Bleeding Using Illumina Platform. PLoS ONE 10(4): e0120520. doi:10.1371/journal.pone.0120520

Academic Editor: Cees Oudejans, VU University Medical Center, NETHERLANDS

Received: April 25, 2014
Accepted: February 5, 2015
Published: April 10, 2015

Copyright: © 2015 Mitra et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: Data files are available from Short Read Archive, accession numbers SRX231413 and SRX876146.

Funding: This study was supported by grant UL1 TR000071 from the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (M. R. and A. M.) and by grants from the Foundation for Polish Science (TEAM), the National Science Centre (2011/02/A/NZ2/00014) and the European Regional Development Fund under Innovative Economy Programme (POIG.02.02.00-14024/08-00) (K. G. and M. S.) and National Institutes of Health grant R01GM112131 (M. R.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Abstract

MicroRNA sequencing, reduced representation sequencing, Hi-C technology and any method requiring the use of in-house barcodes result in sequencing libraries with low initial sequence diversity. Sequencing such libraries on the Illumina platform typically produces low quality data due to the limitations of the Illumina cluster calling algorithm. Moreover, even in the case of diverse samples, these limitations cause substantial inaccuracies in multiplexed sample assignment (sample bleeding). Such inaccuracies are unacceptable in clinical applications, and in some other fields (e.g. detection of rare variants). Here, we discuss how both problems, the low quality of data from low diversity samples and sample bleeding, are caused by incorrect detection of clusters on the flowcell during initial sequencing cycles. We propose simple software modifications (Long Template Protocol) that overcome this problem. We present experimental results showing that our Long Template Protocol remarkably increases data quality for low diversity samples, as compared with the standard analysis protocol; it also substantially reduces sample bleeding for all samples. For comprehensiveness, we also discuss and compare experimental results from alternative approaches to sequencing low diversity samples. First, we discuss how the low diversity problem, if caused by barcodes, can be avoided altogether at the barcode design stage. Second and third, we present modified guidelines, which are more stringent than the manufacturer’s, for mixing low diversity samples with diverse samples and lowering cluster density, which in our experience consistently produce high quality data from low diversity samples. Fourth and fifth, we present rescue strategies that can be applied when sequencing results in low quality data and when there is no more biological material available. In such cases, we propose that the flowcell be re-hybridized and sequenced again using our Long Template Protocol. Alternatively, we discuss how analysis can be repeated from saved sequencing images using the Long Template Protocol to increase accuracy.

Introduction

Next generation sequencing technology is rapidly developing and has become one of the most popular and crucial techniques used today to answer key biomedical questions. The currently dominant sequencing platform is Illumina, used for ~85% of samples deposited in NCBI’s Sequence Read Archive (SRA) in 2013 [1]. Many important applications of next generation sequencing (e.g. reduced representation sequencing [2], microRNA sequencing [3], Hi-C technologies [4] and any technique employing custom barcodes, e.g. [5]) result in sequencing libraries with low sequence diversity in the initial bases of the sequenced reads. The standard Illumina data analysis protocol uses only images corresponding to the first four positions in the reads to determine the coordinates of different clusters on the flowcell, which is a key step in sequencing image analysis.
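To make the notion of low initial sequence diversity concrete, the following minimal Python sketch tallies the base composition at each of the first four read positions, the cycles that the standard protocol uses to locate clusters. The function names and the 0.8 dominance threshold are illustrative assumptions; they are not part of the Illumina software or of the Long Template Protocol.

```python
from collections import Counter


def per_cycle_base_fractions(reads, n_cycles=4):
    """For each of the first n_cycles positions, return the fraction of
    reads carrying A, C, G and T at that position."""
    fractions = []
    for cycle in range(n_cycles):
        # Count the base observed at this cycle across all long-enough reads.
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts.values())
        if total == 0:
            fractions.append({})
        else:
            fractions.append({b: counts.get(b, 0) / total for b in "ACGT"})
    return fractions


def is_low_diversity(reads, n_cycles=4, max_fraction=0.8):
    """Flag a library when one base dominates any of the first n_cycles
    positions. The 0.8 threshold is an arbitrary, illustrative choice."""
    return any(
        max(frac.values(), default=0.0) > max_fraction
        for frac in per_cycle_base_fractions(reads, n_cycles)
    )


# Hypothetical reads: every read starts with the same in-house barcode "ACGT".
barcoded = ["ACGTTTGA", "ACGTCCAT", "ACGTGGGA", "ACGTAACC"]
# Hypothetical reads with balanced base composition in the first four cycles.
diverse = ["ACGTA", "CGTAC", "GTACG", "TACGA"]

print(is_low_diversity(barcoded))  # barcoded library is flagged
print(is_low_diversity(diverse))   # balanced library is not flagged
```

In the barcoded example every read shows the same base at each of the first four cycles, so neighbouring clusters emit identical signals in exactly the images used for cluster detection.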
Therefore, sequencing libraries with low sequence diversity in the initial four positions lead to sequencing images that pose a considerable challenge to the image recognition algorithm and usually result in low quality data when using the Illumina platform. Moreover, the same software issue that lowers the quality of data originating from samples with low initial sequence diversity is also a major source of sequencing errors in normal samples. Since the software design creates the problem, we maintain that the most appropriate and logical way to correct it is by modifying the software itself. Therefore, we developed an approach, the Long Template Protocol, that solves this problem for the most popular Illumina HiSeq 2000 and HiSeq 2500 platforms. Computational solutions to this problem were proposed for the previous Illumina sequencer, the Genome Analyzer II [6, 7]. Unfortunately, for reasons discussed below, these methods are not feasible when sequencing on HiSeq. Our solution is to use images corresponding to more than the first four nucleotides to distinguish between clusters of nearby reads on the flowcell. This approach not only very substantially increases the quality of data originating from samples with low initial sequence diversity but also improves sequencing accuracy of normal diversity samples and reduces “sample bleeding” (incorrect assignment of the multiplexed samples), and is thus of general interest. For comprehensiveness, we not only present our software modification strategy to improve quality of the Illumina sequencing data (Long Template Protocol), but we also discuss preventive (...truncated)