Large Scale Loss of Data in Low-Diversity Illumina Sequencing Libraries Can Be Recovered by Deferred Cluster Calling


Felix Krueger, Simon R. Andrews, Cameron S. Osborne

1 Bioinformatics Group, The Babraham Institute, Cambridge, United Kingdom, 2 Laboratory of Chromatin and Gene Expression, The Babraham Institute, Cambridge, United Kingdom

Citation: Krueger F, Andrews SR, Osborne CS (2011) Large Scale Loss of Data in Low-Diversity Illumina Sequencing Libraries Can Be Recovered by Deferred Cluster Calling. PLoS ONE 6(1): e16607. doi:10.1371/journal.pone.0016607

Editor: Thomas Preiss, Victor Chang Cardiac Research Institute (VCCRI), Australia

Massively parallel DNA sequencing is capable of sequencing tens of millions of DNA fragments at the same time. However, sequence bias in the initial cycles, which are used to determine the coordinates of individual clusters, causes a loss of fidelity in cluster identification on Illumina Genome Analysers. This can result in a significant reduction in the number of clusters that can be analysed. Such low sample diversity is an intrinsic problem of sequencing libraries that are generated by restriction enzyme digestion, such as e4C-seq or reduced-representation libraries. This problem can also arise through the combined sequencing of barcoded, multiplexed libraries. We describe a procedure to defer the mapping of cluster coordinates until low-diversity sequences have been passed. This simple procedure can recover substantial amounts of next-generation sequencing data that would otherwise be lost.

Funding: This work was supported by funding provided by the BBSRC (UK). CSO is supported by a Bennett Research Fellowship from Leukaemia and Lymphoma Research (http://www.bbsrc.ac.uk/ and http://www.beatbloodcancers.org/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Next-generation sequencing provides unprecedented volumes of data, and is now used routinely to assess global transcription patterns (RNA-seq), chromatin modifications (ChIP-seq), and nuclear architecture (3C-seq), among other applications. The Illumina Genome Analyser IIx is one of a few widely used next-generation sequencing systems. It employs a solid-phase, sequencing-by-synthesis method, in which the DNA library, flanked by adapter sequences, is seeded onto a lawn of oligonucleotides that coats the surface of the lanes on a flow cell. Each attached DNA fragment undergoes multiple rounds of amplification to create a cluster of identical DNA fragments. At each sequencing cycle, a fluorescently labelled base is incorporated into each fragment in the cluster, and images of the flow cell surface are captured [1]. Image analysis algorithms are applied during the first few cycles to identify the positions of individual clusters (the first 4 cycles for SCSv2.5/GOATv1.5 and SCSv2.6/OLBv1.6, or 5 cycles for SCSv2.8/OLBv1.8), which are then monitored through subsequent cycles to generate sequence data. Reading sequence from a lane successfully therefore depends critically on mapping the coordinates of the clusters correctly.

Since the technology was commercialized, the output of the sequencing systems has increased such that Illumina systems are now capable of sequencing tens of millions of DNA fragments in each of the eight lanes on a flow cell. This provides exceptional depth of coverage; indeed, for organisms with small genomes and for certain sequencing applications, it provides coverage well in excess of what is required. Given this potentially surplus depth of coverage, and that sequencing costs still represent a significant expenditure, it is attractive to be able to sequence multiple libraries together in a single experimental lane. 
Such multiplexing can be achieved by placing unique identifying bases, called a barcode, within the adapter sequence of each individual library in the mixture [2]. For multiplexing to be effective, data from individual libraries need to be sorted during the data processing stage. While Illumina market a multiplexing kit, a simpler multiplexing strategy places the barcodes at the junction between the adapter and DNA library. This permits the barcode and DNA library to be sequenced in a single, continuous run. Barcoding in this manner has been reported [2,3]. However, multiplexing in this manner has two consequences. Firstly, template read length is sacrificed in order to sequence the barcode, although the read length can be extended if required. Secondly, placement of barcodes at the junction between the sequencing adapter and library results in low sequence diversity at the start of the resulting library.

Some next-generation sequencing applications introduce low diversity in the initial bases of a library, such that they appear similar to multiplexed libraries. For instance, libraries generated for the analysis of genome-wide interactions (e.g. e4C-seq) and for reduced representation bisulphite sequencing both rely upon restriction enzyme digestion to fragment the library and incorporate the sequencing adapters, leaving a partial restriction enzyme recognition sequence at the beginning of all fragments within the library [4,5,6]. The impact of low diversity in the initial bases of the library has not been reported. Here, we describe how the presence of a low-diversity mixture of sequences during the cluster calling cycles interferes with the mapping of cluster coordinates, and can result in a significant loss of data. Both the degree of diversity in the initial sequences and the cluster density on the flow cell affect the extent of data loss. 
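The two ideas above, sorting reads by a leading barcode and the low base diversity that such barcodes create in the first cycles, can be illustrated on already-called reads. The following Python sketch is not part of the Illumina pipeline and is not the procedure used in this study. The function names, the synthetic reads, the use of Shannon entropy as the diversity measure, and the 1.5-bit threshold are all assumptions chosen for illustration. The four-cycle window mirrors the number of cluster calling cycles quoted above for SCSv2.5/GOATv1.5.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def demultiplex(reads, barcodes, barcode_len=4):
    """Bin reads by their leading barcode and strip it from assigned reads.

    Reads whose prefix matches no known barcode are kept whole under
    the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag = read[:barcode_len]
        if tag in barcodes:
            bins[tag].append(read[barcode_len:])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


def per_cycle_entropy(reads):
    """Shannon entropy (bits) of the base composition at each cycle.

    0 bits means every read carries the same base at that cycle;
    2 bits means A, C, G and T are equally frequent.
    """
    n_cycles = min(len(r) for r in reads)
    entropies = []
    for cycle in range(n_cycles):
        counts = Counter(r[cycle] for r in reads)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def first_diverse_window(entropies, window=4, min_bits=1.5):
    """0-based index of the first cycle that starts `window` consecutive
    cycles with entropy >= `min_bits`, or None if no such window exists.

    Both `window` and `min_bits` are illustrative assumptions.
    """
    for start in range(len(entropies) - window + 1):
        if all(h >= min_bits for h in entropies[start:start + window]):
            return start
    return None


# Synthetic lane (hypothetical data): two four-base barcodes, each followed
# by a template drawn from every possible 4-mer, so template cycles are
# perfectly balanced while barcode cycles see only two bases.
barcodes = ("ACGT", "TGCA")
templates = ["".join(t) for t in product("ACGT", repeat=4)]
reads = [barcodes[i % 2] + t for i, t in enumerate(templates)]

bins = demultiplex(reads, barcodes)
entropies = per_cycle_entropy(reads)

print({tag: len(seqs) for tag, seqs in bins.items()})
print([round(h, 2) for h in entropies])
print(first_diverse_window(entropies))
```

In this synthetic lane the four barcode cycles each carry 1 bit of diversity and the template cycles carry 2 bits, so the first window that passes the assumed threshold starts at 0-based cycle 4, immediately after the barcode. Real libraries would need a threshold chosen empirically.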
However, we find that by deferring the cluster coordinate mapping until the sequencing cycles that immediately follow the initially biased sequence, a maximal number of clusters can be identified. Furthermore, these cluster coordinates can still be used to determine the initially biased sequence. This simple yet effective approach can dramatically increase the volume of data returned from libraries with a high degree of bias within the initial bases.

Results and Discussion

We prepared Illumina sequencing libraries using custom-designed adapters that place a unique, four-base barcode sequence at the junction between the adapter and template. The barcodes are thus sequenced during the first four sequencing cycles, immediately before the template. We combined equimolar amounts of libraries with unique barcodes and loaded them into the same lane of a flow cell for sequencing. Compared to libraries that contain an unbiased initial sequence, we noted that libraries that contained a single barcode or a mixture of two barcodes yielded significantly fewer sequences (Fig. 1a and Table 1). However, analysis of a sequencing lane that contained four barcoded libraries was not significantly different to (...truncated)