Large Scale Loss of Data in Low-Diversity Illumina Sequencing Libraries Can Be Recovered by Deferred Cluster Calling
Osborne CS (2011) Large Scale Loss of Data in Low-Diversity Illumina Sequencing Libraries Can Be Recovered by Deferred
Cluster Calling. PLoS ONE 6(1): e16607. doi:10.1371/journal.pone.0016607
Felix Krueger
Simon R. Andrews
Cameron S. Osborne
Editor: Thomas Preiss, Victor Chang Cardiac Research Institute (VCCRI), Australia
1 Bioinformatics Group, The Babraham Institute, Cambridge, United Kingdom, 2 Laboratory of Chromatin and Gene Expression, The Babraham Institute, Cambridge, United Kingdom
Massively parallel DNA sequencing is capable of sequencing tens of millions of DNA fragments at the same time. However, sequence bias in the initial cycles, which are used to determine the coordinates of individual clusters, causes a loss of fidelity in cluster identification on Illumina Genome Analysers. This can result in a significant reduction in the numbers of clusters that can be analysed. Such low sample diversity is an intrinsic problem of sequencing libraries that are generated by restriction enzyme digestion, such as e4C-seq or reduced-representation libraries. Similarly, this problem can also arise through the combined sequencing of barcoded, multiplexed libraries. We describe a procedure to defer the mapping of cluster coordinates until low-diversity sequences have been passed. This simple procedure can recover substantial amounts of next generation sequencing data that would otherwise be lost.
Funding: This work was supported by funding provided by the BBSRC (UK). CSO is supported by a Bennett Research Fellowship from Leukaemia and Lymphoma
Research (http://www.bbsrc.ac.uk/ and http://www.beatbloodcancers.org/). The funders had no role in study design, data collection and analysis, decision to
publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Next generation sequencing provides unprecedented volumes of
data, and is now used routinely to assess global transcription
patterns (RNA-seq), chromatin modifications (ChIP-seq), and
nuclear architecture (3C-seq), among other applications. The
Illumina Genome Analyser IIx is one of a few widely used next
generation sequencing systems. It employs a solid-phase,
sequencing-by-synthesis method, where the DNA library, flanked by
adapter sequences, is seeded on a lawn of oligonucleotides
that coats the surface of the lanes on a flow cell. Each attached
DNA fragment undergoes multiple rounds of amplification to
create a cluster of identical DNA fragments. At each sequencing
cycle, a fluorescently-labelled base is incorporated into each
fragment in the cluster, and images of the flow cell surface are
captured [1]. Image analysis algorithms are applied during the first
few cycles to identify the positions of individual clusters (the first 4
cycles for SCSv2.5/GOATv1.5 and SCSv2.6/OLBv1.6, or the first 5
cycles for SCSv2.8/OLBv1.8). These positions are then monitored
through subsequent cycles to generate sequence data, so the ability
to read sequence from a lane successfully depends critically on
mapping the coordinates of the clusters correctly. Since the
commercialization of the platform, its output has increased to the
point that Illumina systems are
now capable of sequencing tens of millions of DNA fragments in
each of the eight lanes on a flow cell. This provides exceptional
depth of coverage; indeed, for organisms with small genomes,
and for certain sequencing applications, the coverage is well in
excess of what is required.
Given this potentially surplus depth of coverage, and that
sequencing costs still represent a significant expenditure, it is
attractive to have the capability to combine the sequencing of
multiple libraries in a single experimental lane. Such multiplexing
can be achieved by placing unique identifying bases, called a
barcode, within the adapter sequence of each individual library in
the mixture [2]. For multiplexing to be effective, data from
individual libraries need to be sorted during the data processing
stage. While Illumina market a multiplexing kit, a simpler
multiplexing strategy places the barcodes at the junction between
the adapter and DNA library. This permits the barcode and DNA
library to be sequenced in a single, continuous run. Barcoding in
this manner has been reported [2,3]. However, multiplexing in
this manner has two consequences. Firstly,
template read length is sacrificed in order to sequence the
barcode, although the read length can be extended if required.
Secondly, placement of barcodes at the junction between the
sequencing adapter and library will result in low sequence diversity
at the start of the resulting library.
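To make the sorting step concrete, the following is a minimal sketch of how reads from a junction-barcoded lane might be separated by library. It is not the procedure used in this work; the function name `demultiplex`, the barcode-to-library mapping, and the example reads are all hypothetical, and exact matching of a four-base prefix is assumed.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, barcode_len=4):
    """Sort reads into per-library bins by their leading barcode.

    reads    -- iterable of sequence strings (barcode followed by template)
    barcodes -- dict mapping barcode string -> library name (hypothetical)

    Returns a dict mapping library name -> list of template sequences with
    the barcode trimmed off. Reads whose prefix matches no known barcode
    are kept, untrimmed, under the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        prefix = read[:barcode_len]
        library = barcodes.get(prefix)
        if library is None:
            bins["unassigned"].append(read)
        else:
            # The barcode occupies the first cycles, so the usable
            # template is shorter than the full read by barcode_len bases.
            bins[library].append(read[barcode_len:])
    return dict(bins)


# Invented example data: two libraries plus one unreadable prefix.
barcodes = {"ACGT": "library_A", "TGCA": "library_B"}
reads = ["ACGTTTGACCA", "TGCAGGATTAC", "ACGTCCATGGA", "NNNNAAAAAAA"]
sorted_reads = demultiplex(reads, barcodes)
```

The trimming step illustrates the first consequence noted above: each read gives up `barcode_len` bases of template. Because every read in a library begins with the same prefix, the same example also hints at the second consequence, low diversity in the first cycles.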
Some next-generation sequencing applications introduce low
diversity in the initial bases of a library, such that they appear
similar to multiplexed libraries. For instance, libraries generated
for the analysis of genome-wide interactions (e.g. e4C-seq)
and for reduced representation bisulphite sequencing both rely upon
restriction enzyme digestion to fragment the library and
incorporate the sequencing adapters, leaving a partial restriction
enzyme recognition sequence present at the beginning of all
fragments within the library [4,5,6]. The impact of low diversity in
the initial bases of the library has not been reported.
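One way to quantify what "low diversity in the initial bases" means is to compute the base composition at each cycle and summarise it as a Shannon entropy. The sketch below is illustrative only: the function name `per_cycle_entropy` and the example reads, which share an invented three-base prefix standing in for a partial restriction site, are assumptions and not data from this study.

```python
import math
from collections import Counter


def per_cycle_entropy(reads, n_cycles):
    """Shannon entropy (in bits) of the base composition at each cycle.

    A value of 2.0 means all four bases are equally represented at that
    cycle; 0.0 means every read carries the same base.
    """
    entropies = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        total = sum(counts.values())
        entropy = sum(
            -(n / total) * math.log2(n / total) for n in counts.values()
        )
        entropies.append(entropy)
    return entropies


# Invented reads: a shared 3-base prefix, then one diverse position.
reads = ["CGGA", "CGGC", "CGGG", "CGGT"]
profile = per_cycle_entropy(reads, n_cycles=4)
```

In this toy library the first three cycles carry no information for distinguishing one cluster from another, while the fourth cycle is fully diverse. A profile of this kind, computed on real reads, would show where the biased region of a library ends.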
Here, we describe how the presence of a low-diversity mixture
of sequences during the cluster calling cycles interferes with the
mapping of cluster coordinates, and can result in a significant loss
of data. Both the degree of diversity in the initial sequences and the
cluster density on the flow cell influence the extent of data loss.
However, we find that by deferring the cluster coordinate mapping
until the sequencing cycles that immediately follow the initially
biased sequence, a maximal number of clusters can be identified.
Furthermore, these cluster coordinates can still be used to
determine the initially biased sequence. This simple, yet effective
approach can dramatically increase the volume of data returned
from libraries with a high degree of bias within the initial bases.
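The intuition behind deferral can be illustrated with a deliberately simplified toy model. This is not the Illumina image-analysis algorithm; it assumes only that two neighbouring clusters can be told apart if their bases differ in at least one of the cycles used for cluster calling. The function `resolved_neighbours` and the example reads are hypothetical.

```python
def resolved_neighbours(reads, first_cycle, n_cycles):
    """Count adjacent read pairs distinguishable within a cycle window.

    Toy assumption: `reads` are listed in the order their clusters sit
    along a line, and two neighbouring clusters can be separated only if
    their bases differ in at least one cycle of the window used for
    cluster calling (cycles first_cycle .. first_cycle + n_cycles - 1,
    counted from zero).
    """
    window = slice(first_cycle, first_cycle + n_cycles)
    return sum(
        1
        for left, right in zip(reads, reads[1:])
        if left[window] != right[window]
    )


# Invented reads: a shared four-base barcode followed by diverse template.
reads = ["ACGTTGCA", "ACGTGATC", "ACGTCCAG", "ACGTATGT"]

# Calling clusters on cycles 1-4 sees only the shared barcode.
standard = resolved_neighbours(reads, first_cycle=0, n_cycles=4)

# Deferring cluster calling to cycles 5-8 uses the diverse template.
deferred = resolved_neighbours(reads, first_cycle=4, n_cycles=4)
```

In this toy case none of the three neighbouring pairs is resolved from the barcode cycles, whereas all three are resolved once calling is deferred to the template cycles. Once the coordinates are fixed, the full read, including the initial biased bases, can still be associated with each cluster.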
Results and Discussion
We prepared Illumina sequencing libraries using
custom-designed adapters that place a unique, four-base barcode sequence
at the junction between the adapter and template. Thus the
barcodes are sequenced during the first four sequencing cycles,
immediately before the template. We combined equimolar
amounts of libraries with unique barcodes to load into the same
lane of a flow cell for sequencing. We noted that libraries that
contained a single barcode, or a mixture of two barcodes, yielded
significantly fewer sequences than libraries that contain an
unbiased initial sequence (Fig. 1a and Table 1). However,
analysis of a sequencing lane that contained four barcoded
libraries was not significantly different to (...truncated)