Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems
Minoche et al. Genome Biology
André E Minoche 0 1
Juliane C Dohm 0 1
Heinz Himmelbauer 0
0 Centre for Genomic Regulation (CRG) and UPF , C. Dr. Aiguader 88, 08003 Barcelona , Spain
1 Max Planck Institute for Molecular Genetics , Ihnestr. 63-73, 14195 Berlin , Germany
Background: The generation and analysis of high-throughput sequencing data are becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases.
Results: We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range.
Conclusions: The errors and biases we report have implications for the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms.
Background
Next generation sequencing (NGS) is revolutionizing
molecular biology research with a wide and rapidly
growing range of applications. These applications
include de novo genome sequencing, re-sequencing,
detection and profiling of coding and non-coding
transcripts, identification of sequence variants, epigenetic
profiling, and interaction mapping. Compared with
microarrays, previously used for many of these
applications, NGS offers a higher dynamic range, enabling the
detection of rare transcripts and splice variants in the
transcriptome as well as rare genomic polymorphisms,
for example, somatic mutations present within cancer
samples. The challenge remains to distinguish sequence
variation from sequencing errors, and a thorough
characterization of NGS data is required in order to detect
method-inherent errors and biases. Systematic errors are
platform-dependent. In the context of this work, we
focus on Illumina data. According to market share
analysis, almost two thirds of all NGS instruments presently
in operation have been manufactured by Illumina [1].
Existing studies about Illumina data evaluation have
revealed several biases, that is, a non-random
distribution of the reads in the sequenced sample over the
reference (reported for the Genome Analyzer (GA) I
[2-5]) and a non-random distribution of errors (GAIIx
[6]). Preferences for certain substitution errors and
sequence contexts have been observed. For instance,
wrong base calls are frequently preceded by base G [2],
and frequencies of base substitutions vary by 10- to
11-fold, with A to C conversions being the most frequent
error [2,7]. Such errors and biases might have profound
implications for the interpretation of results: a
non-random read distribution can bias profiling of transcripts
and hamper the detection of sequence polymorphisms
in regions of low sequence coverage. Errors in the reads
can result in false positive variant calls or wrong
consensus sequences.
The Illumina sequencing technology has been under
constant development, relating to instrumentation,
signal processing software, and sequencing chemistry,
towards the production of more data and longer
sequencing reads. The HiSeq2000 became commercially
available in the second quarter of 2010 and uses
sequencing-by-synthesis (SBS) chemistry similar to the Illumina
GA series but at a two- to five-fold increased rate of
data acquisition. A HiSeq flow cell can be imaged on
both the top and bottom surface. To increase the HiSeq
data collection rate, imaging is performed in a line
scanning mode, in contrast to the area imaging in the GA.
Instead of using only one camera, the HiSeq operates
with a four camera system that detects the intensities of
all four bases simultaneously. The HiSeq currently runs
with lower cluster densities than the GA and with a
maximal read length of 100 nucleotides for single reads
or 2 × 100 nucleotides in paired-end mode.
Every development of a system can shift error profiles
and can reveal new types of errors. Here, we evaluate
Illumina sequencing data generated on the latest
systems, the GAIIx and HiSeq2000, using current
sequencing chemistry and up-to-date base-calling software. We
focus on errors and biases that have an impact on
common sequencing applications and we provide
suggestions on how to trim and filter the reads in order to
substantially reduce error rates. Since high quality
reference sequences are not always available in a sequencing
project, we first report properties of the unprocessed
raw reads. Then we assess the error rates and biases
after mapping against high quality reference sequences
derived from two plants (Beta vulgaris and Arabidopsis
thaliana) and the bacterial virus PhiX174.
Results
We generated genomic paired-end reads of 2 × 95
nucleotides and 2 × 100 nucleotides on an Illumina
HiSeq2000 sequencing machine and of 2 × 150
nucleotides on an Illumina GAIIx instrument (Table 1). Three
HiSeq flowcell lanes of 2 × 95-nucleotide reads resulted
in 246 million read pairs corresponding to 46.8 billion
bases of sequence data. These data were a mix of
genomic reads of B. vulgaris (Bv, 99%) and the bacteriophage
PhiX174 (PhiX, 1%) spiked in as standard quality
control. One HiSeq flowcell lane of 2 × 100-nucleotide read
pairs containing 99% genomic DNA of A. thaliana (At)
and 1% PhiX resulted in 71 million read pairs
corresponding to 14.3 billion sequenced bases. One lane
containing PhiX only was sequenced on a GAIIx and
yielded 9 million read pairs of length 2 × 150
nucleotides (2.7 billion bases).
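The reported data volumes follow from the number of read pairs, the two reads per pair, and the read length. The short sketch below reproduces this arithmetic as a consistency check. The function and dictionary names are illustrative and not part of the original analysis, and the read-pair counts are the rounded values quoted above, so the computed totals can differ slightly from the reported ones.

```python
# Consistency check: bases = read pairs x 2 reads per pair x read length.
# Read-pair counts are the rounded figures from the text (assumption:
# the exact counts are not given here), so totals are approximate.

def sequenced_bases(read_pairs: int, read_length: int) -> int:
    """Total number of bases in a paired-end data set."""
    return read_pairs * 2 * read_length

# (rounded read pairs, read length in nucleotides)
datasets = {
    "HiSeq Bv + PhiX, 3 lanes, 95 nt": (246_000_000, 95),
    "HiSeq At + PhiX, 1 lane, 100 nt": (71_000_000, 100),
    "GAIIx PhiX, 1 lane, 150 nt": (9_000_000, 150),
}

def yields_in_billion_bases(data: dict) -> dict:
    """Map each data set name to its yield in billions of bases."""
    return {
        name: sequenced_bases(pairs, length) / 1e9
        for name, (pairs, length) in data.items()
    }

for name, billions in yields_in_billion_bases(datasets).items():
    print(f"{name}: {billions:.2f} billion bases")
```

With the rounded inputs this gives about 46.74, 14.20, and 2.70 billion bases, close to the reported 46.8, 14.3, and 2.7 billion. The small differences are what rounding the read-pair counts to the nearest million would be expected to produce.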
Properties of raw reads and filtering criteria
As a first quality evaluation we analyzed the raw read
sequences and their corresponding quality values
assigned by the base-calling software. The Illumina
base-calling software calculates a quality score for each
base reflecting the probability that the called base is
wrong. The calculation takes into account the ambiguity
of the signal for the respective base as well as the quality
of neighboring bas (...truncated)