Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems (pdf)

http://genomebiology.com/content/pdf/gb-2011-12-11-r112.pdf

Minoche et al. Genome Biology

André E Minoche (0, 1), Juliane C Dohm (0, 1), Heinz Himmelbauer (0)

(0) Centre for Genomic Regulation (CRG) and UPF, C. Dr. Aiguader 88, 08003 Barcelona, Spain
(1) Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany

Background: The generation and analysis of high-throughput sequencing data are becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases.

Results: We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range.

Conclusions: The errors and biases we report have implications for the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts.
Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low-abundance polymorphisms.

Background

Next generation sequencing (NGS) is revolutionizing molecular biology research with a wide and rapidly growing range of applications. These applications include de novo genome sequencing, re-sequencing, detection and profiling of coding and non-coding transcripts, identification of sequence variants, epigenetic profiling, and interaction mapping. Compared with microarrays, previously used for many of these applications, NGS offers a higher dynamic range, enabling the detection of rare transcripts and splice variants in the transcriptome as well as rare genomic polymorphisms, for example somatic mutations present within cancer samples. The challenge remains to distinguish sequence variation from sequencing errors, and a thorough characterization of NGS data is required in order to detect method-inherent errors and biases.

Systematic errors are platform-dependent. In the context of this work, we focus on Illumina data. According to market share analysis, almost two thirds of all NGS instruments presently in operation have been manufactured by Illumina [1]. Existing studies about Illumina data evaluation have revealed several biases, that is, a non-random distribution of the reads in the sequenced sample over the reference (reported for the Genome Analyzer (GA) I [2-5]) and a non-random distribution of errors (GAIIx [6]). Preferences for certain substitution errors and sequence contexts have been observed. For instance, wrong base calls are frequently preceded by base G [2], and frequencies of base substitutions vary by 10- to 11-fold, with A to C conversions being the most frequent error [2,7].
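
The kind of tally behind such observations (substitution type and preceding base) can be sketched as follows. This is a minimal illustration, not the method used in the cited studies: the function names `tally_substitutions` and `fold_range` are hypothetical, the input is assumed to be ungapped read/reference pairs of equal length in read orientation, and "A to C" is taken to mean reference base A called as C.

```python
from collections import Counter


def tally_substitutions(pairs):
    """Count substitution types and the reference base preceding each mismatch.

    `pairs` is an iterable of (read, reference) strings of equal length,
    i.e. ungapped alignments in read orientation (an assumption of this sketch).
    """
    substitutions = Counter()  # (reference base, called base) -> count
    preceding = Counter()      # reference base just before the error -> count
    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference must have equal length")
        read, ref = read.upper(), ref.upper()
        for i, (called, expected) in enumerate(zip(read, ref)):
            # Skip matches and ambiguous bases.
            if called == expected or "N" in (called, expected):
                continue
            substitutions[(expected, called)] += 1
            if i > 0:
                preceding[ref[i - 1]] += 1
    return substitutions, preceding


def fold_range(substitutions):
    """Ratio between the most and least frequent observed substitution types."""
    if not substitutions:
        raise ValueError("no substitutions observed")
    counts = substitutions.values()
    return max(counts) / min(counts)


if __name__ == "__main__":
    toy_pairs = [("ACGCTA", "ACGATA"), ("GGCC", "GGAC"), ("TTGA", "TTGC")]
    subs, prev = tally_substitutions(toy_pairs)
    print(subs.most_common())   # substitution types, most frequent first
    print(prev.most_common())   # bases preceding the errors
    print(fold_range(subs))     # spread between most and least frequent type
```

On real data the pairs would come from an aligner's output, and indels would need separate handling, which this sketch deliberately leaves out.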
Such errors might have profound implications for the interpretation of results: a non-random read distribution can bias profiling of transcripts and hamper the detection of sequence polymorphisms in regions of low sequence coverage. Errors in the reads can result in false positive variant calls or wrong consensus sequences.

The Illumina sequencing technology has been under constant development, relating to instrumentation, signal processing software, and sequencing chemistry, towards the production of more data and longer sequencing reads. The HiSeq2000 became commercially available in the second quarter of 2010 and uses sequencing-by-synthesis (SBS) chemistry similar to the Illumina GA series but at a two- to five-fold increased rate of data acquisition. A HiSeq flow cell can be imaged on both the top and bottom surface. To increase the HiSeq data collection rate, imaging is performed in a line scanning mode, in contrast to the area imaging in the GA. Instead of using only one camera, the HiSeq operates with a four-camera system that detects the intensities of all four bases simultaneously. The HiSeq currently runs with lower cluster densities than the GA and with a maximal read length of 100 nucleotides for single reads or 2 × 100 nucleotides in paired-end mode.

Every development of a system can shift error profiles and can reveal new types of errors. Here, we evaluate Illumina sequencing data generated on the latest systems, the GAIIx and HiSeq2000, using current sequencing chemistry and up-to-date base-calling software. We focus on errors and biases that have an impact on common sequencing applications and we provide suggestions on how to trim and filter the reads in order to substantially reduce error rates. Since high quality reference sequences are not always available in a sequencing project, we first report properties of the unprocessed raw reads.
Then we assess the error rates and biases after mapping against high quality reference sequences derived from two plants (Beta vulgaris and Arabidopsis thaliana) and the bacterial virus PhiX174.

Results

We generated genomic paired-end reads of 2 × 95 nucleotides and 2 × 100 nucleotides on an Illumina HiSeq2000 sequencing machine and of 2 × 150 nucleotides on an Illumina GAIIx instrument (Table 1). Three HiSeq flowcell lanes of 2 × 95-nucleotide reads resulted in 246 million read pairs corresponding to 46.8 billion bases of sequence data. These data were a mix of genomic reads of B. vulgaris (Bv, 99%) and the bacteriophage PhiX174 (PhiX, 1%) spiked in as standard quality control. One HiSeq flowcell lane of 2 × 100-nucleotide read pairs containing 99% genomic DNA of A. thaliana (At) and 1% PhiX resulted in 71 million read pairs corresponding to 14.3 billion sequenced bases. One lane containing PhiX only was sequenced on a GAIIx and yielded 9 million read pairs of length 2 × 150 nucleotides (2.7 billion bases).

Properties of raw reads and filtering criteria

As a first quality evaluation we analyzed the raw read sequences and their corresponding quality values assigned by the base-calling software. The Illumina base-calling software calculates a quality score for each base reflecting the probability that the called base is wrong. The calculation takes into account the ambiguity of the signal for the respective base as well as the quality of neighboring bas (...truncated)