Correcting errors in shotgun sequences

Nucleic Acids Research, Aug 2003

Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in shotgun sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair‐wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.

Martti T. Tammi, Erik Arner, Ellen Kindlund, Björn Andersson

Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden

Genome sequencing is important for the study and comparison of organisms and has generated a wealth of new biological knowledge. However, as more sequence is continuously produced for different organisms, increasing amounts of complex repeated regions are encountered. These regions often contain important biological information (1), and it is therefore important to be able to efficiently determine their sequences.

The shotgun sequencing method is today the strategy of choice for large scale genome sequencing projects.
The method is relatively cost effective and easy to automate and, in addition, the redundant sequencing increases the accuracy of the finished sequence. Problems in this approach are mainly caused by the limited quality of primary sequence data and the presence of repetitive sequences. Sequencing errors, especially combined with repeats, often cause problems in the sequence assembly step because assemblers cannot distinguish between sequencing errors and single base differences between repeats. These problems make finishing a time-consuming task, if it is possible at all. Although several eukaryote genomes have been published, none of them is complete in all regions.

One way to simplify and improve shotgun fragment assembly results is to correct sequencing errors. For example, the EULER (2) and Arachne (3) assembly programs contain integrated error correction steps. These are, however, not ideal and further improvements are needed.

A statistical method presented in a previous paper (4) was developed in order to identify single base differences between repeat copies for the purpose of correct assembly of repeats. The differences between repeats are located by constructing and analyzing multiple alignments consisting of all shotgun reads in a dataset that may sample several repeat copies. Detected differences occurring at a certain rate are labeled as defined nucleotide positions, DNPs.

In this paper, we present an algorithm to correct sequencing errors in shotgun sequence reads, using the DNP method. By performing error correction prior to actual shotgun fragment assembly, the complexity of the task can be reduced (2). The DNP method uses multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is computationally demanding, while large scale sequencing projects require reliable software that is able to handle large numbers of sequences more and more efficiently.
For this reason, we also describe an algorithm for finding overlaps between shotgun fragments that can be used for rapid construction of multiple alignments. Previous methods commonly use exact matches as seeds for finding alignments, followed by dynamic programming, e.g. TRAP (5), Phrap (http://www.phrap.org) and ARACHNE (3). These methods are fast, but at the cost of low sensitivity. Our method uses a novel algorithm for finding approximate matches that allows for fast overlap detection while maintaining high sensitivity. The multiple alignments are directly constructed from q-grams, i.e. words of length q, that match a pattern with a maximum number of substitutions.

These methods have been implemented in a prototype program, MisEd, that is available from the authors at no cost for academic and non-profit users. We also present the results of a comparison of the performance of MisEd with that of the error correction algorithm used in the EULER assembler. The results show that MisEd outperforms EULER.

MATERIALS AND METHODS

The DNP method presented in (4) discriminates sequencing errors from real differences between repeat copies, making it a suitable tool for error correction. This is achieved by constructing multiple alignments that contain reads sampling the same region in different repeat copies. The input to the DNP method is an optimized multiple alignment consisting of a read and all its overlaps with other reads in the dataset. This includes true overlaps as well as apparent overlaps, i.e. overlaps with reads from similar repeat copies.

Our error correction method can be divided into the following parts: trimming, construction of multiple alignments and error correction. The sections below describe each part in detail.

The purpose of the trimming step is to remove unusable ends of sequences from long runs. It is desirable to trim the reads as little as possible, since the purpose of the method is to correct errors rather than trim them out of the dataset.
A longer mean read length leads to higher shotgun coverage in the subsequent assembly step. Furthermore, the number of DNPs detected drops significantly with stringent trimming conditions. In our previous investigation, an increase in mean read quality from 95.7 to 97.4%, due to more stringent trimming, resulted in a decrease in shotgun coverage of 22% while the number of detected DNPs decreased by 29% (4).

The trimming is performed using Phred quality values (6). A window of length l_w is slid along the read from 3′ to 5′ and from 5′ to 3′ with step size s_w, until n_w consecutive windows with a mean error rate below a threshold e_max have been found. The starting position of the first window is marked as the beginning, or end, of the analyzable sequence. A minimum length of high quality region, l_minhq, is required in order to keep the read in the database.

Construction of multiple alignments

The construction of multiple alignments consists of two steps: construction of raw multiple alignments, and optimization.

Construction of raw multiple (...truncated)
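
The window-based trimming procedure from the trimming step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the MisEd implementation: the function names (phred_to_error, first_good_window, trim_read) are hypothetical, the parameters l_w, s_w, n_w, e_max and l_minhq mirror the window length, step size, window count, error threshold and minimum high quality length named in the text, and scanning the reversed read is only one plausible reading of sliding the window from both ends. The conversion p = 10^(-Q/10) is the standard Phred definition.

```python
from typing import List, Optional, Tuple


def phred_to_error(q: float) -> float:
    """Standard Phred definition: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def first_good_window(errors: List[float], l_w: int, s_w: int,
                      n_w: int, e_max: float) -> Optional[int]:
    """Scan left to right; return the start of the first window in a run of
    n_w consecutive windows whose mean error rate is below e_max."""
    run = 0
    first = None
    for start in range(0, len(errors) - l_w + 1, s_w):
        mean_error = sum(errors[start:start + l_w]) / l_w
        if mean_error < e_max:
            if run == 0:
                first = start  # candidate boundary of the usable region
            run += 1
            if run == n_w:
                return first
        else:
            run = 0  # the run of good windows was interrupted
    return None


def trim_read(quals: List[float], l_w: int, s_w: int, n_w: int,
              e_max: float, l_minhq: int) -> Optional[Tuple[int, int]]:
    """Return (begin, end) of the analyzable region as a half-open interval,
    or None if the read should be dropped."""
    errors = [phred_to_error(q) for q in quals]
    begin = first_good_window(errors, l_w, s_w, n_w, e_max)
    # Assumption: the scan from the other end is done on the reversed read.
    rev = first_good_window(errors[::-1], l_w, s_w, n_w, e_max)
    if begin is None or rev is None:
        return None
    end = len(errors) - rev  # map the reversed start back to an exclusive end
    if end - begin < l_minhq:
        return None  # high quality region too short to keep the read
    return begin, end
```

For example, with arbitrary illustrative values, a read with 20 low quality bases (Q5) on each side of 100 high quality bases (Q40) is trimmed by trim_read([5] * 20 + [40] * 100 + [5] * 20, l_w=10, s_w=1, n_w=3, e_max=0.05, l_minhq=50) to the interval (19, 121). One low quality base survives at each end because a window containing a single Q5 base still has a mean error rate below the threshold, which is consistent with the stated aim of trimming as little as possible.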


This is a preview of a remote PDF: https://nar.oxfordjournals.org/content/31/15/4663.full.pdf
Article home page: http://nar.oxfordjournals.org/content/31/15/4663.abstract

Martti T. Tammi, Erik Arner, Ellen Kindlund, Björn Andersson. Correcting errors in shotgun sequences. Nucleic Acids Research, 2003, 31(15), pp. 4663-4672. DOI: 10.1093/nar/gkg653