Correcting errors in shotgun sequences
Martti T. Tammi, Erik Arner, Ellen Kindlund and Björn Andersson
Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden
Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in shotgun sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair-wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.
Genome sequencing is important for the study and comparison
of organisms and has generated a wealth of new biological
knowledge. However, as more sequence is produced for
different organisms, an increasing number of
complex repeated regions are encountered. These regions
often contain important biological information (1), and it is
therefore important to be able to efficiently determine their
sequences.
The shotgun sequencing method is today the strategy of
choice for large scale genome sequencing projects. The
method is relatively cost effective and easy to automate and, in
addition, the redundant sequencing increases the accuracy of
the finished sequence. Problems in this approach are mainly
caused by the limited quality of primary sequence data and
the presence of repetitive sequences. Sequencing errors,
especially combined with repeats, often cause problems in
the sequence assembly step due to the inability of assemblers
to distinguish between sequencing errors and single base
differences between repeats. These problems make finishing a
time consuming task, if at all possible. Although several
eukaryote genomes have been published, none of them is
complete in all regions.
One way to simplify and improve shotgun fragment
assembly results is to correct sequencing errors. For example,
the EULER (2) and Arachne (3) assembly programs contain
integrated error correction steps. These are, however, not ideal,
and further improvements are needed.
A statistical method presented in a previous paper (4) was
developed in order to identify single base differences between
repeat copies for the purpose of correct assembly of repeats.
The differences between repeats are located by constructing
and analyzing multiple alignments consisting of all shotgun
reads in a dataset that may sample several repeat copies.
Detected differences occurring at a certain rate are labeled as
defined nucleotide positions, DNPs. In this paper, we present
an algorithm to correct sequencing errors in shotgun sequence
reads, using the DNP method. By performing error correction
prior to actual shotgun fragment assembly, the complexity of
the task can be reduced (2).
The DNP method uses multiple alignments consisting of a
read and all its overlaps with other reads. The construction of
multiple alignments is computationally demanding, while
large scale sequencing projects require reliable software that is
able to handle large numbers of sequences more and more
efficiently. For this reason, we also describe an algorithm for
finding overlaps between shotgun fragments that can be used
for rapid construction of multiple alignments. Previous
methods commonly use exact matches as seeds for finding
alignments, followed by dynamic programming, e.g. TRAP
(5), Phrap (http://www.phrap.org) and Arachne (3). These
methods are fast, but at the cost of reduced sensitivity. Our method uses
a novel algorithm for finding approximate matches that
allows for fast overlap detection while maintaining high
sensitivity. The multiple alignments are directly constructed
from q-grams, i.e. words of length q, that match a pattern with
a maximum number of substitutions.
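The notion of a q-gram matching a pattern with a bounded number of substitutions can be illustrated with a minimal sketch. It is a brute-force scan and is not the index-based algorithm described in this paper; the function names `hamming` and `approximate_qgram_hits` and the parameter `max_subs` are assumptions introduced only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_qgram_hits(pattern, text, max_subs):
    """Brute-force scan (illustrative only, not the paper's algorithm).

    Returns (position, substitutions) for every q-gram of `text`,
    with q = len(pattern), that differs from `pattern` by at most
    `max_subs` substitutions.
    """
    q = len(pattern)
    hits = []
    for i in range(len(text) - q + 1):
        d = hamming(pattern, text[i:i + q])
        if d <= max_subs:
            hits.append((i, d))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the paper.
    print(approximate_qgram_hits("ACGT", "ACGTACCTTTTT", max_subs=1))
```

This scan compares the pattern against every q-gram, which is the cost a faster matching algorithm is meant to avoid.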
These methods have been implemented in a prototype
program, MisEd, that is available from the authors at no cost
for academic and non-profit users. We also present a
comparison of the performance of MisEd with that of the error
correction algorithm used in the EULER assembler. The
results show that MisEd outperforms EULER.
MATERIALS AND METHODS
The DNP method presented in (4) discriminates sequencing
errors from real differences between repeat copies, making it a
suitable tool for error correction. This is achieved by
constructing multiple alignments that contain reads sampling
the same region in different repeat copies. The input to the
DNP method is an optimized multiple alignment consisting of
a read and all its overlaps with other reads in the dataset. This
includes true overlaps as well as apparent overlaps, i.e.
overlaps with reads from similar repeat copies.
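The idea of separating recurring differences from isolated ones in a multiple alignment can be sketched with a simple column count. This is a deliberately simplified illustration and not the statistical DNP method of (4); the function name `classify_columns`, the threshold `min_support` and the use of '.' for positions not covered by a read are all assumptions made for this example.

```python
from collections import Counter


def classify_columns(alignment, min_support=2):
    """Split variable alignment columns into two groups.

    alignment   : list of equal-length strings, one per read;
                  '.' marks positions a read does not cover (assumption).
    min_support : minimum count for a non-consensus base to be treated
                  as a recurring difference (hypothetical threshold).

    Returns (candidates, likely_errors) as lists of column indices.
    """
    candidates, likely_errors = [], []
    if not alignment:
        return candidates, likely_errors
    for col in range(len(alignment[0])):
        counts = Counter(row[col] for row in alignment if row[col] != ".")
        if len(counts) < 2:
            continue  # column is uniform or uncovered, nothing to classify
        minority = counts.most_common()[1:]  # everything but the top base
        if any(n >= min_support for _, n in minority):
            candidates.append(col)  # difference recurs in several reads
        else:
            likely_errors.append(col)  # isolated deviation
    return candidates, likely_errors


if __name__ == "__main__":
    # Hypothetical toy alignment, not data from the paper.
    reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACCTAC", "ACGTTC"]
    print(classify_columns(reads))
```

A column in which a deviating base recurs across several reads is more likely to reflect a difference between repeat copies, whereas a base seen only once is more likely a sequencing error; the published method replaces this fixed threshold with a statistical criterion.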
Our error correction method can be divided into the
following parts: trimming, construction of multiple
alignments and error correction. The sections below describe each
part in detail.
The purpose of the trimming step is to remove unusable ends
of sequences from long runs. It is desirable to trim the reads as
little as possible, since the purpose of the method is to correct
errors rather than trim them out of the dataset. A longer mean
read length leads to higher shotgun coverage in the subsequent
assembly step. Furthermore, the number of DNPs detected
drops significantly with stringent trimming conditions. In our
previous investigation, an increase in mean read quality from
95.7 to 97.4%, due to more stringent trimming, resulted in a
decrease in shotgun coverage of 22% while the number of
detected DNPs decreased by 29% (4).
The trimming is performed using Phred quality values (6).
A window of length lw is slid along the read from 3′ to 5′ and
from 5′ to 3′ with step size sw, until nw consecutive windows
with a mean error rate below a threshold emax have been found.
The starting position of the first window is marked as the
beginning, or end, of the analyzable sequence. A minimum
length of high quality region lminhq is required in order to keep
the read in the database.
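The trimming procedure above can be sketched as follows. This is a minimal illustration and not the MisEd implementation: the function names and the default parameter values are assumptions, while the parameter names mirror lw, sw, nw, emax and lminhq from the text. Phred quality values Q are converted to error probabilities with the standard relation p = 10^(-Q/10).

```python
def phred_to_error(q):
    """Standard Phred relation: error probability for quality value q."""
    return 10 ** (-q / 10)


def find_boundary(errors, lw, sw, nw, emax):
    """Start of the first run of nw consecutive windows of length lw
    (step sw) whose mean error rate is below emax, or None."""
    run_start, run_len, pos = None, 0, 0
    while pos + lw <= len(errors):
        mean_err = sum(errors[pos:pos + lw]) / lw
        if mean_err < emax:
            if run_len == 0:
                run_start = pos
            run_len += 1
            if run_len == nw:
                return run_start
        else:
            run_len = 0  # the run of good windows is broken
        pos += sw
    return None


def trim_read(quals, lw=10, sw=1, nw=3, emax=0.05, lminhq=20):
    """Return (start, end) of the analyzable region, end exclusive,
    or None if the read should be discarded. Defaults are illustrative."""
    errors = [phred_to_error(q) for q in quals]
    left = find_boundary(errors, lw, sw, nw, emax)
    # Scan the reversed read to locate the other end.
    right_rev = find_boundary(errors[::-1], lw, sw, nw, emax)
    if left is None or right_rev is None:
        return None
    right = len(errors) - right_rev
    if right - left < lminhq:
        return None  # high quality region too short to keep
    return left, right


if __name__ == "__main__":
    # Hypothetical read: poor ends around a high quality core.
    print(trim_read([5] * 10 + [40] * 50 + [5] * 10))
```

Because the boundary is placed at the start of the first passing window, a few low quality bases may remain at each end, which is consistent with the stated aim of trimming as little as possible.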
Construction of multiple alignments
The construction of multiple alignments consists of two steps:
construction of raw multiple alignments, and optimization.
Construction of raw multiple (...truncated)