Efficient combination of multiple word models for improved sequence comparison
Xiaoqiu Huang (2), Liang Ye (2), Hui-Hsien Chou (1,2), I-Hsuan Yang (0) and Kun-Mao Chao (0)
(0) Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
(1) Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-1040, USA
(2) Department of Computer Science
Motivation: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes.
Results: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model.
Availability: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.
INTRODUCTION
A number of fast comparison programs have been developed
for analysis of genomic DNA sequences (Pearson and Lipman,
1988; Altschul et al., 1990, 1997; Gish, unpublished data;
Huang et al., 1997; Delcher et al., 1999; Burkhardt et al.,
1999; Kurtz and Schleiermacher, 1999; Zhang et al., 2000;
Ning et al., 2001; Kent, 2002; Ma et al., 2002; Schwartz
et al., 2003). The BLASTN program (Altschul et al., 1990)
is widely used for finding homologous similarities between
DNA sequences. It computes high-scoring segment pairs
(HSPs) between sequences by locating exact word matches
of certain length between sequences and extending each word
match into an HSP. The PatternHunter program (Ma et al.,
2002) improves on the sensitivity of BLASTN by allowing base
differences in word matches. Word matches are defined with
respect to a word model. A word model of length k is specified
by a binary string of k bits. A position at which the model has
a 1 bit is called a checked position. The number of checked
positions in the model is the weight of the model. Two words
of length k form a word match under a word model if bases
at every checked position are identical. For example, the two
words ACGTC and ATGAC form a word match under the
word model 10101, which has length 5 and weight 3. Note that an
exact word match of length k is a word match under a word
model of length k and weight k, which is called a consecutive
word model of length k. The BLASTZ program (Schwartz
et al., 2003) takes the idea of PatternHunter further by
allowing a transition (A-G, G-A, C-T or T-C) at any one of the
checked positions.
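The definitions above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration; the second function is one reading of the BLASTZ relaxation described in the text, not code from any of the cited programs.

```python
# Transitions as ordered base pairs (A-G, G-A, C-T, T-C).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def weight(model: str) -> int:
    """Number of checked (1) positions in a word model."""
    return model.count("1")


def is_word_match(x: str, y: str, model: str) -> bool:
    """True if words x and y have identical bases at every checked position."""
    if not (len(x) == len(y) == len(model)):
        raise ValueError("words and model must have the same length")
    return all(a == b for a, b, m in zip(x, y, model) if m == "1")


def is_word_match_one_transition(x: str, y: str, model: str) -> bool:
    """Relaxed match: at most one checked position may differ,
    and only by a transition (assumed reading of the BLASTZ rule)."""
    if not (len(x) == len(y) == len(model)):
        raise ValueError("words and model must have the same length")
    diffs = [(a, b) for a, b, m in zip(x, y, model) if m == "1" and a != b]
    return not diffs or (len(diffs) == 1 and diffs[0] in TRANSITIONS)
```

With the example from the text, `is_word_match("ACGTC", "ATGAC", "10101")` holds, while the same two words do not match under the consecutive model `11111`.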
The sensitivity of a model is the probability of generating
a word match in a fixed-length region of a given percentage
identity. An optimal word model of length 18 and weight 11
has a sensitivity value of 0.467 for HSPs of length 64 and
70% identity, whereas a consecutive word model of length 11
has a sensitivity value of 0.3 (Ma et al., 2002). More HSPs
can be found by running PatternHunter several times with
different models. However, the results of the different runs
have a large number of HSPs in common, and much time is spent
recomputing HSPs that were already found in previous runs.
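The sensitivity figures quoted above can be approximated by simulation. The sketch below is a Monte Carlo estimate, not the exact computation of Ma et al. (2002). It assumes that each position of the region is identical independently with probability equal to the percentage identity. The model string `111010010100110111` is assumed here to be the length-18, weight-11 model meant in the text; it does not appear in this paper, so treat it as an assumption. The function name and default parameters are hypothetical.

```python
import random


def estimate_sensitivity(model: str, region_len: int = 64,
                         identity: float = 0.7, trials: int = 20000,
                         seed: int = 0) -> float:
    """Estimate the probability that a random region contains a word match.

    The region is encoded as an integer whose bits mark identical positions;
    a word match at a given offset means every checked position is identical.
    """
    rng = random.Random(seed)
    k = len(model)
    mask = int(model, 2)  # 1 bits are the checked positions
    hits = 0
    for _ in range(trials):
        region = 0
        for _ in range(region_len):
            region = (region << 1) | (rng.random() < identity)
        if any(((region >> shift) & mask) == mask
               for shift in range(region_len - k + 1)):
            hits += 1
    return hits / trials


consecutive = estimate_sensitivity("1" * 11)
spaced = estimate_sensitivity("111010010100110111")  # assumed model string
```

Under these assumptions the two estimates should come out close to the published values of 0.3 and 0.467, up to sampling error.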
In this paper, we describe an efficient algorithm for finding
HSPs under a set of word models simultaneously. If an HSP
contains a word match under one of the word models, then the
HSP is reported and no additional time is spent on the HSP.
In addition, HSPs that contain a transitive word match are
computed and reported once. Words x and y form a transitive
word match if there is a word z such that words x and z form a
match under one model in the set, and words y and z form
a match under another model in the set. The algorithm is
implemented as a computer program named DDS2.
Experimental results produced by DDS2 on sequences of human
chromosome 21 and mouse chromosome 16 indicate that
DDS2 can find more HSPs under a set of three word models
than under one optimal word model.
We describe an algorithm for computing HSPs between two
sets of sequences under a set of word models. The sequences
in one set are called query sequences and those in the other are
called database sequences. The query and database sequences
are concatenated together with a special boundary
character inserted at each sequence boundary, where the query
sequences are placed before the database sequences. The
resulting sequence is called the combined sequence.
Concatenating the two sets of sequences is an efficient
way to represent each query and database sequence position
by a single unique identifier, namely the location of the
position in the combined sequence. The algorithm uses this
representation to handle sets of query and database sequence
positions. An alternative method is to represent each sequence
position by three identifiers: data set id, sequence id and
position id. Positions of the combined sequence are numbered
from 1, so that the value 0 can be used to indicate that a set
of positions is empty. Any word of the combined sequence that
contains the special boundary character or an irregular base
is not considered in the following steps.
An alphabet of size 4, corresponding to the four regular base
types, is used to reduce the space requirement of the algorithm.
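As an illustration of the combined-sequence representation, the following sketch builds the combined sequence and converts a combined position back into the three-identifier form. The boundary character `#`, the function names and the tuple layout are assumptions made for this example only; the compact 4-letter encoding is not shown.

```python
from bisect import bisect_right

BOUNDARY = "#"          # hypothetical choice of boundary character
REGULAR = set("ACGT")   # the four regular base types


def build_combined(queries, databases):
    """Concatenate query sequences, then database sequences, with one
    boundary character between neighbours. Returns the combined string,
    the 1-based start position of every sequence, and the query count."""
    seqs = [s.upper() for s in list(queries) + list(databases)]
    starts, pos = [], 1
    for s in seqs:
        starts.append(pos)
        pos += len(s) + 1  # +1 accounts for the boundary character
    return BOUNDARY.join(seqs), starts, len(queries)


def locate(p, starts, n_queries):
    """Map a 1-based combined position (not a boundary position) to
    (data set id, sequence id, position id), with a 1-based position id."""
    idx = bisect_right(starts, p) - 1
    offset = p - starts[idx] + 1
    if idx < n_queries:
        return ("query", idx, offset)
    return ("database", idx - n_queries, offset)


def is_regular_word(combined, p, k):
    """True if the word of length k starting at 1-based position p exists
    and consists only of regular bases."""
    word = combined[p - 1:p - 1 + k]
    return len(word) == k and all(c in REGULAR for c in word)
```

For example, `build_combined(["ACGT", "GG"], ["TTNA"])` yields the combined sequence `ACGT#GG#TTNA`, in which position 9 is the first base of the only database sequence.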
Two positions of the combined sequence are equivalent under
a word model of length k if the words of length k starting at the
two positions consist only of regular bases and form a match
under the model. Two positions p1 and p2 of the combined
sequence are equivalent if there is a word model in the set
such that the two positions are equivalent under the model, or
if there is a position p3 of the combined sequence such that p1
and p3 are equivalent, and p3 and p2 are equivalent. In other
words, equivalence is the transitive closure of equivalence
under the individual models. Each position is also taken to be
equivalent to itself.
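The equivalence relation defined above can be computed in several ways. The sketch below uses a union-find structure keyed on the bases at the checked positions; this is a technique swapped in for illustration and is not claimed to be the procedure used in DDS2. The function name is hypothetical, and `#` is assumed as the boundary character.

```python
from collections import defaultdict


def equivalence_classes(combined, models):
    """Group 1-based positions of the combined sequence into sets of
    equivalent positions under a list of word models (binary strings)."""
    n = len(combined)
    parent = list(range(n + 1))  # index 0 is unused; positions are 1-based

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for model in models:
        k = len(model)
        checked = [i for i, bit in enumerate(model) if bit == "1"]
        first_seen = {}  # bases at checked positions -> first position seen
        for p in range(1, n - k + 2):
            word = combined[p - 1:p - 1 + k]
            if any(c not in "ACGT" for c in word):
                continue  # skip words with a boundary or irregular base
            key = "".join(word[i] for i in checked)
            if key in first_seen:
                union(first_seen[key], p)
            else:
                first_seen[key] = p

    groups = defaultdict(list)
    for p in range(1, n + 1):
        groups[find(p)].append(p)  # every position is equivalent to itself
    return sorted(groups.values())
```

In the combined sequence `ACA#AGA#AGT` with models `101` and `110`, positions 1 and 5 are equivalent under `101`, positions 5 and 9 are equivalent under `110`, and so positions 1 and 9 end up in the same set even though the words `ACA` and `AGT` match under neither model.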
The algorithm for finding HSPs between query and database
sequences under the set of word models consists of two major
steps. In step 1, the sets of equivalent positions are computed.
Then every query position is linked to a list of database
positions equivalent to the query position. In step 2, for each query
sequence Q, HSPs between Q and the database sequences are
computed as follows. For each position q of Q and for each
position d in the list of database positions equivalent to q, if
the pair of positions q and d is not covered by any HSP that
is already computed, then a pair of words starting at the pair
of positions is extended into an HSP and the HSP is saved
if its score is greater than a cutoff. HSPs between Q and a
database sequence are combined into high-scoring chains of
HSPs (Wilbur and Lipman, 1983; Huang, 2002). Below we
describe step 1 and parts of step 2 in detail.
In step 1, initially, each position of the combined sequence
is a se (...truncated)