PatternHunter: faster and more sensitive homology search
Vol. 18 no. 3 2002
Pages 440–445
BIOINFORMATICS
Bin Ma 1, John Tromp 2 and Ming Li 3
1 Computer Science Department, University of Western Ontario, London N6A 5B8,
Canada, 2 Bioinformatics Solutions Inc., 145 Columbia Street West, Waterloo,
Ont. N2L 3L2, Canada and 3 Computer Science Department, University of Waterloo,
Waterloo, Ont. N2L 3G1, Canada and Bioinformatics Lab, Computer Science
Department, University of California, Santa Barbara, CA 93106, USA
Received on August 24, 2001; revised on October 10, 2001; accepted on October 15, 2001
ABSTRACT
Motivation: Genomics and proteomics studies routinely
depend on homology searches based on the strategy of
finding short seed matches which are then extended. The
exploding genomic data growth presents a dilemma for
DNA homology search techniques: increasing seed size
decreases sensitivity whereas decreasing seed size slows
down computation.
Results: We present a new homology search algorithm
‘PatternHunter’ that uses a novel seed model for increased sensitivity and new hit-processing techniques for
significantly increased speed. At Blast levels of sensitivity,
PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours
on a desktop.
Availability: PatternHunter is available at
http://www.bioinformaticssolutions.com, as a commercial package. It
runs on all platforms that support Java. PatternHunter
technology is being patented; commercial use requires a
license from BSI, while non-commercial use will be free.
Contact:
INTRODUCTION
We are interested in faster and more sensitive methods
for finding all approximate repeats or homologies in one
DNA sequence or between two DNA sequences, as performed by the popular Blastn (Altschul et al., 1990) program. One particular application of this task is in comparative genomics where large genomes or chromosomes
such as the human one (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001) need to
be compared.
Many programs have been developed for the task. These
include FASTA (Lipman and Pearson, 1985), SIM (Huang
and Miller, 1991), the Blast family (Altschul et al., 1990;
Gish, 2001; Altschul et al., 1997; Zhang et al., 2000;
Tatusova and Madden, 1999), SENSEI (States, 2000),
MUMmer (Delcher et al., 1999), QUASAR (Burkhardt et
al., 1999), and REPuter (Kurtz and Schleiermacher, 1999).
Smith–Waterman alignment, which compares all bases
against all bases, is clearly too slow. Two lines of approach
lead to improvements. The first is exemplified by Blast,
which is used routinely by thousands of scientists. This
approach finds short exact ‘seed’ matches (hits), which
are then extended into longer alignments. However, when
comparing two very long sequences, FASTA, SIM, Blastn
(BL2SEQ), WU-Blast, and Psi-Blast run very slowly and
need large amounts of memory. SENSEI is somewhat
faster and uses much less memory than the above programs, but is currently limited to ungapped alignments.
MegaBlast runs quite efficiently with its default gap scores
and large seed length of 28 but turns out to have worse
output quality and doesn’t scale as well to huge sequences.
Another line of approach, exemplified by MUMmer,
QUASAR and REPuter, uses suffix trees. Suffix trees
suffer from two problems. First, they are meant to deal with
precise matches and are limited to comparison of highly
similar sequences (Delcher et al., 1999; Burkhardt et al.,
1999; Kurtz and Schleiermacher, 1999); they are very
awkward in handling mismatches. Second, they have an
intrinsically large space requirement.
We introduce novel seeding schemes and hit-processing
methods, which are implemented in our program PatternHunter. On a modern desktop, its running time ranges
from seconds for prokaryotic genomes to minutes for
Arabidopsis chromosomes to hours for human chromosomes, with very modest memory use, and at provably
higher sensitivity than the default Blastn.
SELECTING GOOD SEEDS: EXPECT LESS TO
GET MORE
A dilemma for a Blast type of search is that large seeds
lose distant homologies while small ones create too many
random hits which slow down the computation. We use a
new idea that allows us to have a higher probability of a
hit in a homologous region, even while having a somewhat
lower expected number of random hits.
[Figure 1 (plot not reproduced): vertical axis 'sensitivity'; curves labelled 110100110010101111, 11111111111 and 1111111111.]
† For statistical purposes, we count overlapping hits separately, while the
Blast program ignores hits overlapping the last recorded one.
© Oxford University Press 2002
Blast looks for matches of k (default k = 11 in Blastn
and k = 28 in MegaBlast) consecutive letters as seeds.
Instead, we propose to use k nonconsecutive letters as
seeds. We call the relative positions of the k letters a
model, and k its weight.
This seemingly simple change has a surprisingly large
effect on sensitivity. An appropriately chosen model can
have a significantly higher probability of having at least
one hit in a homologous region, compared to Blast’s
consecutive seed model, even while having a lower
expected number of hits†. For example, in a region of
length 64 with 70% identity, Blast’s consecutive weight 11
model has a 0.30 probability of having at least one
hit in the region, while a nonconsecutive model of the
same weight has a 0.466 probability of getting a hit
(see Figure 1). On the other hand, the expected number
of hits in that region by the Blast consecutive model
is 1.07, while the nonconsecutive model expects 0.93
hits. This is because the consecutive model, of length 11, can
shift over 54 places within the length 64 window, while the
nonconsecutive model, which spans 18 positions, has only 47
places to fit. The reason for the
increased sensitivity is that the events of having a match at
different positions become more independent for spaced
models. If a model and a shifted copy share many 1s in
the same position, then a base mismatch in any of these
shared positions will make both matches fail, hence the
corresponding matching events are far from independent.
Independent events are better at pooling their success
probabilities together. Generally, the fewer bases shared
by a model and any of its shifted copies, the higher its
sensitivity is. Clearly, by this measure, consecutive models
are the worst, since a shift by one position shares all but one base.
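The arithmetic in this paragraph can be checked with a short sketch. It assumes, as the example above does, that each position of the region matches independently with probability equal to the identity level. The function names expected_hits and hit_probability are hypothetical and are not part of PatternHunter. The models are written as strings of 1s (required matches) and 0s (ignored positions). The hit probability is computed by a simple dynamic program over the most recent match/mismatch pattern, which is one possible way to obtain such numbers and not necessarily the method used for Figure 1.

```python
def expected_hits(model: str, length: int, identity: float) -> float:
    """Expected number of hits, counting overlapping hits separately."""
    weight = model.count("1")                    # number of required matches
    placements = max(length - len(model) + 1, 0) # positions where the model fits
    return placements * identity ** weight


def hit_probability(model: str, length: int, identity: float) -> float:
    """Probability of at least one hit in a region of the given length.

    Assumption: positions match independently with probability `identity`.
    """
    if not model.startswith("1"):
        raise ValueError("this sketch assumes the model starts with a 1")
    span = len(model)
    model_bits = int(model, 2)
    mask = (1 << (span - 1)) - 1
    # Map: last span-1 match bits -> probability of that history with no hit yet.
    # Positions before the region are treated as mismatches (state 0).
    states = {0: 1.0}
    for _ in range(length):
        nxt = {}
        for state, prob in states.items():
            for bit, w in ((1, identity), (0, 1.0 - identity)):
                window = (state << 1) | bit
                if window & model_bits == model_bits:
                    continue  # every required position matches: a hit
                key = window & mask
                nxt[key] = nxt.get(key, 0.0) + prob * w
        states = nxt
    return 1.0 - sum(states.values())


if __name__ == "__main__":
    print(round(expected_hits("1" * 11, 64, 0.7), 2))              # 54 placements
    print(round(expected_hits("110100110010101111", 64, 0.7), 2))  # 47 placements
    print(round(hit_probability("1" * 11, 64, 0.7), 3))
```

The spaced model of span 18 can be passed to hit_probability in the same way and compared with the 0.466 quoted above. The table then holds up to 2^17 states, so this pure-Python loop is noticeably slower for it than for the consecutive model.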
For convenience, we denote a model by a 0–1 string,
where the 1-positions represent required matches, while
the 0s are ‘don’t cares’. For example, if we use a weight
six model 1110111, then actgact versus acttact is a
seed match, as well as actgact versus actgact. So Blast
uses models of the form 1^k. Blast actually matches two
or three bytes, each containing four bases, simultaneously,
and extends these hits to the left and right. This is fine
for the default of k = 11 because any length 11 match
necessarily contains a match of two bytes, but for k smaller
than 11, it will miss some seeds. (...truncated)
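The 0–1 notation for models can be illustrated with a minimal sketch. The helper names weight and is_seed_match are hypothetical, and the code only tests a single pair of equal-length words against a model. It does not reflect how PatternHunter or Blast actually index and look up seeds.

```python
def weight(model: str) -> int:
    """Number of required-match positions (1s) in a 0-1 model string."""
    return model.count("1")


def is_seed_match(model: str, a: str, b: str) -> bool:
    """True if a and b agree at every 1-position of the model.

    Positions marked 0 are 'don't cares' and are ignored.
    """
    if not (len(model) == len(a) == len(b)):
        raise ValueError("model and both words must have the same length")
    return all(x == y for m, x, y in zip(model, a, b) if m == "1")


if __name__ == "__main__":
    model = "1110111"                                   # weight six, span seven
    print(weight(model))                                # 6
    print(is_seed_match(model, "actgact", "acttact"))   # True: differ only at the 0
    print(is_seed_match(model, "actgact", "actgact"))   # True: identical words
    print(is_seed_match(model, "actgact", "aatgact"))   # False: differ at a 1
    print(is_seed_match("1" * 7, "actgact", "acttact")) # False under a 1^k model
```

The last call shows the contrast with a consecutive model: the same pair of words that matches under 1110111 fails under the all-ones model of the same span.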