Good spaced seeds for homology search
BIOINFORMATICS
Vol. 20 no. 7 2004, pages 1053–1059
DOI: 10.1093/bioinformatics/bth037
Kwok Pui Choi1,2, ∗, Fanfan Zeng3 and Louxin Zhang1
1 Department of Mathematics, 2 Department of Statistics and Applied Probability and
3 School of Computing, National University of Singapore, Singapore 117543
Received on October 7, 2003; revised October 7, 2003; accepted on November 15, 2003
Advance Access publication February 12, 2004
1 INTRODUCTION
Aligning genomic sequences from different
species has been extensively used in various applications,
such as gene detection (Yeh et al., 2001), inferring SNPs,
tandem and segmental duplications, and locating intronic and
intergenic regions with potential biological functions (Delcher
et al., 1999; Hardison et al., 1997; Li et al., 2001). With the
fast growing number of genomes being completely sequenced,
sequence alignment has become an indispensable tool in
comparative genomics. This unprecedented demand for comparing long genomic DNA sequences has stimulated the need
to design faster and yet sensitive alignment tools. In recent
years, there has been a surge of alignment programs designed
to meet this need for different purposes, e.g. Lipman and
Pearson (1985), Altschul et al. (1990, 1997), Huang and Miller
(1991), Gish and States (1993), Zhang et al. (2000), Ning et al.
(2001), Schwartz et al. (2003), Kent (2002), Ma et al. (2002),
to name but a few.
One popular approach to speed up alignment is the filtration
technique as exemplified in the BLAST programs (Altschul
et al., 1990). This approach consists of two steps: (i) ‘search
step’—it first picks up short contiguous regions in the target sequence that have a perfect match in the query sequence
and (ii) ‘alignment step’—it detects whether each short region
obtained in (i) can be extended into a significant alignment,
and outputs the alignment if so. For example, the earliest version of the BLASTN
program first finds perfect matches of
11 consecutive bases between a query sequence and a target DNA sequence, and then extends these exact matches into
local alignments, keeping those with scores that exceed a preassigned threshold. Another program, BLAT, developed
by Kent (2002), allows single or multiple nearby hits of predetermined patterns, such as short perfect matches and single
almost perfect matches, to trigger a local alignment.
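The two-step filtration scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BLAST or BLAT implementation: the function names `exact_kmer_hits` and `extend_right`, the match/mismatch scores and the drop-off rule are all hypothetical choices made for the example, and only rightward ungapped extension is shown.

```python
from collections import defaultdict


def exact_kmer_hits(query, target, k):
    """Search step: return (query_start, target_start) pairs (0-based)
    at which the two sequences share an identical k-mer."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def extend_right(query, target, i, j, match=1, mismatch=-1, xdrop=3):
    """Alignment step (simplified): ungapped extension to the right of a hit.

    Returns (best_score, best_length). Extension stops once the running
    score falls more than `xdrop` below the best score seen so far.
    The scoring values are illustrative assumptions.
    """
    score = best = best_len = n = 0
    while i + n < len(query) and j + n < len(target):
        score += match if query[i + n] == target[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > xdrop:
            break
    return best, best_len


if __name__ == "__main__":
    q, t = "gcaattgccg", "acgattgctg"
    for i, j in exact_kmer_hits(q, t, k=4):
        print((i, j), extend_right(q, t, i, j))
```

In a real program the extended alignments would then be filtered by a score threshold, as the text describes; the index over the target is what makes the search step fast.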
Two conflicting factors, search speed and sensitivity, are
at play in the design of sequence alignment programs when
the filtration technique is used. Let k denote the number of contiguous matching bases required in the search step. If a smaller k is used,
the search step picks up more short regions
due to chance, but many of them are discarded
in the alignment step, hence an increase in computing time.
On the other hand, if a larger k is used, significant
alignment regions without any k contiguous perfect matches
are missed in the search step, hence a decrease
in the sensitivity of the homology search.
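The speed side of this trade-off can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that the bases of two unrelated sequences are independent and uniformly distributed over the four nucleotides, so that any given pair of k-mers matches with probability (1/4)^k; this background model and the function name `expected_chance_hits` are assumptions of the example, not part of the text.

```python
def expected_chance_hits(m, n, k):
    """Expected number of (query, target) position pairs whose k-mers
    match purely by chance, for sequences of lengths m and n.

    Assumes independent, uniformly distributed bases (an illustrative
    simplification), so each pair matches with probability (1/4)**k.
    """
    pairs = max(m - k + 1, 0) * max(n - k + 1, 0)
    return pairs / 4 ** k


if __name__ == "__main__":
    # Each reduction of k by one multiplies the chance hits by about 4.
    for k in (9, 10, 11, 12, 13):
        print(k, expected_chance_hits(10_000, 10_000, k))
```

Under this simplified model, every chance hit costs one extension attempt in the alignment step, which is why a smaller k slows the program down.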
Recently, a novel approach in the search step to trigger a
local alignment was introduced by Ma et al. (2002). Their
program PatternHunter (PH) utilizes a single optimal match
pattern to improve the alignment sensitivity. Such an innovation is important since the general sequence search aims to
identify more homologous sequences, in which the mismatch
positions are unknown. More specifically, PH looks for runs of
18 consecutive nucleotide bases in each sequence, in which
nucleotide matches are required at the 11 positions marked
by the 1s in the string 111∗1∗∗1∗1∗∗11∗111. Such a pattern is called a spaced seed. Even on a personal computer with
moderate memory space, PH is able to compare prokaryotic
genomes in seconds, Arabidopsis chromosomes in minutes
and human or mouse chromosomes in hours (Waterston et al.,
2002; Scherer et al., 2003; Ureta-Vidal et al., 2003).
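As a quick check of the numbers quoted for the PH seed, the following sketch parses a seed string and reports how many positions must match and how many bases the seed spans. The helper name `parse_seed` is hypothetical, and the ASCII character `*` stands in for the ∗ symbol used in the text.

```python
def parse_seed(seed):
    """Return (match_positions, number_of_1s, span) for a seed over {1, *}."""
    if set(seed) - {"1", "*"}:
        raise ValueError("seed must contain only '1' and '*'")
    positions = [i for i, c in enumerate(seed) if c == "1"]
    return positions, len(positions), len(seed)


PH_SEED = "111*1**1*1**11*111"  # the PatternHunter seed quoted above

if __name__ == "__main__":
    positions, ones, span = parse_seed(PH_SEED)
    print(positions, ones, span)
```

The seed requires 11 matching positions within a window of 18 bases, in agreement with the description above.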
The spaced seed idea in PH motivated several research
groups to work on the problem of identifying optimal spaced
seeds in different sequence alignment models (Keith et al.,
2002; Buhler et al., 2003; Brejová et al., 2003; Choi and
Zhang, 2003). Assuming that the similarity of the sequences
ABSTRACT
Motivation: Filtration is an important technique used to speed
up local alignment as exemplified in the BLAST programs.
Recently, Ma et al. discovered that better filtering can be
achieved by spacing out the matching positions according to
a certain pattern, rather than requiring contiguous positions, to trigger a
local alignment in their PatternHunter program. Such a match
pattern is called a spaced seed.
Results: Our numerical computation shows that the ranks
of spaced seeds (based on sensitivity) change with the
sequence similarity. Since homologous sequences may have
diverse similarity, we assess the sensitivity of spaced seeds
over a range of similarity levels and present a list of good
spaced seeds for facilitating homology search in DNA genomic sequences. We validate that the listed spaced seeds are
indeed more sensitive using three arbitrarily chosen pairs of
DNA genomic sequences.
Contact:
2 GOOD SPACED SEEDS
2.1 Sensitivity of spaced seeds in local alignment
As mentioned before, in the search step of the filtration method the BLAST programs look for a perfect
match of k contiguous bases that appears in both the query and
target sequences.
The novelty of the idea introduced in Ma et al. (2002) is that
better filtering can be achieved by spacing out the k matching
positions. Since we still require only k matches, better filtering
is achieved without sacrificing the speed in the search step.
Such a pattern of the matching positions is called a ‘spaced
seed’ in their paper. We denote a spaced seed by a string on
{1, ∗}, where 1s indicate exact match positions; and ∗s indicate
positions which are not required to match (called the ‘don’t
care’ positions). Suppose that the spaced seed Q = 1∗∗11∗1
is adopted; then for every 7-mer from the query sequence, we
require a match at positions 0, 3, 4 and 6 (where we number
the positions in the spaced seed from 0). For example, if
the query and target sequences are respectively gcaattgccg
and acgattgctg, then the 7-mers caattgc and attgccg
in the query sequence hit the target sequence at positions 8 and
10, respectively (here a hit is recorded at the 1-based position of the last base of the matching target window); whereas the 7-mer gcaattg does not hit the
target sequence at all. Alternatively, we specify a spaced seed
by the relative positions of the 1s in the seed (Burkhardt and
Kärkkäinen, 2001). For example, the seed Q given above has
the set of relative positions {0, 3, 4, 6}. The number of 1s in
a seed is called its weight and its overall length is called its
length.
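The example above can be reproduced with a short script. This is a minimal sketch: the function name `seed_hits` is hypothetical, the seed is given by its set of relative positions, and a hit is reported at the 1-based position of the last base of the matching target window, which is the convention that yields the positions 8 and 10 quoted in the example.

```python
def seed_hits(window, target, positions):
    """Return the 1-based end positions in `target` at which `window`
    agrees with the target on every relative position of the seed."""
    span = max(positions) + 1
    ends = []
    for start in range(len(target) - span + 1):
        if all(window[p] == target[start + p] for p in positions):
            ends.append(start + span)  # 1-based index of the last base
    return ends


Q = (0, 3, 4, 6)  # relative positions of the seed 1**11*1

if __name__ == "__main__":
    query, target = "gcaattgccg", "acgattgctg"
    span = max(Q) + 1
    for i in range(len(query) - span + 1):
        window = query[i:i + span]
        print(window, seed_hits(window, target, Q))
```

Only the bases at the seed's match positions are compared; the don't-care positions may differ freely, which is why caattgc hits the target window cgattgc despite the mismatch at its second base.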
To measure the sensitivity of a given spaced seed, we adopt
the same probability model (PH model) as in Ma et al. (2002).
Assume that S1 and S2 are two DNA sequences of length n
su (...truncated)