Good spaced seeds for homology search

BIOINFORMATICS Vol. 20 no. 7 2004, pages 1053–1059
DOI: 10.1093/bioinformatics/bth037

Kwok Pui Choi 1,2,∗, Fanfan Zeng 3 and Louxin Zhang 1

1 Department of Mathematics, 2 Department of Statistics and Applied Probability and 3 School of Computing, National University of Singapore, Singapore 117543

∗ To whom correspondence should be addressed.

Received on October 7, 2003; revised October 7, 2003; accepted on November 15, 2003
Advance Access publication February 12, 2004

Bioinformatics 20(7) © Oxford University Press 2004; all rights reserved.

1 INTRODUCTION

The alignment of genomic sequences from different species has been used extensively in applications such as gene detection (Yeh et al., 2001), inferring SNPs, tandem and segmental duplications, and locating intronic and intergenic regions with potential biological functions (Delcher et al., 1999; Hardison et al., 1997; Li et al., 2001). With the fast-growing number of completely sequenced genomes, sequence alignment has become an indispensable tool in comparative genomics. This unprecedented demand for comparing long genomic DNA sequences has stimulated the need for faster yet still sensitive alignment tools. In recent years, there has been a surge of alignment programs designed to meet this need for different purposes, e.g. Lipman and Pearson (1985), Altschul et al. (1990, 1997), Huang and Miller (1991), Gish and States (1993), Zhang et al. (2000), Ning et al. (2001), Schwartz et al. (2003), Kent (2002) and Ma et al. (2002), to name but a few.

One popular approach to speeding up alignment is the filtration technique, as exemplified in the BLAST programs (Altschul et al., 1990). This approach consists of two steps: (i) the ‘search step’, which picks up short contiguous regions in the target sequence that have a perfect match in the query sequence, and (ii) the ‘alignment step’, which checks whether each short region obtained in (i) can be extended into a significant alignment and, if so, outputs this alignment. For example, the earliest version of the BLASTN program first finds perfect matches of 11 consecutive nucleotide bases between a query sequence and a target DNA sequence, and then extends these exact matches into local alignments, keeping those with scores that exceed a preassigned threshold. Another program, BLAT, developed by Kent (2002), allows single or near multiple hits of predetermined patterns, such as short perfect matches and single almost perfect matches, to trigger a local alignment.

Two conflicting factors, search speed and sensitivity, are at play in the design of sequence alignment programs when the filtration technique is used. Let k denote the number of contiguous bases that must match in the search step. If a smaller k is used, the search step picks up more short regions by chance, but many of them are discarded in the alignment step, which increases computing time. On the other hand, if a larger k is used, significant alignment regions without any perfect match of k contiguous bases are missed in the search step, which decreases the sensitivity of the homology search.

Recently, a novel approach to triggering a local alignment in the search step was introduced by Ma et al. (2002). Their program PatternHunter (PH) utilizes a single optimal match pattern to improve the alignment sensitivity. Such an innovation is important since a general sequence search aims to identify more homologous sequences, in which the mismatch positions are unknown. More specifically, PH looks for runs of 18 consecutive nucleotide bases in each sequence, in which nucleotide matches are required at the 11 positions marked by the 1s in the string 111∗1∗∗1∗1∗∗11∗111. Such a pattern is called a spaced seed.
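The search step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the indexing scheme used by BLAST or PatternHunter: the function names `seed_positions` and `seed_hits` and the toy sequences are hypothetical, and the seed is written with an ASCII `*` for the don't care symbol. A contiguous seed is simply a seed made only of 1s, so the same routine covers both cases.

```python
from collections import defaultdict


def seed_positions(seed):
    """Offsets of the required-match positions (the '1' characters) in a seed.

    Any character other than '1' is treated as a don't care position.
    """
    return [i for i, c in enumerate(seed) if c == "1"]


def seed_hits(query, target, seed):
    """Return all 0-based (i, j) pairs such that the window of `query`
    starting at i and the window of `target` starting at j agree at every
    required-match position of `seed`."""
    span = len(seed)
    ones = seed_positions(seed)

    # Index every target window by the bases it shows at the '1' positions.
    index = defaultdict(list)
    for j in range(len(target) - span + 1):
        key = "".join(target[j + p] for p in ones)
        index[key].append(j)

    # Look up every query window in that index.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + p] for p in ones)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences chosen for illustration only.
    query, target = "gcaattgccg", "acgattgctg"
    print(seed_hits(query, target, "1111"))     # contiguous seed, weight 4
    print(seed_hits(query, target, "1**11*1"))  # spaced seed, weight 4
```

Both seeds in the usage example require four matching bases, so the index keys have the same size; only the positions that are inspected differ. Each reported pair would then be handed to the alignment step for extension.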
Even on a personal computer with moderate memory, PH is able to compare prokaryotic genomes in seconds, Arabidopsis chromosomes in minutes and human or mouse chromosomes in hours (Waterston et al., 2002; Scherer et al., 2003; Ureta-Vidal et al., 2003).

The spaced seed idea in PH motivated several research groups to work on the problem of identifying optimal spaced seeds in different sequence alignment models (Keith et al., 2002; Buhler et al., 2003; Brejová et al., 2003; Choi and Zhang, 2003). Assuming that the similarity of the sequences [...]

ABSTRACT

Motivation: Filtration is an important technique used to speed up local alignment, as exemplified in the BLAST programs. Recently, Ma et al. discovered that, in their PatternHunter program, better filtering can be achieved by using matching positions spaced out according to a certain pattern, rather than contiguous positions, to trigger a local alignment. Such a match pattern is called a spaced seed.

Results: Our numerical computation shows that the ranks of spaced seeds (based on sensitivity) change with the sequence similarity. Since homologous sequences may have diverse similarity, we assess the sensitivity of spaced seeds over a range of similarity levels and present a list of good spaced seeds for facilitating homology search in DNA genomic sequences. Using three arbitrarily chosen pairs of DNA genomic sequences, we validate that the listed spaced seeds are indeed more sensitive.

Contact: K.P.Choi et al.

2 GOOD SPACED SEEDS

2.1 Sensitivity of spaced seeds in local alignment

As mentioned before, in the search step of the filtration method the BLAST programs look for a perfect match of k contiguous bases that appear in both the query and target sequences. The novelty of the idea introduced in Ma et al. (2002) is that better filtering can be achieved by spacing out the k matching positions. Since we still require only k matches, better filtering is achieved without sacrificing the speed in the search step.
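To see why spacing out the matches need not slow the search step, it may help to work through the chance-hit rate under a simplifying assumption that is not part of the text above: two unrelated sequences whose bases are independent and uniformly distributed over {a, c, g, t}. The symbols m and n (the lengths of the two sequences) and L (the length of the seed, including don't care positions) are introduced here for illustration only.

```latex
\Pr[\text{seed hits at a fixed pair of positions}] = \left(\tfrac{1}{4}\right)^{k},
\qquad
\mathbb{E}[\text{number of chance hits}] = (m - L + 1)(n - L + 1)\, 4^{-k}.
```

For k = 11 the per-position probability is 4^{-11} ≈ 2.4 × 10^{-7}, whether the 11 required matches are contiguous (L = 11) or spread over L = 18 positions as in the PH seed. Under this assumption, only the number of windows changes slightly, so the amount of work passed on to the alignment step stays essentially the same.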
The pattern of matching positions is called a ‘spaced seed’ in Ma et al. (2002). We denote a spaced seed by a string over {1, ∗}, where 1s indicate positions that must match exactly and ∗s indicate positions that are not required to match (the ‘don’t care’ positions). Suppose that the spaced seed Q = 1∗∗11∗1 is adopted. Then, for every 7mer from the query sequence, we require a match at positions 0, 3, 4 and 6 (where positions in the spaced seed are numbered from 0). For example, if the query and target sequences are gcaattgccg and acgattgctg, respectively, then the 7mers caattgc and attgccg in the query sequence hit the target sequence at positions 8 and 10, respectively (the positions, counted from 1, at which the matching 7mers end), whereas the 7mer gcaattg does not hit the target sequence at all.

Alternatively, we specify a spaced seed by the relative positions of the 1s in the seed (Burkhardt and Kärkkäinen, 2001). For example, the seed Q given above has the set of relative positions {0, 3, 4, 6}. The number of 1s in a seed is called its weight, and the total number of positions, including the don’t care positions, is called its length.

To measure the sensitivity of a given spaced seed, we adopt the same probability model (PH model) as in Ma et al. (2002). Assume that S and S are two DNA sequences of length n su (...truncated)