PatternHunter: faster and more sensitive homology search (pdf)

BIOINFORMATICS Vol. 18 no. 3 2002, Pages 440–445

PatternHunter: faster and more sensitive homology search

Bin Ma 1, John Tromp 2 and Ming Li 3

1 Computer Science Department, University of Western Ontario, London N6A 5B8, Canada, 2 Bioinformatics Solutions Inc., 145 Columbia Street West, Waterloo, Ont. N2L 3L2, Canada and 3 Computer Science Department, University of Waterloo, Waterloo, Ont. N2L 3G1, Canada and Bioinformatics Lab, Computer Science Department, University of California, Santa Barbara, CA 93106, USA

Received on August 24, 2001; revised on October 10, 2001; accepted on October 15, 2001

ABSTRACT

Motivation: Genomics and proteomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The exploding genomic data growth presents a dilemma for DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation.

Results: We present a new homology search algorithm ‘PatternHunter’ that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed. At Blast levels of sensitivity, PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours on a desktop.

Availability: PatternHunter is available at http://www.bioinformaticssolutions.com, as a commercial package. It runs on all platforms that support Java. PatternHunter technology is being patented; commercial use requires a license from BSI, while non-commercial use will be free.

Contact:

INTRODUCTION

We are interested in faster and more sensitive methods for finding all approximate repeats or homologies in one DNA sequence or between two DNA sequences, as performed by the popular Blastn (Altschul et al., 1990) program.
One particular application of this task is in comparative genomics where large genomes or chromosomes such as the human one (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001) need to be compared. Many programs have been developed for the task. These include FASTA (Lipman and Pearson, 1985), SIM (Huang and Miller, 1991), the Blast family (Altschul et al., 1990; Gish, 2001; Altschul et al., 1997; Zhang et al., 2000; Tatusova and Madden, 1999), SENSEI (States, 2000), MUMmer (Delcher et al., 1999), QUASAR (Burkhardt et al., 1999), and REPuter (Kurtz and Schleiermacher, 1999). Smith–Waterman alignment, which compares all bases against all bases, is clearly too slow.

Two lines of approach lead to improvements. The first is exemplified by Blast, which is used routinely by thousands of scientists. This approach finds short exact ‘seed’ matches (hits), which are then extended into longer alignments. However, when comparing two very long sequences, FASTA, SIM, Blastn (BL2SEQ), WU-Blast, and Psi-Blast run very slowly and need large amounts of memory. SENSEI is somewhat faster and uses much less memory than the above programs, but is currently limited to ungapped alignments. MegaBlast runs quite efficiently with its default gap scores and large seed length of 28, but turns out to have worse output quality and does not scale as well to huge sequences.

Another line of approach, exemplified by MUMmer, QUASAR and REPuter, uses suffix trees. Suffix trees suffer from two problems. First, they are meant to deal with precise matches and are limited to comparison of highly similar sequences (Delcher et al., 1999; Burkhardt et al., 1999; Kurtz and Schleiermacher, 1999); they are very awkward in handling mismatches. Second, they have an intrinsically large space requirement.

We introduce novel seeding schemes and hit-processing methods, which are implemented in our program PatternHunter.
On a modern desktop, its running time ranges from seconds for prokaryotic genomes to minutes for Arabidopsis chromosomes to hours for human chromosomes, with very modest memory use, and at provably higher sensitivity than the default Blastn.

© Oxford University Press 2002

SELECTING GOOD SEEDS: EXPECT LESS TO GET MORE

A dilemma for a Blast type of search is that large seeds lose distant homologies while small ones create too many random hits, which slow down the computation. We use a new idea that allows us to have a higher probability of a hit in a homologous region, even while having a somewhat lower expected number of random hits.

Blast looks for matches of k (default k = 11 in Blastn and k = 28 in MegaBlast) consecutive letters as seeds. Instead we propose to use k nonconsecutive letters as seeds. We call the relative positions of the k letters a model, and k its weight. This seemingly simple change has a surprisingly large effect on sensitivity. An appropriately chosen model can have a significantly higher probability of having at least one hit in a homologous region, compared to Blast’s consecutive seed model, even while having a lower expected number of hits†. For example, in a region of length 64 with 70% identity, Blast’s consecutive weight 11 model has a 0.30 probability of having at least one hit in the region, while a nonconsecutive model of the same weight has a 0.466 probability of getting a hit, see Figure 1. On the other hand, the expected number of hits in that region by the Blast consecutive model is 1.07, while the nonconsecutive model expects 0.93 hits. This is because the consecutive model has length 11 and can be placed at 54 positions within the length 64 window, while the nonconsecutive model of Figure 1, 110100110010101111, has length 18 and fits at only 47 positions.

† For statistical purposes, we count overlapping hits separately, while the Blast program ignores hits overlapping the last recorded one.

[Figure 1: sensitivity (vertical axis) of the models 110100110010101111, 11111111111 and 1111111111.]
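The arithmetic behind these figures can be sketched in Python. This is a minimal sketch, not the authors' code: it assumes that every position of the homologous region matches independently with probability equal to the identity level (an assumption of this example), and the function names `expected_hits` and `sensitivity` are hypothetical. The expected hit count is the number of placements times the probability that all required positions match; the probability of at least one hit is computed by a small dynamic program that tracks which partially matched placements are still alive.

```python
def expected_hits(model: str, region_len: int, identity: float) -> float:
    """Expected number of hits, counting overlapping hits separately."""
    weight = model.count("1")                  # number of required matches
    placements = region_len - len(model) + 1   # positions where the model fits
    return max(placements, 0) * identity ** weight


def sensitivity(model: str, region_len: int, identity: float) -> float:
    """Probability of at least one hit in a region of length region_len.

    Assumption (illustrative): positions match independently with
    probability `identity`.
    """
    span = len(model)
    # Bit k of `required` is set when the model demands a match at offset k.
    required = sum(1 << k for k, c in enumerate(model) if c == "1")
    top = 1 << (span - 1)      # bit set once a placement has been fully matched
    full = (1 << span) - 1
    # State: bit k set means the placement that started k positions ago has
    # not yet seen a mismatch at any of its required offsets.
    states = {0: 1.0}          # probability mass of "no hit so far" per state
    hit = 0.0
    for _ in range(region_len):
        nxt = {}
        for alive, prob in states.items():
            cand = ((alive << 1) | 1) & full   # age placements, start a new one
            outcomes = (
                (cand, prob * identity),                    # base matches
                (cand & ~required, prob * (1 - identity)),  # base mismatches
            )
            for state, pr in outcomes:
                if state & top:
                    hit += pr                  # a placement completed: a hit
                else:
                    nxt[state] = nxt.get(state, 0.0) + pr
        states = nxt
    return hit


consecutive = "1" * 11
spaced = "110100110010101111"
print(round(expected_hits(consecutive, 64, 0.7), 2))   # 1.07
print(round(expected_hits(spaced, 64, 0.7), 2))        # 0.93
print(sensitivity(consecutive, 64, 0.7), sensitivity(spaced, 64, 0.7))
```

Under the independence assumption, the two sensitivities printed on the last line should land close to the 0.30 and 0.466 quoted in the text, with the spaced model ahead despite its lower expected hit count.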
The reason for the increased sensitivity is that the events of having a match at different positions become more independent for spaced models. If a model and a shifted copy share many 1s in the same position, then a base mismatch in any of these shared positions will make both matches fail, hence the corresponding matching events are far from independent. Independent events are better at pooling their success probabilities together. Generally, the fewer bases shared by a model and any of its shifted copies, the higher its sensitivity is. Clearly, by this measure, consecutive models are the worst, since a shift of 1 shares all but one of the bases.

For convenience, we denote a model by a 0–1 string, where the 1-positions represent required matches, while the 0s are ‘don’t cares’. For example, if we use a weight six model 1110111, then actgact versus acttact is a seed match, as well as actgact versus actgact. So Blast uses models of the form 1^k (k consecutive 1s). Blast actually matches two or three bytes, each containing four bases, simultaneously, and extends these hits to the left and right. This is fine for the default of k = 11 because any length 11 match necessarily contains a match of two bytes, but for k smaller than 11, it will miss some seeds. (...truncated)
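The 0–1 model notation and the shared-position argument can be illustrated with a short Python sketch. The helper names `is_seed_match` and `shared_ones` are hypothetical and chosen for this example only; this is not PatternHunter's implementation. The first function checks whether two equal-length strings agree at every 1-position of a model, and the second counts how many 1-positions a model shares with a copy of itself shifted by a given amount.

```python
def is_seed_match(model: str, a: str, b: str) -> bool:
    """True if a and b agree at every 1-position of the model."""
    if not (len(model) == len(a) == len(b)):
        raise ValueError("model and both strings must have the same length")
    # 0-positions are "don't cares", so only 1-positions are compared.
    return all(x == y for m, x, y in zip(model, a, b) if m == "1")


def shared_ones(model: str, shift: int) -> int:
    """Number of 1-positions shared by a model and its copy shifted by `shift`."""
    return sum(
        1
        for i in range(len(model) - shift)
        if model[i] == "1" and model[i + shift] == "1"
    )


# The weight six example from the text: position 4 is a "don't care".
print(is_seed_match("1110111", "actgact", "acttact"))   # True
print(is_seed_match("1110111", "actgact", "actgact"))   # True
print(is_seed_match("1110111", "actgact", "agtgact"))   # False

# Overlap with a copy shifted by one position.
print(shared_ones("1" * 11, 1))                  # 10: all but one
print(shared_ones("110100110010101111", 1))      # 5
```

For a shift of 1, the consecutive weight 11 model shares 10 of its 11 required positions with its shifted copy, whereas the spaced model 110100110010101111 shares only 5, which is the sense in which its matching events are closer to independent.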