Efficient combination of multiple word models for improved sequence comparison


https://bioinformatics.oxfordjournals.org/content/20/16/2529.full.pdf


Xiaoqiu Huang [2], Liang Ye [2], Hui-Hsien Chou [1, 2], I-Hsuan Yang [0], Kun-Mao Chao [0]

[0] Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
[1] Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-1040, USA
[2] Department of Computer Science

Motivation: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes.

Results: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model.

Availability: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.

Contact: -

INTRODUCTION

A number of fast comparison programs have been developed for analysis of genomic DNA sequences (Pearson and Lipman, 1988; Altschul et al., 1990, 1997; Gish, unpublished data; Huang et al., 1997; Delcher et al., 1999; Burkhardt et al., 1999; Kurtz and Schleiermacher, 1999; Zhang et al., 2000; Ning et al., 2001; Kent, 2002; Ma et al., 2002; Schwartz et al., 2003). The BLASTN program (Altschul et al., 1990) is widely used for finding homologous similarities between DNA sequences. It computes high-scoring segment pairs (HSPs) between sequences by locating exact word matches of certain length between sequences and extending each word match into an HSP. The PatternHunter program (Ma et al., 2002) enhances BLASTN, in sensitivity, by allowing base differences in word matches.
Word matches are defined with respect to a word model. A word model of length k is specified by a binary string of k bits. A position at which the model has a 1 bit is called a checked position. The number of checked positions in the model is the weight of the model. Two words of length k form a word match under a word model if the bases at every checked position are identical. For example, the two words ACGTC and ATGAC form a word match under the word model 10101 of length 5 and weight 3. Note that an exact word match of length k is a word match under a word model of length k and weight k, which is called a consecutive word model of length k. The BLASTZ program (Schwartz et al., 2003) takes the idea of PatternHunter further by allowing a transition (A-G, G-A, C-T or T-C) in any one of the checked positions. The sensitivity of a model is the probability of generating a word match in a fixed-length region of a given percentage identity. An optimal word model of length 18 and weight 11 has a sensitivity value of 0.467 for HSPs of length 64 and 70% identity, whereas a consecutive word model of length 11 has a sensitivity value of 0.3 (Ma et al., 2002).

More HSPs can be found by PatternHunter in different runs with different models. However, the results from the different runs have a large number of HSPs in common, and much of the time is spent on recomputing HSPs that were already computed in previous runs. In this paper, we describe an efficient algorithm for finding HSPs under a set of word models simultaneously. If an HSP contains a word match under one of the word models, then the HSP is reported and no additional time is spent on the HSP. In addition, HSPs that contain a transitive word match are computed and reported once. Words x and y form a transitive word match if there is a word z such that words x and z form a match under one model in the set, and words y and z form a match under another model in the set. The algorithm is implemented as a computer program named DDS2.
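The definitions of a word match and a transitive word match can be illustrated with a minimal Python sketch. This is not code from DDS2: the function names are hypothetical, and for simplicity the sketch assumes that both models in a transitive match have the same length, which the definition in the text does not require.

```python
def is_word_match(x: str, y: str, model: str) -> bool:
    """Return True if words x and y form a word match under the binary model.

    Only checked positions (1 bits in the model) must carry identical bases.
    """
    if not (len(x) == len(y) == len(model)):
        raise ValueError("words and model must have the same length")
    return all(a == b for a, b, bit in zip(x, y, model) if bit == "1")


def weight(model: str) -> int:
    """Weight of a word model: its number of checked positions."""
    return model.count("1")


def is_transitive_match(x: str, y: str, z: str, model_a: str, model_b: str) -> bool:
    """Return True if x and z match under model_a and y and z match under model_b.

    Assumption of this sketch: both models have the same length.
    """
    return is_word_match(x, z, model_a) and is_word_match(y, z, model_b)


# The example from the text: ACGTC and ATGAC match under 10101 (weight 3),
# but not under the consecutive model 11111.
print(is_word_match("ACGTC", "ATGAC", "10101"))
print(is_word_match("ACGTC", "ATGAC", "11111"))
```

In this sketch, the words ACGTT and TTGAA do not match under either of the illustrative models 11100 and 00111, but they form a transitive match through the intermediate word ACGAA.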
Experimental results produced by DDS2 on sequences of human chromosome 21 and mouse chromosome 16 indicate that DDS2 can find more HSPs under a set of three word models than under one optimal word model.

We describe an algorithm for computing HSPs between two sets of sequences under a set of word models. The sequences in one set are called query sequences and those in the other are called database sequences. The query and database sequences are concatenated, with a special boundary character inserted at each sequence boundary and with the query sequences placed before the database sequences. The resulting sequence is called the combined sequence. Concatenating the two sets of sequences is a slightly more efficient way to give each query and database sequence position a unique identifier, namely the location of the position in the combined sequence. The algorithm uses this unique representation to handle sets of query and database sequence positions. An alternative method is to represent each sequence position by three identifiers: data set id, sequence id and position id. Positions of the combined sequence are numbered from 1, so that the value 0 can be used to indicate that a set of positions is empty. Any word of the combined sequence that contains the special boundary character or an irregular base is not considered in the following steps. An alphabet of size 4, corresponding to the four regular base types, is used to reduce the space requirement of the algorithm. Two positions of the combined sequence are equivalent under a word model of length k if the words of length k starting at the two positions consist only of regular bases and form a match under the model.
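The combined-sequence representation and the test for equivalence under a single word model can be sketched as follows. This is an illustrative sketch only: the boundary character `#`, the function names and the use of Python strings are assumptions made here, not details taken from DDS2.

```python
BOUNDARY = "#"          # hypothetical boundary character; the text does not name one
REGULAR = set("ACGT")   # the four regular base types


def build_combined(queries, databases):
    """Concatenate query sequences, then database sequences, with BOUNDARY between them.

    Returns the combined sequence and the 1-based position at which the
    database part starts. Assumes at least one query sequence.
    """
    combined = BOUNDARY.join(list(queries) + list(databases))
    db_start = sum(len(q) for q in queries) + len(queries) + 1
    return combined, db_start


def word_at(combined, pos, k):
    """Word of length k starting at 1-based position pos.

    Returns None if the word runs past the end of the combined sequence or
    contains the boundary character or an irregular base.
    """
    word = combined[pos - 1 : pos - 1 + k]
    if len(word) < k or any(c not in REGULAR for c in word):
        return None
    return word


def equivalent_under(combined, p1, p2, model):
    """True if positions p1 and p2 are equivalent under the given word model."""
    k = len(model)
    w1, w2 = word_at(combined, p1, k), word_at(combined, p2, k)
    if w1 is None or w2 is None:
        return False
    return all(a == b for a, b, bit in zip(w1, w2, model) if bit == "1")


combined, db_start = build_combined(["ACGTC"], ["TTATGAC"])
print(combined, db_start)
print(equivalent_under(combined, 1, 9, "10101"))
```

A single integer per position is enough here because the sequence containing a position can be recovered from the boundary locations when needed.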
Two positions p1 and p2 of the combined sequence are equivalent if there is a word model in the set such that the two positions are equivalent under the model, or there is a position p3 of the combined sequence such that p1 and p3 are equivalent, and p3 and p2 are equivalent. Assume that each position is equivalent to itself.

The algorithm for finding HSPs between query and database sequences under the set of word models consists of two major steps. In step 1, the sets of equivalent positions are computed. Then every query position is linked to a list of database positions equivalent to the query position. In step 2, for each query sequence Q, HSPs between Q and the database sequences are computed as follows. For each position q of Q and for each position d in the list of database positions equivalent to q, if the pair of positions q and d is not covered by any HSP that is already computed, then a pair of words starting at the pair of positions is extended into an HSP and the HSP is saved if its score is greater than a cutoff. HSPs between Q and a database sequence are combined into high-scoring chains of HSPs (Wilbur and Lipman, 1983; Huang, 2002).

Below we describe step 1 and parts of step 2 in detail. In step 1, initially, each position of the combined sequence is a se (...truncated)
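The equivalence relation defined above is the transitive closure of equivalence under the individual word models, so one way to compute its classes is with a union-find (disjoint-set) structure. The sketch below uses that technique as an assumption; the text available here does not say how DDS2 computes the sets in step 1. The boundary character `#` and all function names are hypothetical.

```python
from collections import defaultdict

REGULAR = set("ACGT")  # words containing any other character are skipped


def equivalence_finder(combined, models):
    """Merge positions that are equivalent under any model; return a find function.

    Positions are 1-based. Each position starts in a set by itself, and
    positions whose words agree at the checked positions of a model are merged.
    """
    n = len(combined)
    parent = list(range(n + 1))  # index 0 is unused

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for model in models:
        k = len(model)
        checked = [i for i, bit in enumerate(model) if bit == "1"]
        first_seen = {}  # bases at checked positions -> first position seen
        for pos in range(1, n - k + 2):
            word = combined[pos - 1 : pos - 1 + k]
            if any(c not in REGULAR for c in word):
                continue
            key = "".join(word[i] for i in checked)
            if key in first_seen:
                union(first_seen[key], pos)
            else:
                first_seen[key] = pos
    return find


def link_queries(combined, models, db_start):
    """Map each query position to the list of database positions equivalent to it."""
    find = equivalence_finder(combined, models)
    db_by_class = defaultdict(list)
    for d in range(db_start, len(combined) + 1):
        db_by_class[find(d)].append(d)
    return {
        q: db_by_class[find(q)]
        for q in range(1, db_start)
        if find(q) in db_by_class
    }


# Query ACGTT; database sequences ACGAA and TTGAA; database part starts at 7.
links = link_queries("ACGTT#ACGAA#TTGAA", ["11100", "00111"], 7)
print(links)
```

In this example, query position 1 is linked to database position 13 only through the transitive step via position 7, since the words at positions 1 and 13 do not match under either model directly.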