Tracking repeats using significance and transitivity

Bioinformatics, Aug 2004

Motivation: Internal repeats in coding sequences correspond to structural and functional units of proteins. Moreover, duplication of fragments of coding sequences is known to be a mechanism to facilitate evolution. Identification of repeats is crucial to shed light on the function and structure of proteins, and explain their evolutionary past. The task is difficult because during the course of evolution many repeats diverged beyond recognition.

Results: We introduce a new method, TRUST, for ab initio determination of internal repeats in proteins. It provides an improvement in prediction quality as compared to alternative state-of-the-art methods. The increased sensitivity and accuracy of the method are achieved by exploiting the concept of transitivity of alignments. Starting from significant local suboptimal alignments, the application of transitivity allows us to (1) identify distant repeat homologues for which no alignments were found; (2) gain confidence about consistently well-aligned regions; and (3) recognize and reduce the contribution of non-homologous repeats. This re-assessment step enables us to derive a virtually noise-free profile representing a generalized repeat with high fidelity. We also obtained superior specificity by employing rigid statistical testing for self-sequence and profile-sequence alignments. Assessment was done using a database of repeat annotations based on structural superpositioning. The results show that TRUST is a useful and reliable tool for mining tandem and non-tandem repeats in protein sequence databases, capable of predicting multiple repeat types with varying intervening segments within a single sequence.

Availability: The TRUST server (together with the source code) is available at http://ibivu.cs.vu.nl/programs/trustwww

Radek Szklarczyk and Jaap Heringa

Centre for Integrative Bioinformatics (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Vrije Universiteit Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands

1 INTRODUCTION

Internal repeats within protein sequences have been intensely studied since they have wide-ranging implications for the evolution and function of proteins. A classical example is chymotrypsin, which evolved through the duplication of an ancestral barrel domain, such that the active site of the modern protein comprises amino acids of either domain (Heringa, 1994). Another example is the zinc finger domain, a frequent constituent of transcription factors involved in DNA binding, where the composition and copy number of individual tandem repeats confer selectivity and activity of DNA binding.

Proper delineation of repeats at the sequence level is not only important for understanding the structure and function of proteins, but is also crucial for the detection of homologous sequences and other techniques based on sequence analysis. This is because repeats often pose a problem for alignment methods that normally are ill-prepared to deal with them.

In this paper, we introduce the method TRUST (Tracking Repeats Using Significance and Transitivity), which is capable of detecting internal sequence repeats based on sequence information of an individual sequence alone. The method exploits the concept of transitivity of alignments as well as a statistical scheme optimized for the evaluation of repeat significance.

Algorithm

The TRUST algorithm detects repeats without any prior knowledge. It relies on a scheme to assess the statistical significance (P-value) of repeat alignment scores, as opposed to various parameters and arbitrary thresholds used by other methods. However, the key strategy of the method is to employ transitivity: using logical inference from alignments, we introduce new information that can identify distant homologous regions and at the same time can support or contradict existing suboptimal alignments.
The transitivity scheme enables us to calculate the repeat length accurately, and allows the generation of virtually noise-free and sensitive profiles.

2.1.1 Extracting alignments

Detection of suboptimal alignments is performed with the Waterman-Eggert algorithm (Waterman and Eggert, 1987). In self-sequence comparison, the highest-scoring alignment trivially covers the diagonal of the dynamic-programming matrix; therefore, we mask the matrix diagonal before the procedure starts. Note that in the self-comparison, the lower and upper triangles of the matrix are symmetrical. An alignment can be represented as a number of dots in a two-dimensional (2D) matrix, each dot representing a matched residue pair; we call such a sequence of dots a trace (Fig. 1). A value is assigned to each trace: for traces representing alignments the value is simply the alignment score (Fig. 2a). We will use the terms alignment and trace interchangeably.

Fig. 1. (a) Matrix with the best-scoring self-alignments within the sequence PVALVALPVAL. Each black cell represents a pair of residues matched in a local alignment. The matrix diagonal and lower triangle are not shown. (b) Equivalent graph representation of the alignments from (a), where residues aligned are connected by edges.

2.1.2 Estimating the significance of the alignments

To assess the biological significance of suboptimal alignments containing repeats, we use P-values, defined as the probability of obtaining an alignment with at least the same score by self-alignment of scrambled sequences. Alignments with P-values lower than the default threshold of 1% are considered significant and are included in further analysis.

The distribution of the scores of highest-scoring local alignments in random sequences can be approximated with the Extreme Value Distribution (EVD) (Gumbel, 1958). When no gaps are allowed in the alignments (gap penalty = ∞), the distribution of the highest alignment scores is proven to follow the EVD (Karlin and Altschul, 1990).
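
The significance estimate described above can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: a plain Smith-Waterman recurrence with linear gap costs stands in for the Waterman-Eggert procedure (only the single best score is computed, not the full set of suboptimal alignments), the identity scoring values `match`, `mismatch` and `gap` are illustrative, the EVD parameters are fitted by the method of moments, and all function names are hypothetical.

```python
import math
import random


def best_self_alignment_score(seq, match=2, mismatch=-1, gap=-2):
    """Best local self-alignment score using only cells with j > i."""
    n = len(seq)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        # Starting at j = i + 1 masks the diagonal and the mirrored lower triangle.
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def gumbel_p_value(score, null_scores):
    """P(S >= score) under an EVD fitted to null_scores by the method of moments."""
    n = len(null_scores)
    mean = sum(null_scores) / n
    var = sum((s - mean) ** 2 for s in null_scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi      # scale parameter
    if beta == 0.0:                            # degenerate null: all scores equal
        return 1.0 if score <= mean else 0.0
    mu = mean - 0.5772156649015329 * beta      # location (Euler-Mascheroni constant)
    z = max((score - mu) / beta, -700.0)       # clamp to avoid overflow in exp
    return -math.expm1(-math.exp(-z))          # 1 - exp(-exp(-z))


def self_alignment_p_value(seq, n_shuffles=200, seed=0):
    """Observed best score and its P-value against scrambled versions of seq."""
    rng = random.Random(seed)
    observed = best_self_alignment_score(seq)
    residues = list(seq)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)                  # scrambling preserves composition
        null_scores.append(best_self_alignment_score("".join(residues)))
    return observed, gumbel_p_value(observed, null_scores)
```

Calling `self_alignment_p_value(seq)` returns the observed score together with its estimated P-value, which could then be compared with the 1% threshold mentioned in the text. Fitting the distribution per query sequence keeps the null model matched to that sequence's length and composition, at the cost of a less stable fit for very short sequences.
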
Partial results and further empirical evidence (e.g. Waterman and Vingron, 1994a,b; Vingron and Waterman, 1994; Altschul and Gish, 1996) strongly suggest that the same distribution also applies to alignments with gaps. A benefit of the Extreme Value theory is the ease with which the distribution can be approximated, with only a limited number of scrambled sequences. We therefore determine the distributions for self-sequence alignment and profile-sequence alignment for each query sequence on the fly.

2.1.3 Transitivity

Transitivity of alignments has been successfully employed in the field of sequence analysis (e.g. Notredame et al., 2000). The effect of transitivity is illustrated in Figure 3. We use transitivity in the following way: if a residue i is matched with a residue j, and j is aligned to k as well, then we infer a correspondence between residues i and k (Fig. 3a). If there already exists a significant alignment containing a match between residues i and k, its validity becomes supported by the transitive alignment. In case this match did not exist between i and k, the inferred (...truncated)
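
The inference rule stated above (i matched with j, j matched with k, hence a correspondence between i and k) can be sketched as a single pass over the graph of matched residues. This is only an illustration under assumptions: the function name and the representation of matches as index pairs are hypothetical, and the sketch does not reproduce how TRUST weights or scores the supported and inferred matches, which the available text does not show.

```python
from itertools import combinations


def transitive_matches(pairs):
    """Apply one round of transitivity to matched residue pairs.

    pairs: iterable of (i, j) residue index pairs taken from significant alignments.
    Returns (supported, inferred): existing matches confirmed through a shared
    partner, and new correspondences not present in any input alignment.
    """
    neighbours = {}
    for i, j in pairs:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)
    existing = {tuple(sorted(p)) for p in pairs}

    supported, inferred = set(), set()
    for linked in neighbours.values():
        # Any two residues matched to the same residue j correspond to each other.
        for i, k in combinations(sorted(linked), 2):
            if (i, k) in existing:
                supported.add((i, k))
            else:
                inferred.add((i, k))
    return supported, inferred
```

For example, `transitive_matches([(1, 4), (4, 7)])` reports `(1, 7)` as an inferred correspondence, whereas adding `(1, 7)` to the input turns all three pairs into supported matches. A set-based adjacency map is used so that a match appearing in several alignments is counted once.
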


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/20/suppl_1/i311.full.pdf
Article home page: http://bioinformatics.oxfordjournals.org/content/20/suppl_1/i311.abstract

Radek Szklarczyk, Jaap Heringa. Tracking repeats using significance and transitivity, Bioinformatics, 2004, pp. i311-i317, 20/suppl 1, DOI: 10.1093/bioinformatics/bth911