Tracking repeats using significance and transitivity
Radek Szklarczyk
Jaap Heringa
Centre for Integrative Bioinformatics (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Vrije Universiteit Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
Motivation: Internal repeats in coding sequences correspond to structural and functional units of proteins. Moreover, duplication of fragments of coding sequences is known to be a mechanism to facilitate evolution. Identification of repeats is crucial to shed light on the function and structure of proteins, and explain their evolutionary past. The task is difficult because during the course of evolution many repeats diverged beyond recognition.
Results: We introduce a new method TRUST, for ab initio determination of internal repeats in proteins. It provides an improvement in prediction quality as compared to alternative state-of-the-art methods. The increased sensitivity and accuracy of the method is achieved by exploiting the concept of transitivity of alignments. Starting from significant local suboptimal alignments, the application of transitivity allows us to (1) identify distant repeat homologues for which no alignments were found; (2) gain confidence about consistently well-aligned regions; and (3) recognize and reduce the contribution of non-homologous repeats. This re-assessment step enables us to derive a virtually noise-free profile representing a generalized repeat with high fidelity. We also obtained superior specificity by employing rigid statistical testing for self-sequence and profile-sequence alignments. Assessment was done using a database of repeat annotations based on structural superpositioning. The results show that TRUST is a useful and reliable tool for mining tandem and non-tandem repeats in protein sequence databases, capable of predicting multiple repeat types with varying intervening segments within a single sequence.
Availability: The TRUST server (together with the source code) is available at http://ibivu.cs.vu.nl/programs/trustwww
Contact:
1 INTRODUCTION
Internal repeats within protein sequences have been intensely
studied since they have wide-ranging implications for the
evolution and function of proteins. A classical example is
chymotrypsin, which evolved through the duplication of an
ancestral barrel domain, such that the active site of the
modern protein comprises amino acids of either domain (Heringa,
1994). Another example is the zinc finger domain, a frequent
constituent of transcription factors involved in DNA binding,
where the composition and copy number of individual tandem
repeats confer selectivity and activity of DNA binding.
Proper delineation of repeats at the sequence level is not
only important for understanding the structure and function
of proteins, but is also crucial for the detection of homologous
sequences and other techniques based on sequence analysis.
This is because repeats often pose a problem for alignment
methods that normally are ill-prepared to deal with them.
In this paper, we introduce the method TRUST (Tracking
Repeats Using Significance and Transitivity), which is
capable of detecting internal sequence repeats based on sequence
information of an individual sequence alone. The method
exploits the concept of transitivity of alignments as well as
a statistical scheme optimized for the evaluation of repeat
significance.
2.1 Algorithm
The TRUST algorithm detects repeats without any prior
knowledge. It relies on a scheme to assess the statistical
significance (P-value) of repeat alignment scores, as opposed
to various parameters and arbitrary thresholds used by other
methods. However, the key strategy of the method is to employ
transitivity: using logical inference from alignments, we
introduce new information that can identify distant homologous
regions and at the same time can support or contradict
existing suboptimal alignments. The transitivity scheme enables
us to calculate the repeat length accurately, and allows the
generation of virtually noise-free and sensitive profiles.
2.1.1 Extracting alignments Detection of suboptimal
alignments is performed with the Waterman-Eggert algorithm
(Waterman and Eggert, 1987). In self-sequence comparison,
the highest-scoring alignment trivially covers the diagonal of
the dynamic-programming matrix; therefore, we mask the
matrix diagonal before the procedure starts. Note that in the
self-comparison, the lower and upper triangle of the matrix
are symmetrical.
Fig. 1. (a) Matrix with the best-scoring self-alignments within the
sequence PVALVALPVAL. Each black cell represents a pair of
residues matched in a local alignment. The matrix diagonal and
lower triangle are not shown. (b) Equivalent graph representation
of the alignments from (a), where residues aligned are connected by
edges.
An alignment can be represented as a number of dots
in a two-dimensional (2D) matrix, each dot representing a
matched residue pair; we call such a sequence of dots a
trace (Fig. 1). A value is assigned to each trace: for traces
representing alignments the value is simply the alignment
score (Fig. 2a). We will use the terms alignment and trace
interchangeably.
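As an illustration of the masked self-comparison and of the trace representation, the following minimal Python sketch finds a single best local self-alignment in the upper triangle of the matrix. It is not the Waterman-Eggert procedure itself, which goes on to extract further non-intersecting suboptimal alignments. The function name best_self_alignment and the identity-based scoring parameters (match, mismatch, gap) are assumptions made for this example; a real application would score residue pairs with an amino acid exchange matrix.

```python
def best_self_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Best local self-alignment of seq restricted to the upper triangle.

    Returns (score, trace), where trace is a list of 0-based residue
    pairs (i, j) with i < j. Scoring values are illustrative assumptions.
    """
    n = len(seq)
    # H[i][j]: best score of a local alignment ending at cell (i, j), 1-based.
    # Cells with j <= i (diagonal and lower triangle) are never filled,
    # so they stay at 0 and are effectively masked.
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best, best_cell = 0, None
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # residue pair matched
                          H[i - 1][j] + gap,     # gap in the second copy
                          H[i][j - 1] + gap)     # gap in the first copy
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    trace = []
    if best_cell is None:
        return 0, trace

    # Trace back from the best cell until the local alignment starts.
    i, j = best_cell
    while H[i][j] > 0:
        s = match if seq[i - 1] == seq[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            trace.append((i - 1, j - 1))  # one dot of the trace
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    trace.reverse()
    return best, trace


if __name__ == "__main__":
    print(best_self_alignment("PVALVALPVAL", gap=-100))
```

For the sequence PVALVALPVAL of Figure 1, with gaps effectively disabled (gap=-100), the sketch returns a score of 8 and the trace (0, 7), (1, 8), (2, 9), (3, 10), i.e. the two PVAL copies matched residue by residue.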
2.1.2 Estimating the significance of the alignments To
assess the biological significance of suboptimal alignments
containing repeats, we use P-values, defined as the
probability of obtaining an alignment with the same or a higher
score by self-alignment of scrambled sequences. Alignments with
P-values lower than the default threshold of 1% are
considered significant and are included in further analysis.
The distribution of the scores of highest-scoring local
alignments in random sequences can be approximated with the
Extreme Value Distribution (EVD) (Gumbel, 1958). When
no gaps are allowed in the alignments (gap penalty = ∞),
the distribution of the highest alignment scores is proven to
follow the EVD (Karlin and Altschul, 1990). Partial results
and further empirical evidence (e.g. Waterman and Vingron,
1994a,b; Vingron and Waterman, 1994; Altschul and Gish,
1996) strongly suggest that the same distribution also applies
to alignments with gaps. A benefit of the Extreme Value theory
is the ease with which the distribution can be approximated,
with only a limited number of scrambled sequences. We
therefore determine the distributions for self-sequence alignment
and profile-sequence alignment for each query sequence on
the fly.
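The on-the-fly estimate can be sketched as follows. Under the EVD with location mu and scale parameter lambda, the probability of a score of at least x is P(S >= x) = 1 - exp(-exp(-lambda (x - mu))). The Python sketch below scrambles the query, collects the best self-alignment score of each scrambled copy, and fits mu and lambda by the method of moments. The fitting procedure, the number of shuffles, the identity-based scoring and all function names are assumptions made for this example; the text above does not specify how TRUST estimates the parameters.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def best_self_score(seq, match=2, mismatch=-1, gap=-2):
    """Best local self-alignment score in the upper triangle (diagonal masked)."""
    n = len(seq)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def fit_gumbel(scores):
    """Method-of-moments estimate of the EVD parameters (mu, lam)."""
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("null scores do not vary; cannot fit the EVD")
    lam = math.pi / (sd * math.sqrt(6))
    mu = statistics.fmean(scores) - EULER_GAMMA / lam
    return mu, lam


def evd_p_value(score, mu, lam):
    """P(S >= score) under the EVD: 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def self_alignment_p_value(seq, n_shuffles=200, seed=0):
    """P-value of the best self-alignment of seq against scrambled copies."""
    rng = random.Random(seed)
    residues = list(seq)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # scrambling keeps the residue composition
        null_scores.append(best_self_score("".join(residues)))
    mu, lam = fit_gumbel(null_scores)
    return evd_p_value(best_self_score(seq), mu, lam)


if __name__ == "__main__":
    p = self_alignment_p_value("ACDEFGHIKLMNPQRSTVWY" * 2)
    print("significant" if p < 0.01 else "not significant", p)
```

Because scrambling preserves the composition of the query, the fitted distribution is specific to each sequence, which is the reason for estimating it per query rather than from a fixed table.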
2.1.3 Transitivity Transitivity of alignments has been
successfully employed in the field of sequence analysis (e.g.
Notredame et al., 2000). The effect of transitivity is
illustrated in Figure 3. We use transitivity in the following way:
if a residue i is matched with a residue j , and j is aligned to
k as well, then we infer a correspondence between residues
i and k (Fig. 3a). If there already exists a significant
alignment containing a match between residues i and k, its validity
becomes supported by the transitive alignment. In case this
match did not exist between i and k, the inferred (...truncated)
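The inference rule described above (i matched with j, and j matched with k, implies a correspondence between i and k) can be sketched on the graph representation of Figure 1b. The Python sketch below performs one round of transitivity over a set of matched residue pairs and separates pairs that were already observed, and are therefore supported, from pairs that are newly inferred. The function name transitive_matches and the plain set-based representation are assumptions made for this example; how TRUST weights or scores such pairs is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def transitive_matches(matches):
    """One round of transitivity over matched residue pairs.

    matches: iterable of (i, j) residue index pairs taken from alignments.
    Returns (supported, inferred): pairs (i, k) implied by a common
    partner j that were already observed, and pairs that were not.
    """
    observed = {tuple(sorted(pair)) for pair in matches}

    # Graph view: residues are nodes, matched pairs are edges.
    neighbours = defaultdict(set)
    for i, j in observed:
        neighbours[i].add(j)
        neighbours[j].add(i)

    supported, inferred = set(), set()
    for j, linked in neighbours.items():
        # Any two residues both matched to j correspond to each other.
        for i, k in combinations(sorted(linked), 2):
            if (i, k) in observed:
                supported.add((i, k))
            else:
                inferred.add((i, k))
    return supported, inferred


if __name__ == "__main__":
    # PVALVALPVAL (0-based): VAL at 1-3 aligned to VAL at 4-6,
    # and VAL at 4-6 aligned to VAL at 8-10.
    pairs = [(1, 4), (2, 5), (3, 6), (4, 8), (5, 9), (6, 10)]
    print(transitive_matches(pairs))
```

In this example no alignment links the first and the last VAL copy directly, yet the pairs (1, 8), (2, 9) and (3, 10) are inferred through the middle copy.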