Identifying distantly related protein sequences
CABIOS INVITED REVIEW
Vol. 13 no. 4 1997
Pages 325-332
William R. Pearson
Department of Biochemistry, Jordan Hall #440, University of Virginia,
Charlottesville, VA 22908, USA
E-mail:
© Oxford University Press
Introduction
The most powerful method available today for inferring the
biological function of a gene (or the protein that it encodes)
from its sequence is similarity searching on protein and DNA
sequence databases. With the development of rapid methods
for sequence comparison, both with heuristic algorithms and
powerful parallel computers, discoveries based solely on
sequence homology have become routine. Indeed, the vast
majority of the gene identifications in the recent descriptions
of the Haemophilus influenzae (Fleischmann et al., 1995),
Mycoplasma genitalium (Fraser et al., 1995), yeast (Dujon,
1996) and Methanococcus jannaschii (Bult et al., 1996)
genomes are based only on protein sequence similarity. As
more complete genomes become available, protein sequence
comparison will become an even more powerful tool for
understanding biological function.
Protein sequence comparison is a powerful tool because of
the enormous amount of information that is preserved
throughout the evolutionary process. For many protein
sequences, an evolutionary history can be traced back
1-2.5 billion years. Proteins that share a common ancestor are
called homologous. Sequence comparison is most informative
when it detects homologous proteins. Homologous
proteins always share a common three-dimensional folding
structure and they often share common active sites or binding
domains. Frequently, homologous proteins share common
functions, but sometimes they do not. Our ability to
characterize the biological properties of a protein based on
sequence data alone stems almost exclusively from properties
conserved through evolutionary time. Predictions of common
properties for non-homologous proteins (similarities that
have arisen by convergence) are much less reliable.
While sequence similarity searching is a routine method
for characterizing newly determined DNA and protein
sequences, researchers sometimes fail to exploit fully the
information that is available from similarity searches of
protein sequence databases. This review examines two
strategies for using similarity search information more
effectively: (i) looking for alignments that span an entire
folding domain, rather than a short sequence motif, and (ii)
re-examining sequences with high, but not statistically
significant, similarity scores. For a broader perspective on
sequence comparison and identification of homologous
proteins, see Altschul et al. (1994) and Pearson (1996).
Members of the trypsin-like serine protease superfamily
('trypsin-like' distinguishes these serine proteases from other
serine protease families, notably the subtilisins, that use
serine in the active site but have very different structures and
thus are not homologous) provide a classic example of a
family of proteins with a highly conserved active site. While
highly conserved motifs from this site are informative, serine
proteases share similarity throughout the length of the
protease domain, not just around the active site residues.
The trypsin-like serine protease family is quite diverse,
with a number of very distantly related homologues. Thus, it
can be difficult to demonstrate that Streptomyces griseus
protease A and protease B are homologous based on sequence
similarity alone. The second part of this review shows that by
carefully re-examining sequences with high-scoring, but not
statistically significant, similarity scores, it is possible to
identify several proteins that share significant similarity with
both the mammalian trypsin-like serine proteases and their
distant prokaryotic homologues.
Motifs, homology, and the serine proteases
A common misconception in protein sequence comparison is
that homologous proteins share sequence similarity mostly
(or only) near the active site regions or other functional
domains in a protein. This partly accounts for the popularity
of databases of sequence motifs, such as PROSITE (Bairoch,
1991), which tabulate amino acid patterns that can be used to
identify most of the members of a protein family. For features
that result from convergence to a common property, such as
glycosylation and phosphorylation sites, sequence motifs are
uniquely informative. However, for features that result from
divergence from a common ancestor, such as the serine
protease active site residues, sequence motifs provide only a
highly abstracted summary of the sequence conservation in a
family. Because they share a common three-dimensional
structure, homologous proteins share sequence similarity
over large regions, typically the entire protein fold.
The trypsin-like serine protease superfamily is a classic
example of a protein family whose members share several
simple motifs that are diagnostic for the family (Figure 1).
ID   TRYPSIN_HIS; PATTERN.
AC   PS00134;
DE   Serine proteases, trypsin family, histidine active site.
PA   [LIVM]-[ST]-A-[STAG]-H-C.
NR   /TOTAL=158(158); /POSITIVE=154(154); /UNKNOWN=2(2); /FALSE_POS=2(2);
NR   /FALSE_NEG=11(11);
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=5,active_site;

ID   TRYPSIN_SER; PATTERN.
AC   PS00135;
DE   Serine proteases, trypsin family, serine active site.
PA   G-D-S-G-G.
NR   /TOTAL=160(160); /POSITIVE=151(151); /UNKNOWN=1(1); /FALSE_POS=8(8);
NR   /FALSE_NEG=16(16);
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=3,active_site;
Fig. 1. Patterns for serine proteases. Patterns from PROSITE that identify 152/163 TRYPSIN_HIS or 143/159 TRYPSIN_SER members of the trypsin-like
serine protease protein family.
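The PA lines in Figure 1 can be read as simple regular expressions. The following is a minimal sketch, assuming only the common elements of the PROSITE pattern syntax (residue classes in square brackets, excluded residues in braces, `x` for any residue, repeat counts in parentheses, and `<`/`>` terminal anchors). The function name `prosite_to_regex` and the test peptide are hypothetical and serve only as illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE PA-line pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the pattern to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Optional repeat count, e.g. x(2) or x(2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


his_regex = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C.")
ser_regex = prosite_to_regex("G-D-S-G-G.")

# Made-up peptide used only to exercise the two patterns.
toy_peptide = "AAWVVSAAHCYKSGDSGGPVV"
his_hit = re.search(his_regex, toy_peptide)
ser_hit = re.search(ser_regex, toy_peptide)
print(his_regex, his_hit.group(0), his_hit.start())
print(ser_regex, ser_hit.group(0), ser_hit.start())
```

A match of this kind covers only five or six residues, which is why a pattern hit is a much weaker statement than an alignment that spans the whole protease domain.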
Serine proteases cleave peptide bonds using a 'catalytic triad'
of histidine, serine and aspartic acid that are required for the
protease function. Because these residues are so highly
conserved, patterns that focus on two of the regions (Figure 1)
can be used to identify most members of the serine protease
family. (The subtilisin-like serine proteases use exactly the
same catalytic triad, but the families are non-homologous
with very different three-dimensional structures.)
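The counts on the NR lines of Figure 1 give a rough sense of how well each pattern performs. The sketch below takes those counts exactly as they appear in the figure and assumes that /POSITIVE counts true family members that match, /FALSE_NEG counts members that are missed, and /FALSE_POS counts unrelated sequences that match. The helper name `pattern_performance` is hypothetical.

```python
def pattern_performance(true_pos: int, false_neg: int, false_pos: int):
    """Return (sensitivity, precision) for a pattern from its hit counts."""
    sensitivity = true_pos / (true_pos + false_neg)  # fraction of members found
    precision = true_pos / (true_pos + false_pos)    # fraction of hits that are members
    return sensitivity, precision


# (POSITIVE, FALSE_NEG, FALSE_POS) as listed on the NR lines of Figure 1.
figure1_counts = {
    "TRYPSIN_HIS": (154, 11, 2),
    "TRYPSIN_SER": (151, 16, 8),
}

for name, counts in figure1_counts.items():
    sensitivity, precision = pattern_performance(*counts)
    print(f"{name}: sensitivity={sensitivity:.3f} precision={precision:.3f}")
```

On these counts neither pattern recovers the whole family, and each also matches a few unrelated sequences.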
Most members of the trypsin-like serine protease superfamily are readily identified by sequence similarity searching.
The results from a typical protein database search using the
Smith-Waterman algorithm (Smith and Waterman, 1981) are
shown in Figure 2. All of the eukaryotic trypsin-like serine
proteases share statistically significant similarity with the
bovine trypsin query sequence. However, as is often the case
for divergent protein families, some prokaryotic members of
the family do not share statistically significant similarity with
bovine trypsin. These sequences are italicized in Figure 2;
their membership in the serine protease family is usually
inferred from their common three-dimensional structures
(Figure 5).
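For readers unfamiliar with the Smith-Waterman algorithm, the local alignment score it computes can be sketched in a few lines. This is an illustration only, not the search program used for Figure 2. It assumes a flat match/mismatch score and a linear gap penalty, whereas database searches normally use an amino acid substitution matrix and separate gap-open and gap-extend penalties. The function name and default parameter values are hypothetical.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + s,      # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Toy strings only: a five-residue motif embedded in a longer sequence.
print(smith_waterman_score("GDSGG", "AAGDSGGAA"))
print(smith_waterman_score("AAAA", "CCCC"))
```

Because every cell is floored at zero, the score reflects the best-matching region rather than the full length of either sequence. A raw score like this says nothing about statistical significance on its own; that requires comparing it with the distribution of scores from unrelated sequences.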
The absolute conservation of residues in the (...truncated)