Identifying distantly related protein sequences

Bioinformatics, Aug 1997

William R. Pearson; Identifying distantly related protein sequences, Bioinformatics, Volume 13, Issue 4, 1 August 1997, Pages 325–332, https://doi.org/10.1

CABIOS INVITED REVIEW, Vol. 13 no. 4 1997, Pages 325-332

William R. Pearson

Department of Biochemistry, Jordan Hall #440, University of Virginia, Charlottesville, VA 22908, USA

© Oxford University Press

Introduction

The most powerful method available today for inferring the biological function of a gene (or the protein that it encodes) from its sequence is similarity searching on protein and DNA sequence databases. With the development of rapid methods for sequence comparison, both with heuristic algorithms and powerful parallel computers, discoveries based solely on sequence homology have become routine. Indeed, the vast majority of the gene identifications in the recent descriptions of the Haemophilus influenzae (Fleischmann et al., 1995), Mycoplasma genitalium (Fraser et al., 1995), yeast (Dujon, 1996) and Methanococcus jannaschii (Bult et al., 1996) genomes are based only on protein sequence similarity. As more complete genomes become available, protein sequence comparison will become an even more powerful tool for understanding biological function.

Protein sequence comparison is a powerful tool because of the enormous amount of information that is preserved throughout the evolutionary process. For many protein sequences, an evolutionary history can be traced back 1-2.5 billion years. Proteins that share a common ancestor are called homologous. Sequence comparison is most informative when it detects homologous proteins. Homologous proteins always share a common three-dimensional folding structure and they often share common active sites or binding domains. Frequently, homologous proteins share common functions, but sometimes they do not. Our ability to characterize the biological properties of a protein based on sequence data alone stems almost exclusively from properties conserved through evolutionary time. Predictions of common properties for non-homologous proteins (similarities that have arisen by convergence) are much less reliable.

While sequence similarity searching is a routine method for characterizing newly determined DNA and protein sequences, researchers sometimes fail to exploit fully the information that is available from similarity searches of protein sequence databases. This review examines two strategies for using similarity search information more effectively: (i) looking for alignments that span an entire folding domain, rather than a short sequence motif, and (ii) re-examining sequences with high, but not statistically significant, similarity scores. For a broader perspective on sequence comparison and identification of homologous proteins, see Altschul et al. (1994) and Pearson (1996).

Members of the trypsin-like serine protease superfamily ('trypsin-like' distinguishes these serine proteases from other serine protease families, notably the subtilisins, that use serine in the active site but have very different structures and thus are not homologous) provide a classic example of a family of proteins with a highly conserved active site. While highly conserved motifs from this site are informative, serine proteases share similarity throughout the length of the protease domain, not just around the active site residues. The trypsin-like serine protease family is quite diverse, with a number of very distantly related homologues. Thus, it can be difficult to demonstrate that Streptomyces griseus protease A and protease B are homologous based on sequence similarity alone. The second part of this review shows that by carefully re-examining sequences with high-scoring, but not statistically significant, similarity scores, it is possible to identify several proteins that share significant similarity with both the mammalian trypsin-like serine proteases and their distant prokaryotic homologues.

Motifs, homology, and the serine proteases

A common misconception in protein sequence comparison is that homologous proteins share sequence similarity mostly (or only) near the active site regions or other functional domains in a protein. This partly accounts for the popularity of databases of sequence motifs, such as PROSITE (Bairoch, 1991), which tabulate amino acid patterns that can be used to identify most of the members of a protein family. For features that result from convergence to a common property, such as glycosylation and phosphorylation sites, sequence motifs are uniquely informative. However, for features that result from divergence from a common ancestor, such as the serine protease active site residues, sequence motifs provide only a highly abstracted summary of the sequence conservation in a family. Because they share a common three-dimensional structure, homologous proteins share sequence similarity over large regions, typically the entire protein fold.

The trypsin-like serine protease superfamily is a classic example of a protein family whose members share several simple motifs that are diagnostic for the family (Figure 1).

    ID   TRYPSIN_HIS; PATTERN.
    AC   PS00134;
    DE   Serine proteases, trypsin family, histidine active site.
    PA   [LIVM]-[ST]-A-[STAG]-H-C.
    NR   /TOTAL=158(158); /POSITIVE=154(154); /UNKNOWN=2(2); /FALSE_POS=2(2);
    NR   /FALSE_NEG=11(11);
    CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
    CC   /SITE=5,active_site;

    ID   TRYPSIN_SER; PATTERN.
    AC   PS00135;
    DE   Serine proteases, trypsin family, serine active site.
    PA   G-D-S-G-G.
    NR   /TOTAL=160(160); /POSITIVE=151(151); /UNKNOWN=1(1); /FALSE_POS=8(8);
    NR   /FALSE_NEG=16(16);
    CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
    CC   /SITE=3,active_site;

Fig. 1. Patterns for serine proteases. Patterns from PROSITE that identify 152/163 TRYPSIN_HIS or 143/159 TRYPSIN_SER members of the trypsin-like serine protease protein family.
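The two PROSITE patterns in Figure 1 can be tried out directly by translating them into regular expressions. The following is a minimal sketch, not PROSITE's own software: the function names `prosite_to_regex` and `find_motif` are hypothetical, only the common pattern elements are handled (single residues, `[...]` alternatives, `{...}` exclusions, `x`, repeat counts, and the `<` and `>` anchors), and the toy sequence is invented for illustration and is not a real protease.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (e.g. 'G-D-S-G-G.') into a Python regex string.

    Assumption: only the common element types are supported.
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Optional repeat count such as x(3) or x(2,4)
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                   # single residue or [ABC] set
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Invented toy sequence that happens to contain both motifs.
TOY_SEQUENCE = "MKTLVLSAAHCGGQGDSGGPLV"

if __name__ == "__main__":
    print(find_motif(TOY_SEQUENCE, "[LIVM]-[ST]-A-[STAG]-H-C."))
    print(find_motif(TOY_SEQUENCE, "G-D-S-G-G."))
```

A match of this kind says only that a short pattern is present. As the false positive and false negative counts in Figure 1 indicate, it is neither necessary nor sufficient evidence of homology, which is the point the review goes on to make.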
Serine proteases cleave peptide bonds using a 'catalytic triad' of histidine, serine and aspartic acid that are required for the protease function. Because these residues are so highly conserved, patterns that focus on two of the regions (Figure 1) can be used to identify nearly every member of the serine protease family. (The subtilisin-like serine proteases use exactly the same catalytic triad, but the families are non-homologous with very different three-dimensional structures.)

Most members of the trypsin-like serine protease superfamily are readily identified by sequence similarity searching. The results from a typical protein database search using the Smith-Waterman algorithm (Smith and Waterman, 1981) are shown in Figure 2. All of the eukaryotic trypsin-like serine proteases share statistically significant similarity with the bovine trypsin query sequence. However, as is often the case for divergent protein families, some prokaryotic members of the family do not share statistically significant similarity with bovine trypsin. These sequences are italicized in Figure 2; their membership in the serine protease family is usually inferred from their common three-dimensional structures (Figure 5). The absolute conservation of residues in the (...truncated)
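To make the Smith-Waterman search mentioned above more concrete, here is a minimal sketch of the local alignment score calculation. It is a simplification under stated assumptions and not the search program used for Figure 2: the function name `smith_waterman_score` is hypothetical, the scoring uses a flat match/mismatch value and a linear gap penalty, whereas protein database searches normally use an amino acid substitution matrix and separate gap-open and gap-extend penalties. The parameter values are illustrative only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumptions: flat match/mismatch scores and a linear gap penalty,
    chosen for illustration rather than taken from any published search.
    """
    best = 0
    prev = [0] * (len(b) + 1)            # scores for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                        # start a new local alignment
                prev[j - 1] + pair,       # align a[i-1] with b[j-1]
                prev[j] + gap,            # gap in b
                curr[j - 1] + gap,        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The serine active-site motif embedded in a longer toy sequence.
    print(smith_waterman_score("GDSGG", "AAGDSGGAA"))
```

Because every cell is floored at zero, the score reflects the best-matching region only, so unrelated flanking sequence does not penalize it. Judging whether such a score is statistically significant requires comparing it with the scores obtained against unrelated sequences, which this sketch does not attempt.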


This is a preview of a remote PDF: https://academic.oup.com/bioinformatics/article-pdf/13/4/325/768803/13-4-325.pdf
Article home page: https://academic.oup.com/bioinformatics/article/13/4/325/274640

Pearson, William R.. Identifying distantly related protein sequences, Bioinformatics, 1997, pp. 325-332, Volume 13, Issue 4, DOI: 10.1093/bioinformatics/13.4.325