Combining protein evolution and secondary structure.

Article PDF: https://mbe.oxfordjournals.org/content/13/5/666.full.pdf

Mol. Biol. Evol. (ISSN 0737-4038)

Jeffrey L. Thorne, Nick Goldman, and David T. Jones

Program in Statistical Genetics, Statistics Department, North Carolina State University; Division of Mathematical Biology, National Institute for Medical Research, London; and Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, London

Present address: Department of Biological Sciences, University of Warwick.

Address for correspondence and reprints: Jeffrey L. Thorne, Program in Statistical Genetics, Statistics Department, North Carolina State University, Raleigh, North Carolina.

An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. The model allows likelihood analysis of aligned protein sequences and does not require the underlying secondary (or tertiary) structures of these sequences to be known. One component of the model describes the organization of secondary structure along a protein sequence and another specifies the evolutionary process for each category of secondary structure. A database of proteins with known secondary structures is used to estimate model parameters representing these two components. Phylogeny, the third component of the model, can be estimated from the data set of interest. As an example, we employ our model to analyze a set of sucrose synthase sequences. For the evolution of sucrose synthase, a parametric bootstrap approach indicates that our model is statistically preferable to one that ignores secondary structure.

It is widely recognized that evolutionary divergence of protein structures occurs much less rapidly than divergence of protein sequences (e.g., Chothia and Lesk 1986; Flores et al. 1993). This indicates that selective constraints may act to preserve protein structure. The nature of these constraints is poorly understood and they have received relatively little direct attention in the areas of population genetics and phylogeny reconstruction.
We believe this lack of attention should be rectified. The process of molecular evolution will not be well understood until the constraints that affect it have been characterized.

There has been less work on modelling amino acid replacement than on modelling nucleotide substitution. This disparity may be attributable to the extra complexity of modelling the replacement process. Replacements tend to occur between chemically similar amino acid types and a replacement model should reflect this. There are many ways to categorize amino acids by chemical properties (e.g., hydrophobicity, charge, relative size of side chain), and physicochemical distances between amino acid types have been suggested (Grantham 1974; Taylor and Jones 1993), but these categorizations or physicochemical distances may not directly reflect the differences among amino acid types that are acted upon by evolution.

Cognizant of the difficulty of creating a realistic model for protein evolution based solely on physicochemical principles, Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978) developed an empirical approach. To construct their empirical amino acid transition matrix, sets of easily aligned (i.e., closely related) sequences were collected. Only closely related sequences were considered because, when evolutionary distance is sufficiently small, the possibility of multiple replacements can be ignored. The observed replacement patterns were used to construct the probabilistic replacement model. When the Dayhoff model was originally proposed, relatively few protein sequences were known. More recent studies (Gonnet, Cohen, and Benner 1992; Jones, Taylor, and Thornton 1992) followed the spirit of the Dayhoff approach but were able to tabulate a larger number of observed replacements.
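The counting idea behind such empirical matrices can be illustrated with a minimal sketch. It is a simplification and not the Dayhoff procedure itself: it assumes pairwise alignments of closely related sequences are already available, counts each aligned residue pair symmetrically, and row-normalizes the counts. The function names `count_replacements` and `transition_probabilities` and the optional `pseudocount` parameter are hypothetical choices made for this illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_replacements(aligned_pairs):
    """Tally aligned residue pairs from pairwise alignments.

    Assumption for this sketch: the sequences in each pair are so closely
    related that every observed difference reflects a single replacement.
    Each column is counted in both directions, so the table is symmetric.
    Columns containing gaps or non-standard symbols are skipped.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    return counts


def transition_probabilities(counts, pseudocount=0.0):
    """Row-normalize the count table into empirical probabilities.

    Returns a dict of dicts: probs[a][b] estimates the probability that a
    site holding amino acid a is aligned with amino acid b. Amino acids
    that were never observed (row total of zero) are omitted.
    """
    probs = {}
    for a in AMINO_ACIDS:
        row = {b: counts[(a, b)] + pseudocount for b in AMINO_ACIDS}
        total = sum(row.values())
        if total > 0:
            probs[a] = {b: value / total for b, value in row.items()}
    return probs


if __name__ == "__main__":
    toy_pairs = [("AAAD", "AAAE"), ("DDD", "DDE")]  # toy data, not real proteins
    table = transition_probabilities(count_replacements(toy_pairs))
    print(table["D"]["D"], table["D"]["E"])
```

A realistic analysis would need far more data than this toy input, and it would also have to account for the phylogenetic relationships among the sequences, which this sketch ignores.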
Because the Dayhoff model is empirical, it reflects the fact that different amino acid types are replaced at different rates and the fact that amino acids are usually replaced by chemically similar amino acids.

A problem with the Dayhoff approach is that it effectively models the replacement process at the average site in the average protein. There may be no such thing as an average site in an average protein. The physical environment of a protein site may greatly influence the replacement process at the site. Therefore, there may be variation among sites in the replacement process. Important features of the physical environment might include the secondary structure and whether the site is on the surface or in the interior of a protein. There have been tabulations of the observed number of amino acid replacements for each of several different categories of physical environment (e.g., Overington et al. 1990; Lüthy, McLachlan, and Eisenberg 1991; Topham et al. 1993; Wako and Blundell 1994), but the potential of these tabulations to disentangle phylogenetic correlations and constraints due to physical environment has not been exploited previously.

In this study, we introduce a probabilistic model that relates protein secondary structure to protein evolution. Our empirical approach is similar to the Dayhoff procedure but we model the replacement process for each of several categories of physical environment. We classify the physical environment of a site according to the secondary structure at the site. Three categories of secondary structure (α-helix, β-sheet, and loop) are employed. The term loop is used loosely here to indicate that a site is neither in an α-helix nor in a β-sheet. In future studies, we plan to consider more detailed classification schemes.

The Model

The Known Data Set

We utilize a data set maintained by one of us (D.T.J.)
that contains representative sequences of 207 protein families. Protein families are only included in the data set if the tertiary structure of at least one member of the family has been experimentally determined. We will refer to this data set as the known structure data set. Most of the protein families in the known structure data set are represented by sequences in addition to the one with known tertiary structure. The members of each protein family are aligned. Fortunately, alignment is not difficult for these sequences because the sequences included in the data set are relatively similar to one another. To convert information in the known structure data set from three-dimensional structure to secondary structure categories, we rely on the DSSP computer program (Kabsch and Sander 1983). This program can determine the secondary structure of a protein from the atomic coordinates that specify its tertiary structure. We proceed with the assumption that the category of underlying secondary structure is identical for residues of different sequences that are in the same column of the alignment. This assumption is reasonable because, as noted earlier, structural (...truncated)