Combining protein evolution and secondary structure.
Mol. Biol. Evol.
0737-4038
L. Thorne 0 1 2 3 4 5
Nick 0 1 2 4 5
T. Jones-/-S 0 1 2 4 5
0 2 Present address: Department of Warwick
1 State University; TDivision of Mathematical Structure and Modelling Unit, Department of
2 Department, North Carolina London; and SBiomolecular University College , London
3 Program in Statistical Genetics, Statistics National Institute for Medical Research , Biochemistry and Molecular Biology
4 and reprints: Jeffrey Statistics Department , Raleigh, North Carolina
5 gram in Statistical Genetics, Carolina State University , mail:
An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. It allows likelihood analysis of aligned protein sequences and does not require the underlying secondary (or tertiary) structures of these sequences to be known. One component of the model describes the organization of secondary structure along a protein sequence and another specifies the evolutionary process for each category of secondary structure. A database of proteins with known secondary structures is used to estimate model parameters representing these two components. Phylogeny, the third component of the model, can be estimated from the data set of interest. As an example, we employ our model to analyze a set of sucrose synthase sequences. For the evolution of sucrose synthase, a parametric bootstrap approach indicates that our model is statistically preferable to one that ignores secondary structure.
It is widely recognized that evolutionary divergence of protein structures occurs much less rapidly than divergence of protein sequences (e.g., Chothia and Lesk 1986; Flores et al. 1993). This indicates that selective constraints may act to preserve protein structure. The nature of these constraints is poorly understood and they have received relatively little direct attention in the areas of population genetics and phylogeny reconstruction. We believe this lack of attention should be rectified. The process of molecular evolution will not be well understood until the constraints that affect it have been characterized.
There has been less work on modelling amino acid replacement than on modelling nucleotide substitution. This disparity may be attributable to the extra complexity of modelling the replacement process. Replacements tend to occur between chemically similar amino acid types and a replacement model should reflect this.
There are many ways to categorize amino acids by chemical properties (e.g., hydrophobicity, charge, relative size of side chain), and physicochemical distances between amino acid types have been suggested (Grantham 1974; Taylor and Jones 1993), but these categorizations or physicochemical distances may not directly reflect the differences among amino acid types that are acted upon by evolution. Cognizant of the difficulty of creating a realistic model for protein evolution based solely on physicochemical principles, Dayhoff and coworkers (Dayhoff, Eck, and Eck 1972; Dayhoff, Schwartz, and Orcutt 1978) developed an empirical approach. To construct
their empirical amino acid transition matrix, sets of
easily aligned (i.e., closely related) sequences were
collected. Only closely related sequences were considered
because, when evolutionary distance is sufficiently
small, the possibility of multiple replacements can be
ignored. The observed replacement patterns were used to
construct the probabilistic replacement model. When
the Dayhoff model was originally proposed, relatively
few protein sequences were known. More recent studies
(Gonnet, Cohen, and Benner 1992; Jones, Taylor, and
Thornton 1992) followed the spirit of the Dayhoff
approach but were able to tabulate a larger number of
observed replacements. Because the Dayhoff model is
empirical, it reflects the fact that different amino acid types
are replaced at different rates and the fact that amino
acids are usually replaced by chemically similar amino
acids.
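The counting idea behind an empirical replacement matrix can be illustrated with a minimal Python sketch. It is a simplification and not the Dayhoff procedure itself: it tallies residue pairs over all pairs of aligned sequences and ignores phylogeny, so shared ancestry is counted more than once. The function names `count_replacements` and `row_normalize` and the gap symbol `-` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_replacements(aligned_seqs, gap="-"):
    """Tally symmetric counts of aligned residue pairs.

    Assumption: all sequences are closely related and already aligned,
    so each observed difference is treated as a single replacement.
    """
    counts = Counter()
    for s1, s2 in combinations(aligned_seqs, 2):
        for a, b in zip(s1, s2):
            if a == gap or b == gap:
                continue  # skip columns with a gap in either sequence
            # Count both directions so the table is symmetric.
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def row_normalize(counts):
    """Convert pair counts into per-residue relative frequencies."""
    totals = Counter()
    for (a, _b), n in counts.items():
        totals[a] += n
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE"]  # hypothetical toy data
    table = count_replacements(toy_alignment)
    print(row_normalize(table))
```

The normalized table only describes relative frequencies of observed pairs; turning such counts into a proper transition matrix requires further steps that this sketch does not attempt.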
A problem with the Dayhoff approach is that it
effectively models the replacement process at the
average site in the average protein. There may be no
such thing as an average site in an average protein.
The physical environment of a protein site may greatly
influence the replacement process at the site. Therefore,
there may be variation among sites in the replacement
process. Important features of the physical environment
might include the secondary structure and whether the
site is on the surface or in the interior of a protein. There
have been tabulations of the observed number of amino
acid replacements for each of several different
categories of physical environment (e.g., Overington et al.
1990; Lüthy, McLachlan, and Eisenberg 1991; Topham
et al. 1993; Wako and Blundell 1994), but the potential
of these tabulations to disentangle phylogenetic
correlations and constraints due to physical environment has
not been exploited previously.
In this study, we introduce a probabilistic model
that relates protein secondary structure to protein
evolution. Our empirical approach is similar to the Dayhoff
procedure but we model the replacement process for
each of several categories of physical environment. We
classify the physical environment of a site according to
the secondary structure at the site. Three categories of
secondary structure (α-helix, β-sheet, and loop) are
employed. The term loop is used loosely
here to indicate that a site is neither in an α-helix nor
in a β-sheet. In future studies, we plan to consider more
detailed classification schemes.
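As a rough illustration of tabulating replacements separately for each secondary structure category, the following Python sketch keeps one count table per category. It is not the authors' estimation procedure. It assumes that every alignment column carries a single category label, and the one-letter labels `H` (helix), `E` (sheet), and `L` (loop), the function name `count_by_category`, and the gap symbol `-` are all hypothetical choices for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_by_category(aligned_seqs, column_categories, gap="-"):
    """Tally aligned residue pairs separately for each structure category.

    column_categories holds one label per alignment column
    (assumed labels here: "H" helix, "E" sheet, "L" loop).
    """
    if any(len(s) != len(column_categories) for s in aligned_seqs):
        raise ValueError("every sequence must have one residue per column label")

    tables = defaultdict(Counter)
    for s1, s2 in combinations(aligned_seqs, 2):
        for a, b, cat in zip(s1, s2, column_categories):
            if a == gap or b == gap:
                continue  # gaps carry no replacement information here
            # Symmetric counts within the table for this category.
            tables[cat][(a, b)] += 1
            tables[cat][(b, a)] += 1
    return tables


if __name__ == "__main__":
    toy_alignment = ["AKLV", "AKIV"]  # hypothetical toy data
    toy_labels = "HHEE"               # hypothetical column categories
    for category, table in count_by_category(toy_alignment, toy_labels).items():
        print(category, dict(table))
```

Keeping the tables separate means that a replacement seen in a sheet column contributes nothing to the helix or loop tables, which is the point of conditioning the replacement process on physical environment.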
The Model
The Known Structure Data Set
We utilize a data set maintained by one of us
(D.T.J.) that contains representative sequences of 207
protein families. Protein families are only included in
the data set if the tertiary structure of at least one
member of the family has been experimentally determined.
We will refer to this data set as the known structure
data set. Most of the protein families in the known
structure data set are represented by sequences in
addition to the one with known tertiary structure. The
members of each protein family are aligned. Fortunately,
alignment is not difficult for these sequences because
the sequences included in the data set are relatively
similar to one another.
To convert information in the known structure
data set from three-dimensional structure to secondary
structure categories, we rely on the DSSP computer
program (Kabsch and Sander 1983). This program can
determine the secondary structure of a protein from the
atomic coordinates that specify its tertiary structure. We
proceed with the assumption that the category of
underlying secondary structure is identical for residues of
different sequences that are in the same column of the
alignment. This assumption is reasonable because, as
noted earlier, structural (...truncated)