Reduction of protein sequence complexity by residue grouping
Tanping Li, Ke Fan, Jun Wang and Wei Wang
Protein Engineering 16(5), Oxford University Press; all rights reserved
National Laboratory of Solid State Microstructure, Institute of Biophysics and Department of Physics, Nanjing University, Nanjing 210093, China
To whom correspondence should be addressed. E-mail:

It is well known that there are some similarities among the naturally occurring amino acids. Thus, the complexity of protein systems could be reduced by sorting similar amino acids into groups, so that protein sequences can be simplified using reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet by which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information in the simplified protein sequence relative to the parent sequence, as measured by global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. The coverage in the SCOP40 dataset is obtained for various levels of reduction of the amino acid types, where coverage is the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40. For reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the 20-letter alphabet, which implies that 10 types of amino acids may be the degree of freedom for characterizing the complexity in proteins.
Introduction
Proteins are the elementary building blocks that execute biological
functions in living organisms. There are many types of proteins
in nature that carry out various complicated activities. Proteins
are composed of 20 types of naturally occurring amino acids,
and the majority of proteins are encoded by complex patterns
of these 20 types of amino acids. That is, 20 types of amino
acids introduce not only diversity and complexity into proteins,
but also some specific propensities. For example, some amino
acids are similar in physicochemical properties (Mathews and
Van Holde, 1995) and mutations of amino acids can be
tolerated in many regions of a sequence (Sinha and Nussinov,
2001). It has been discovered experimentally that some
designed proteins with fewer than 20 types of residues can
have stable native structures and contain nearly as much
information as natural proteins (Regan and DeGrado, 1988;
Kamtekar et al., 1993; Davidson et al., 1995; Riddle et al.,
1997).
Recently, a 57-residue Src SH3 domain with a β-barrel-like
structure was studied (Riddle et al., 1997), and 38 out of 40
targeted residues in the domain could be replaced with five
types of residues (Ile, Ala, Glu, Lys, Gly). From a physics
viewpoint, this may imply that a 20-letter alphabet can be
reduced to an N-letter alphabet by clustering similar
amino acids into N groups, and then N letters can be chosen as
the representative residues of these N groups (Chan, 1999;
Wang and Wang, 1999). Obviously, the simplest reduction is
the so-called HP model (Chan and Dill, 1989; Lau and Dill,
1989), where 20 types of amino acids are divided into two
groups: H group and P group (H, hydrophobic residues; P, polar
residues). Interestingly, such a type of simple two-letter HP
model or the HP-like patterns could reproduce, to some extent,
the kinetics and thermodynamics of protein folding and could
be used to study the mechanism of folding (Regan and DeGrado,
1988; Kamtekar et al., 1993; Davidson et al., 1995).
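As a rough illustration of the clustering idea described above, the following Python sketch maps a 20-letter sequence onto a reduced alphabet by replacing each residue with the representative letter of its group. The H/P partition shown here is an assumption chosen only for illustration (the text does not list which residues belong to each group), and the names `HP_GROUPS`, `build_mapping` and `reduce_sequence` are hypothetical.

```python
# Illustrative two-group partition. ASSUMPTION: this H/P assignment is a
# common textbook-style split, not a grouping taken from the paper.
HP_GROUPS = {
    "H": "AVLIMFWC",      # residues treated as hydrophobic here (assumed)
    "P": "GSTPYNQDEKRH",  # residues treated as polar here (assumed)
}


def build_mapping(groups):
    """Invert {representative: members} into {residue: representative}."""
    mapping = {}
    for representative, members in groups.items():
        for residue in members:
            if residue in mapping:
                raise ValueError(f"residue {residue!r} assigned to two groups")
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence, groups):
    """Rewrite a one-letter-code sequence in the reduced alphabet.

    Raises KeyError if the sequence contains a letter that is not
    assigned to any group.
    """
    mapping = build_mapping(groups)
    return "".join(mapping[residue] for residue in sequence.upper())


if __name__ == "__main__":
    print(reduce_sequence("MKVLA", HP_GROUPS))
```

The same function would apply to any N-group partition, for example a five-group one with representatives I, A, E, K and G, once the group memberships are specified.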
Previously, a five-letter alphabet based on the statistical
potential matrix by Miyazawa and Jernigan (MJ) [a pairwise
interaction potential between amino acids (Miyazawa and
Jernigan, 1996)] was studied (Chan, 1999; Wang and Wang,
1999). In that reduction, five representative residues were given
as (Ile, Ala, Glu, Lys, Gly), which coincide with the
experimental results on the 57-residue SH3 domain by Baker and
coworkers (Riddle et al., 1997). (Hereafter, the residues are
simply represented as single letters.) One of the advantages of
such a reduction is that it reduces greatly the complexity of the
protein sequences. It has been shown that sequences with these
five types of letters have good foldability and kinetic
accessibility in studies of protein-model chains (Wang and Wang,
2000). Some other simplified alphabets were also proposed
(Reidhaar-Olson and Sauer, 1988; Smith and Smith, 1990;
Murphy et al., 2000; Solis and Rackovsky, 2000; Cieplak et al.,
2001). For example, the alphabet studied by Murphy et al.
(2000) was obtained from the similarity matrices
of the amino acids that characterize the correlation between the
amino acids. Cieplak et al. (2001) simplified the
folding alphabet based on a distance of the hydrophobicity of
the natural residues defined through the MJ matrix. The
alphabet by Solis and Rackovsky (Solis and Rack (...truncated)