Reduction of protein sequence complexity by residue grouping
Tanping Li, Ke Fan, Jun Wang and Wei Wang
Protein Engineering 16(5), Oxford University Press; all rights reserved
National Laboratory of Solid State Microstructure, Institute of Biophysics and Department of Physics, Nanjing University, Nanjing 210093, China
1To whom correspondence should be addressed. E-mail:
It is well known that there are similarities among the various naturally occurring amino acids. Thus, the complexity of protein systems could be reduced by sorting similar amino acids into groups, and protein sequences could then be simplified using reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet by which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information in the simplified protein sequence relative to the parent sequence, using global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. We obtain the coverage in the SCOP40 dataset for various levels of reduction of the amino acid types, where coverage is the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40. For the reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the alphabet of 20 types of amino acids, which implies that 10 types of amino acids may represent the degrees of freedom needed to characterize the complexity of proteins.
Introduction
Proteins are the elementary blocks which execute biological
functions in living organisms. There are many types of proteins
in nature that carry out various complicated activities. Proteins
are composed of 20 types of naturally occurring amino acids,
and the majority of proteins are encoded by complex patterns
of these 20 types of amino acids. That is, 20 types of amino
acids introduce not only diversity and complexity into proteins,
but also some specific propensities. For example, some amino
acids are similar in physicochemical properties (Mathews and
Van Holde, 1995) and mutations of amino acids can be
tolerated in many regions of a sequence (Sinha and Nussinov,
2001). It has been discovered experimentally that some
designed proteins with fewer than 20 types of residues can
have stable native structures and contain nearly as much
information as natural proteins (Regan and DeGrado, 1988;
Kamtekar et al., 1993; Davidson et al., 1995; Riddle et al.,
1997).
Recently, a 57-residue Src SH3 domain with a β-barrel-like
structure was studied (Riddle et al., 1997), and 38 out of 40
targeted residues in the domain could be replaced with five
types of residues (Ile, Ala, Glu, Lys, Gly). From a physics
viewpoint, this may imply that a 20 letter alphabet can be
reduced into an N letter alphabet by clustering the similar
amino acids into N groups, and then N letters can be chosen as
the representative residues of these N groups (Chan, 1999;
Wang and Wang, 1999). Obviously, the simplest reduction is
the so-called HP model (Chan and Dill, 1989; Lau and Dill,
1989), where 20 types of amino acids are divided into two
groups: H group and P group (H, hydrophobic residues; P, polar
residues). Interestingly, this simple two-letter HP
model, or HP-like patterns, can reproduce, to some extent,
the kinetics and thermodynamics of protein folding and can
be used to study the mechanism of folding (Regan and DeGrado,
1988; Kamtekar et al., 1993; Davidson et al., 1995).
Previously, a five-letter alphabet based on the statistical
potential matrix by Miyazawa and Jernigan (MJ) [a pairwise
interaction potential between amino acids (Miyazawa and
Jernigan, 1996)] was studied (Chan, 1999; Wang and Wang,
1999). In that reduction, the five representative residues were
(Ile, Ala, Glu, Lys, Gly), which coincide with the
experimental results for the 57-residue SH3 domain by Baker and
coworkers (Riddle et al., 1997). (Hereafter, the residues are
simply represented as single letters.) One of the advantages of
such a reduction is that it reduces greatly the complexity of the
protein sequences. It has been shown that sequences with these
five types of letters have good foldability and kinetic
accessibility in studies of protein-model chains (Wang and Wang,
2000). Some other simplified alphabets have also been proposed
(Reidhaar-Olson and Sauer, 1988; Smith and Smith, 1990;
Murphy et al., 2000; Solis and Rackovsky, 2000; Cieplak et al.,
2001). For example, an alphabet studied by Murphy et al.
(Murphy et al., 2000) was obtained from the similarity matrices
of the amino acids that characterize the correlation between the
amino acids. Cieplak et al. (Cieplak et al., 2001) simplified the
folding alphabet based on a distance of the hydrophobicity of
the natural residues defined through the MJ matrix. The
alphabet by Solis and Rackovsky (Solis and Rackovsky, 2000)
was obtained by preserving the maximal information in proteins.
In these studies, the authors analyzed the relations between residues
based on similarities extracted from the
interactions between amino acids or from amino acid sequence
alignments, using various clustering schemes. Each residue
was depicted as a vector in a 20-dimensional space spanned by
its relationships to the other residues. To some degree, however, these
descriptions omit some possible correlations of the residues
within the groups. Would considering the detailed
distribution or correlation of the residues within the groups help
to produce useful groupings for specific
proteins? This is an important question for amino
acid grouping studies, and its answer might promote the application
of the grouping results.
Fig. 1. Sketch map for the simplification of 20 types of residues for group number N = 3. The three groups are: (F, W, Y, C, M, I, L, V), (A, G, T, S, P) and (N, Q, D, E, H, R, K). The representative residues of the three groups are set as X1, X2 and X3, respectively. Seq0 is the original protein sequence and Seqs is the simplified one.
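The simplification sketched in Fig. 1 can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the three groups are those listed in the Fig. 1 caption, but the symbols '1', '2' and '3' are placeholder labels standing in for X1, X2 and X3 (not the representative residues chosen by the method), and the names `GROUPS`, `build_mapping`, `simplify` and the example sequence are hypothetical.

```python
# Groups for N = 3, as listed in the Fig. 1 caption.
GROUPS = ("FWYCMILV", "AGTSP", "NQDEHRK")


def build_mapping(groups):
    """Map each one-letter residue code to a placeholder group symbol.

    The symbols '1', '2', '3' stand in for X1, X2, X3; they are
    illustrative labels only, not representative residues.
    """
    mapping = {}
    for index, members in enumerate(groups, start=1):
        for residue in members:
            if residue in mapping:
                raise ValueError(f"residue {residue!r} appears in two groups")
            mapping[residue] = str(index)
    return mapping


def simplify(sequence, mapping):
    """Rewrite a sequence (Seq0) in the reduced alphabet (Seqs)."""
    return "".join(mapping[residue] for residue in sequence.upper())


MAPPING = build_mapping(GROUPS)

seq0 = "MKVLAAGIWE"  # hypothetical sequence, for illustration only
seqs = simplify(seq0, MAPPING)
print(seq0, "->", seqs)  # MKVLAAGIWE -> 1311222113
```

The three groups together cover all 20 residue types exactly once, so every standard sequence has a unique simplified form. For a chain of length L, the reduction shrinks the number of distinct sequences from 20^L to N^L, here 3^L.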
The naturally occurring frequencies of the 20 types of residues
in proteins follow certain patterns, and the compositions of the 20
types of amino acids in proteins may therefore provide useful
information for the simplification of the residue alphabet. In this paper,
we integrate the information on compositions of residues into
the reduction of the residue alphabet, and cluster similar amino
acids into groups using a global alignment method. The
representative residues for each group are also obtained. Then,
recognition tests with the reduced alphabets are discussed.
Using a simplified BLOSUM matrix based on these
schemes, we perform an all-against-all sequence alignment
and measure the coverage of distantly related
homologous proteins across the SCOP40 database (Brenner
et al., 1998) for various levels of reduction. A pl (...truncated)