Reduction of protein sequence complexity by residue grouping

Protein Engineering Design and Selection, May 2003

It is well known that various naturally occurring amino acids share some similarities. The complexity of protein systems can therefore be reduced by sorting similar amino acids into groups, so that protein sequences can be simplified using reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet by which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information in the simplified protein sequence relative to the parent sequence, using global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. The coverage in the dataset SCOP40 is obtained for various levels of reduction of the amino acid types; coverage is the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40. For reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the alphabet of 20 types of amino acids, which implies that 10 types of amino acids may be the degree of freedom for characterizing the complexity in proteins.
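As a rough illustration of the coverage measure described above, the following Python sketch computes the fraction of reference homologous pairs that a search also detects. The function name `coverage`, the domain identifiers and the toy pair lists are hypothetical and chosen only for illustration; the sketch also assumes that only detected pairs confirmed by the reference set are counted, which the abstract does not state explicitly.

```python
def coverage(detected_pairs, reference_pairs):
    """Fraction of reference homologous pairs that were also detected.

    Pairs are treated as unordered, so ("a", "b") equals ("b", "a").
    Assumption: detected pairs absent from the reference set are ignored.
    """
    reference = {frozenset(pair) for pair in reference_pairs}
    detected = {frozenset(pair) for pair in detected_pairs}
    if not reference:
        raise ValueError("reference set of homologous pairs is empty")
    return len(detected & reference) / len(reference)


# Hypothetical toy data: four pairs marked as homologous by the reference
# classification, three pairs reported by the alignment search.
reference = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"), ("d4", "d5")]
detected = [("d2", "d1"), ("d4", "d5"), ("d1", "d6")]

print(coverage(detected, reference))  # 0.5
```

In this toy case two of the four reference pairs are recovered, and the pair ("d1", "d6") is not counted because the reference set does not mark it as homologous.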

Tanping Li, Ke Fan, Jun Wang, Wei Wang

National Laboratory of Solid State Microstructure, Institute of Biophysics and Department of Physics, Nanjing University, Nanjing 210093, China

Protein Engineering 16(5), Oxford University Press; all rights reserved

To whom correspondence should be addressed. E-mail:

Introduction

Proteins are the elementary blocks which execute biological functions in living organisms. There are many types of proteins in nature that carry out various complicated activities. Proteins are composed of 20 types of naturally occurring amino acids, and the majority of proteins are encoded by complex patterns of these 20 types of amino acids.
That is, the 20 types of amino acids introduce not only diversity and complexity into proteins, but also some specific propensities. For example, some amino acids are similar in physicochemical properties (Mathews and Van Holde, 1995) and mutations of amino acids can be tolerated in many regions of a sequence (Sinha and Nussinov, 2001). It has been discovered experimentally that some designed proteins with fewer than 20 types of residues can have stable native structures and contain nearly as much information as natural proteins (Regan and DeGrado, 1988; Kamtekar et al., 1993; Davidson et al., 1995; Riddle et al., 1997). Recently, a 57-residue Src SH3 domain with a β-barrel-like structure was studied (Riddle et al., 1997), and 38 out of 40 targeted residues in the domain could be replaced with five types of residues (Ile, Ala, Glu, Lys, Gly). From a physics viewpoint, this may imply that a 20-letter alphabet can be reduced to an N-letter alphabet by clustering similar amino acids into N groups, with N letters then chosen as the representative residues of these N groups (Chan, 1999; Wang and Wang, 1999). The simplest reduction is the so-called HP model (Chan and Dill, 1989; Lau and Dill, 1989), in which the 20 types of amino acids are divided into two groups: the H group and the P group (H, hydrophobic residues; P, polar residues). Interestingly, this simple two-letter HP model, or HP-like patterns, could reproduce, to some extent, the kinetics and thermodynamics of protein folding and could be used to study the mechanism of folding (Regan and DeGrado, 1988; Kamtekar et al., 1993; Davidson et al., 1995). Previously, a five-letter alphabet based on the statistical potential matrix of Miyazawa and Jernigan (MJ) [a pairwise interaction potential between amino acids (Miyazawa and Jernigan, 1996)] was studied (Chan, 1999; Wang and Wang, 1999).
In that reduction, five representative residues were given as (Ile, Ala, Glu, Lys, Gly), which coincide with the experimental results for the 57-residue SH3 domain by Baker and coworkers (Riddle et al., 1997). (Hereafter, the residues are simply represented as single letters.) One of the advantages of such a reduction is that it greatly reduces the complexity of the protein sequences. It has been shown that sequences with these five types of letters have good foldability and kinetic accessibility in studies of protein-model chains (Wang and Wang, 2000). Some other simplified alphabets have also been proposed (Reidhaar-Olson and Sauer, 1988; Smith and Smith, 1990; Murphy et al., 2000; Solis and Rackovsky, 2000; Cieplak et al., 2001). For example, an alphabet studied by Murphy et al. (Murphy et al., 2000) was obtained from the similarity matrices of the amino acids that characterize the correlation between the amino acids. Cieplak et al. (Cieplak et al., 2001) simplified the folding alphabet based on a distance of the hydrophobicity of the natural residues defined through the MJ matrix. The alphabet by Solis and Rackovsky (Solis and Rackovsky, 2000) was obtained by retaining the maximal information in proteins. In these studies, the authors analyzed the relation between residues based on their similarities, extracted from the interactions between the amino acids or from amino acid sequence alignment, by using various clustering schemes. Each residue was depicted as a vector in a 20-dimensional space spanned by the residues' inter-relationships. To some degree, however, these descriptions omit some possible correlations of the residues within the groups. Is consideration of the detailed distribution or correlation of the residues in the groups helpful for producing useful groupings related to some specific proteins? Obviously, this is an important question for amino acid grouping studies. Its answer might promote the application of the grouping results.

Fig. 1. Sketch map for the simplification of 20 types of residues for group number N = 3. The three groups are: (F, W, Y, C, M, I, L, V), (A, G, T, S, P) and (N, Q, D, E, H, R, K). The representative residues of the three groups are set as X1, X2 and X3, respectively. Seq0 is the original protein sequence and Seqs is the simplified one.

The naturally occurring frequencies of the 20 types of residues in proteins follow some type of pattern. The compositions of the 20 types of amino acids in proteins may provide useful information for the simplification of the residue alphabet. In this paper, we integrate the information on compositions of residues into the reduction of the residue alphabet, and cluster similar amino acids into groups using a global alignment method. The representative residues for each group are also obtained. Then, the recognition tests with the reduced alphabets are discussed. By using a simplified BLOSUM matrix based on these schemes, we perform an all-against-all sequence alignment and measure the coverage of distantly related homologous proteins throughout the database SCOP40 (Brenner et al., 1998) for various levels of reduction. A pl (...truncated)
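The simplification sketched in Fig. 1 (an original sequence Seq0 mapped to a simplified sequence Seqs) can be illustrated with a minimal Python example. The three groups are taken from the figure caption; the placeholder symbols "1", "2" and "3" standing in for X1, X2 and X3, the function names and the example sequence are assumptions made here for illustration, not the representative residues or procedure derived in the paper.

```python
# Three-group partition quoted in the Fig. 1 caption (N = 3).
GROUPS = ("FWYCMILV", "AGTSP", "NQDEHRK")

# Placeholder symbols for the representatives X1, X2, X3 (illustrative only).
REPRESENTATIVES = ("1", "2", "3")


def build_mapping(groups, representatives):
    """Map every residue letter to the representative symbol of its group."""
    mapping = {}
    for members, representative in zip(groups, representatives):
        for residue in members:
            mapping[residue] = representative
    return mapping


def simplify(sequence, mapping):
    """Rewrite a one-letter amino acid sequence in the reduced alphabet."""
    simplified = []
    for residue in sequence.upper():
        if residue not in mapping:
            raise ValueError(f"residue {residue!r} is not in any group")
        simplified.append(mapping[residue])
    return "".join(simplified)


MAPPING = build_mapping(GROUPS, REPRESENTATIVES)

seq0 = "MKVLAAGIE"                # hypothetical original sequence (Seq0)
seqs = simplify(seq0, MAPPING)    # simplified sequence (Seqs)
print(seqs)                       # 131122213
```

Raising an error for letters outside the 20 standard residues is one possible design choice; a real pipeline might instead pass ambiguous codes through unchanged.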


This is a preview of a remote PDF: https://peds.oxfordjournals.org/content/16/5/323.full.pdf
Article home page: http://peds.oxfordjournals.org/content/16/5/323.abstract

Tanping Li, Ke Fan, Jun Wang, Wei Wang. Reduction of protein sequence complexity by residue grouping, Protein Engineering Design and Selection, 2003, pp. 323-330, 16/5, DOI: 10.1093/protein/gzg044