Reduction of protein sequence complexity by residue grouping

Protein Engineering Design and Selection, May 2003

It is well known that the naturally occurring amino acids share certain similarities. The complexity of protein systems can therefore be reduced by sorting similar amino acids into groups, so that protein sequences can be simplified using reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet with which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information in the simplified protein sequence relative to the parent sequence, as measured by global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. The coverage in the SCOP40 dataset is obtained for various levels of reduction of the amino acid types, where coverage is the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40. For reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the full alphabet of 20 types of amino acids, which implies that 10 types of amino acids may be the degrees of freedom for characterizing the complexity in proteins.

Tanping Li, Ke Fan, Jun Wang, Wei Wang

National Laboratory of Solid State Microstructure, Institute of Biophysics and Department of Physics, Nanjing University, Nanjing 210093, China

Protein Engineering 16(5), Oxford University Press; all rights reserved

Introduction

Proteins are the elementary blocks which execute biological functions in living organisms. There are many types of proteins in nature that carry out various complicated activities. Proteins are composed of 20 types of naturally occurring amino acids, and the majority of proteins are encoded by complex patterns of these 20 types of amino acids.
That is, the 20 types of amino acids introduce not only diversity and complexity into proteins, but also some specific propensities. For example, some amino acids are similar in physicochemical properties (Mathews and Van Holde, 1995) and mutations of amino acids can be tolerated in many regions of a sequence (Sinha and Nussinov, 2001). It has been discovered experimentally that some designed proteins with fewer than 20 types of residues can have stable native structures and contain nearly as much information as natural proteins (Regan and DeGrado, 1988; Kamtekar et al., 1993; Davidson et al., 1995; Riddle et al., 1997). Recently, a 57 residue Src SH3 domain with a β-barrel-like structure was studied (Riddle et al., 1997), and 38 out of 40 targeted residues in the domain could be replaced with five types of residues (Ile, Ala, Glu, Lys, Gly).

From a physics viewpoint, this may imply that a 20 letter alphabet can be reduced to an N letter alphabet by clustering similar amino acids into N groups, and then N letters can be chosen as the representative residues of these N groups (Chan, 1999; Wang and Wang, 1999). Obviously, the simplest reduction is the so-called HP model (Chan and Dill, 1989; Lau and Dill, 1989), where the 20 types of amino acids are divided into two groups: the H group and the P group (H, hydrophobic residues; P, polar residues). Interestingly, this simple two-letter HP model, or HP-like patterns, could reproduce, to some extent, the kinetics and thermodynamics of protein folding and could be used to study the mechanism of folding (Regan and DeGrado, 1988; Kamtekar et al., 1993; Davidson et al., 1995).

Previously, a five-letter alphabet based on the statistical potential matrix by Miyazawa and Jernigan (MJ) [a pairwise interaction potential between amino acids (Miyazawa and Jernigan, 1996)] was studied (Chan, 1999; Wang and Wang, 1999).
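The grouping idea described above, in which the 20 letters are clustered into N groups and each group is replaced by one representative letter, can be sketched in a few lines of Python. This is a minimal illustration only: the two-group HP-style partition in `HP_GROUPS` is an assumed, commonly seen split chosen for demonstration and is not taken from this paper, and the names `build_mapping` and `reduce_sequence` are hypothetical helpers.

```python
from typing import Dict

# ASSUMPTION: an illustrative two-group (HP-style) partition of the 20
# amino acids. The exact split differs between studies; this one is for
# demonstration only and is not the grouping derived in the paper.
HP_GROUPS: Dict[str, str] = {
    "H": "ACFILMVWY",    # treated here as hydrophobic
    "P": "DEGHKNPQRST",  # treated here as polar
}


def build_mapping(groups: Dict[str, str]) -> Dict[str, str]:
    """Map every residue letter to the representative letter of its group."""
    mapping: Dict[str, str] = {}
    for representative, members in groups.items():
        for residue in members:
            if residue in mapping:
                raise ValueError(f"residue {residue!r} appears in two groups")
            mapping[residue] = representative
    return mapping


def reduce_sequence(seq: str, mapping: Dict[str, str], unknown: str = "X") -> str:
    """Rewrite a sequence in the reduced alphabet; unmapped letters become `unknown`."""
    return "".join(mapping.get(residue, unknown) for residue in seq.upper())


HP_MAP = build_mapping(HP_GROUPS)

if __name__ == "__main__":
    print(reduce_sequence("MKTAYIAK", HP_MAP))  # HPPHHHHP
```

The same two helpers work for any N-letter alphabet: pass a dictionary with N keys, each key being the representative residue chosen for its group.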
In that five-letter reduction, the representative residues were given as (Ile, Ala, Glu, Lys, Gly), which coincide with the experimental results on the 57 residue SH3 domain by Baker and coworkers (Riddle et al., 1997). (Hereafter, the residues are simply represented as single letters.) One of the advantages of such a reduction is that it greatly reduces the complexity of protein sequences. It has been shown that sequences with these five types of letters have good foldability and kinetic accessibility in studies of protein-model chains (Wang and Wang, 2000). Some other simplified alphabets have also been proposed (Reidhaar-Olson and Sauer, 1988; Smith and Smith, 1990; Murphy et al., 2000; Solis and Rackovsky, 2000; Cieplak et al., 2001). For example, an alphabet studied by Murphy et al. (Murphy et al., 2000) was obtained from the similarity matrices of the amino acids, which characterize the correlation between the amino acids. Cieplak et al. (Cieplak et al., 2001) simplified the folding alphabet based on a distance in hydrophobicity between the natural residues, defined through the MJ matrix. The alphabet by Solis and Rackovsky (Solis and Rack (...truncated)


This is a preview of a remote PDF: https://peds.oxfordjournals.org/content/16/5/323.full.pdf

Tanping Li, Ke Fan, Jun Wang, Wei Wang. Reduction of protein sequence complexity by residue grouping, Protein Engineering Design and Selection, 2003, pp. 323-330, 16/5, DOI: 10.1093/protein/gzg044