Mapping of protein surface cavities and prediction of enzyme class by a self-organizing neural network

Protein Engineering Design and Selection, Feb 2000

An automated computer-based method for mapping of protein surface cavities was developed and applied to a set of 176 metalloproteinases containing zinc cations in their active sites. With very few exceptions, the cavity search routine detected the active site among the five largest cavities and produced reasonable active site surfaces. Cavities were described by means of solvent-accessible surface patches. For a given protein, these patches were calculated in three steps: (i) definition of cavity atoms forming surface cavities by a grid-based technique; (ii) generation of solvent accessible surfaces; (iii) assignment of an accessibility value and a generalized atom type to each surface point. Topological correlation vectors were generated from the set of surface points forming the cavities, and projected onto the plane by a self-organizing network. The resulting map of 865 enzyme cavities displays clusters of active sites that are clearly separated from the other cavities. It is demonstrated that both fully automated recognition of active sites, and prediction of enzyme class can be performed for novel protein structures at high accuracy.

http://peds.oxfordjournals.org/content/13/2/83.full.pdf

Martin Stahl, Chiara Taroni and Gisbert Schneider

F.Hoffmann-La Roche Ltd, Pharmaceuticals Research, CH-4070 Basel, Switzerland

Introduction

Knowledge of the three-dimensional structure of a target protein is a rich source of information for computer-aided drug design. Of special interest are the size and form of the active site, and the distribution of functional groups and lipophilic areas. As the number of solved X-ray structures of proteins is rapidly increasing, it is both possible and desirable to address questions related to coverage of the protein structure universe, conserved arrangements of functional groups or common ligand binding patterns (Alberts et al., 1998; Young et al., 1999).
However, such an analysis cannot be performed by visual inspection of structural models alone. An automatic procedure for analysis, prediction and comparison of potential binding sites in proteins could therefore be a very helpful tool (Böhm, 1998). Here we describe the implementation of a computational method for (i) automated detection of protein surface pockets, (ii) generation of a property-encoded solvent accessible surface (SAS) for each pocket, (iii) generation of topological correlation vectors of the SAS and (iv) projection (visualization) of these vectors onto a planar display by means of self-organizing maps (SOM). As a result, a two-dimensional map was obtained which displays the distribution of surface cavities in a chemical property space. This method was applied to a set of 176 proteins from the Protein Data Bank (PDB) containing a catalytically active zinc ion in the active site (Bernstein et al., 1977). On the resulting SOM, active site pockets are clearly separated from other surface depressions for the majority of proteins. A more detailed analysis showed that the automated mapping of the active sites accurately reflects established enzyme classification. This can give new insight into local structural similarities between enzymes with completely different folds and functions. Furthermore, the mapping technique allowed for the correct classification of 90 surface pockets derived from 18 additional zinc-containing proteins that were not contained in the training set.

Protein data collection

A training set of 175 protein structures was selected from the PDB. It contained all proteins accessible on July 17, 1998, carrying a catalytically active zinc cation in the active site with at least two nitrogen atoms in the zinc coordination sphere. It was found that the raw collection was biased towards structures of carbonic anhydrases I and II. Therefore, all structures of mutants of these enzymes were removed.
The structures of three procarboxypeptidases remained in the set (1pyt, 1pca, 1nsa), although these represent inactive enzyme forms. In addition, a test set consisting of 18 proteins was compiled to estimate the accuracy of our prediction system: 1bc2, 1bn1, 1bn3, 1bn4, 1bnn, 1bnq, 1bnt, 1bnu, 1bnw, 1bv3, 1bvt, 1cpx, 1kop, 1koq, 1sxs, 2anh, 2bmi and 4aig. These structures were made accessible in the PDB between July and December 1998.

Detection of surface cavities

A rectangular Cartesian grid with 1 Å spacing (a × b × c grid points) was generated around the protein (Figure 1a). Grid points within 0.8 Å of the van der Waals surface of a protein atom were marked as protein. Remaining points were marked as solvent. To define a grid-based surface, solvent points were selected that were less than 2 Å away from a protein point. For these surface points, a crude accessibility measure was calculated: starting from a given grid point, the program scanned along the positive and negative x, y and z axes, and the four cubic diagonals on the grid, yielding a total of 14 scan directions. A maximum of 10 steps on the grid was considered along each direction. When a protein grid point was encountered during a scan, a counter variable with an initial maximal value of 14 was decremented. This results in large accessibility values, x, for surface grid points close to convex parts of the protein surface, and low values for points within clefts or surface depressions (Figure 1b). As a next step, all surface grid points with x > 4 were reset to solvent. A surface grid point was also reset if fewer than 10 of the surrounding 26 grid points were marked as surface. As a result, the remaining surface grid points formed contiguous clusters defining protein surface pockets (Figure 1c). The pockets were sorted according to the number of grid points involved. Finally, pocket atoms were defined as being the protein grid points closest to any surface point. Variants of this algorithm have been applied by us in a different context (Stahl and Böhm, 1998), and are part of the LIGSITE program (Hendlich et al., 1997).

Fig. 1. Schematic description of the cavity detection process. (a) The protein is embedded in a rectangular grid, grid points are marked as protein (squares with bold edges) or solvent (gray-shaded squares); (b) accessibility values are calculated for solvent points, a threshold criterion is applied to define cavities; (c) contiguous clusters of cavity points are detected and excised (for details see text); (d) Connolly surfaces of cavity-forming protein atoms are calculated.

Calculation of cavity surfaces

Solvent accessible surfaces (SAS) were calculated by the Connolly algorithm (Figure 1d) (Connolly, 1983). For all cavity surface points, a more accurate accessibility value, x, was calculated that employs 45 instead of 14 scan directions (cf. previous paragraph). This algorithm has been described elsewhere (Stahl et al., 1999). In the present work, scans were performed up to a distance of 9 Å from the surface points. We found empirically that the majority of the surface points with an accessibility value above 25 were situated outside the binding pocket. Therefore, these points were removed. Finally, small disconnected surface patches were automatically removed, including surface points more than 4 Å away from the nearest grid point of the pocket.

Assignment of interaction type

One out of five possible interaction types (aliphatic, hydrogen-bond donor, hydrogen-bond acceptor, aromatic-face and aromatic-edge) was assigned to each of the surface points. A point was marked as aliphatic if the closest atom center belonged to an sp3-hybridized CH-, CH2- or CH3-group, a sulfur atom engaged in a disulfide bridge or a carboxylate carbon atom.
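As an illustration of the coarse accessibility scan described under "Detection of surface cavities", the following is a minimal sketch, not the authors' program. It assumes the grid is already available as a boolean NumPy array in which `True` marks protein points; the function name `accessibility` and the array name `is_protein` are hypothetical, and grid edges are simply treated as open solvent.

```python
import numpy as np

# 6 axis directions plus 8 cube diagonals give the 14 scan directions.
DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
DIRECTIONS += [(sx, sy, sz) for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]


def accessibility(is_protein, point, max_steps=10):
    """Crude accessibility of one grid point.

    The counter starts at 14 and is decremented once for every scan
    direction that meets a protein grid point within `max_steps` steps.
    Leaving the grid counts as reaching open solvent (an assumption).
    """
    shape = np.array(is_protein.shape)
    value = len(DIRECTIONS)
    for direction in DIRECTIONS:
        pos = np.array(point)
        for _ in range(max_steps):
            pos = pos + direction
            if np.any(pos < 0) or np.any(pos >= shape):
                break  # ran off the grid without hitting protein
            if is_protein[tuple(pos)]:
                value -= 1
                break
    return value
```

A point just above a flat protein slab is blocked only in the five directions that point downwards and so keeps a value of 9, whereas a fully enclosed point drops to 0. Points in clefts fall in between, which is what the threshold step in the text exploits.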
Surface points of thiol and hydroxy groups, primary amines and metal cations were classified as donor points; surface points of carboxylate oxygen atoms were classified as acceptors. The assignment of interaction types to points forming the surface of other functional groups depended on their relative positions. Atoms that were part of amides, guanidinium groups and aromatic rings were assigned two vectors: a unit vector perpendicular to the plane of the corresponding functional group (v1), and a unit vector pointing towards the center of the functional group (v2). This center was defined to be the central carbon atom of an amide or guanidinium moiety, or the geometric center of an aromatic ring, respectively. The surface normal vector s of a given surface point P was calculated by the Connolly algorithm (Connolly, 1983). If the angle between s and v1 was larger than 50° or the angle between s and v2 was smaller than 80°, P was marked as aromatic-face. If these conditions did not apply, the interaction type of P was donor, acceptor or aromatic-edge, depending on the atom type of the closest atom. The surface description resulting from this algorithm has two advantages over a simple atom type code: (i) it includes orientation-dependent features of functional groups on the protein surface, and (ii) it is complementary to the surface properties of potential ligands binding to protein surface pockets. Self-organizing maps (vide infra) generated from the interaction atom type descriptors proved to be superior to those generated with orientation-independent atom types (results not shown).

Generation of topological correlation vectors

A set of SAS points defining a protein pocket, together with their associated accessibility values, x, and their interaction atom types, T, served as the starting point for the generation of topological cross-correlation vectors, CV. All pairs of SAS points separated by a distance d between 0 and 15 Å were considered.
The distance range of 0 to 15 Å was divided into 10 equal distance bins, CVd (width 1.5 Å). Each distance bin was further subdivided into 15 bins, one for each pair of interaction atom types, resulting in 150 vector elements CVTd for CV. Each vector element is a sum of products xA·xB, where (A, B) are pairs of surface points falling into the distance bin d and having the interaction atom types specified by T.

Self-organizing maps (SOM)

Kohonen's self-organizing neural network provides a method for topology-preserving nonlinear projection of a high-dimensional space onto a low-dimensional display (Kohonen, 1982). A thorough description of the algorithm can be found elsewhere (Schneider and Wrede, 1998; Kohonen, 1989). The idea is to pave the high-dimensional data space (here spanned by the correlation vectors derived from protein cavities), similar to a Voronoi or Dirichlet tessellation, to obtain receptive fields of artificial neurons. As a result of this self-organization process, the receptive fields represent data clusters, and their relative arrangement in high-dimensional space can be visualized on a low-dimensional display. Here, a two-dimensional map of 20 × 20 neurons (clusters) with toroidal topology was used (Kohonen, 1989). Distances between cavity correlation vectors were calculated by the standard Euclidean distance measure. The SOM was optimized by the conventional Kohonen algorithm, using a Gaussian neighborhood function and an initial update radius of 13 neurons (Roche in-house software, NEUROMAP) (Schneider and Wrede, 1998).

Results and discussion

The aims of this work were to assess the usefulness of SOMs and our protein cavity descriptors for classification and prediction of active sites. A set of 175 protein structures served as training data, containing enzymes of closely as well as only distantly related families, enzymes of completely different function and multiple X-ray structures of a number of enzymes.
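The cross-correlation vector defined under "Generation of topological correlation vectors" can be illustrated with a short sketch. It assumes that the five interaction types are encoded as integers 0 to 4, that type pairs are unordered (which gives the 15 pairs mentioned in the text), and that each pair of points is counted once. The handling of the bin edges and the element ordering are assumptions, and the names `correlation_vector` and `PAIR_INDEX` are hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np

# Assumed integer coding of the five interaction types.
TYPES = ("aliphatic", "donor", "acceptor", "aromatic-face", "aromatic-edge")
# 15 unordered type pairs: (0, 0), (0, 1), ..., (4, 4).
PAIR_INDEX = {
    pair: i
    for i, pair in enumerate(combinations_with_replacement(range(len(TYPES)), 2))
}


def correlation_vector(coords, access, types, d_max=15.0, n_bins=10):
    """Return the 150-element vector of summed accessibility products."""
    coords = np.asarray(coords, dtype=float)
    access = np.asarray(access, dtype=float)
    n_pairs = len(PAIR_INDEX)
    cv = np.zeros(n_bins * n_pairs)
    width = d_max / n_bins  # 1.5 Angstrom per distance bin
    for a in range(len(coords)):
        for b in range(a + 1, len(coords)):
            d = np.linalg.norm(coords[a] - coords[b])
            if d <= 0.0 or d > d_max:  # edge handling is an assumption
                continue
            d_bin = min(int(d / width), n_bins - 1)
            t_bin = PAIR_INDEX[tuple(sorted((types[a], types[b])))]
            cv[d_bin * n_pairs + t_bin] += access[a] * access[b]
    return cv
```

For example, an aliphatic point (accessibility 3) and an acceptor point (accessibility 4) that lie 2 Å apart contribute 12 to the element for the second distance bin and the aliphatic/acceptor pair. The double loop is quadratic in the number of surface points, which is acceptable for a sketch but would normally be vectorized.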
The training set therefore allowed for an analysis of clustering behavior at various levels of sequence and structure similarity. For each protein, the largest surface depressions were determined by the cavity search routine. In 64% of the training set, the active site was identical to the largest surface pocket; all other active sites were among the five largest pockets (rank 2, 23%; rank 3, 6%; rank 4, 5%; rank 5, 2%). Thus, property-encoded surfaces were calculated for the five largest surface pockets only. This resulted in 865 examples, for which topological correlation vectors were generated. In the following we will use the terms 'active pocket' and 'inactive pocket' as short forms for active site and non-active site pockets.

In Figure 2 the distribution of active and inactive pockets is shown on a two-dimensional SOM, as defined by the cavity descriptors. The separation of active (gray) and inactive pockets (white) is striking. Several groups of active pockets are surrounded by empty neurons (black) and can thereby be distinguished from the large, coherent white areas where inactive pockets are grouped together. This observation strongly supports the usefulness of our correlation vector representation of protein surface cavities and indicates that relevant biological features have been captured.

Fig. 2. Nonlinear projection of the distribution of protein surface cavities in a chemical space spanned by topological correlation vectors. A toroidal 20 × 20 self-organizing map was used. The receptive fields of the neurons are indicated by squares, and the location of different cavity types is shown by gray-shading. Black, empty neuron; white, inactive pockets; dark gray, metalloproteinase active pockets; light gray, other Zn2+-containing active pockets; cross-hatched, multiple cavity types clustered; arrow, row of cavity structures shown in Figure 3.
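To make the toroidal map concrete, here is a minimal Kohonen-style training sketch. It follows the settings stated in the text (20 × 20 neurons, toroidal topology, Euclidean distance, Gaussian neighborhood, initial radius 13), but it is not the NEUROMAP software: the learning rate, the linear decay schedules, the number of epochs and the random initialization are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def torus_dist2(shape, winner):
    """Squared grid distance from `winner` to every neuron, wrapping at the edges."""
    rows, cols = shape
    dr = np.abs(np.arange(rows) - winner[0])
    dc = np.abs(np.arange(cols) - winner[1])
    dr = np.minimum(dr, rows - dr)
    dc = np.minimum(dc, cols - dc)
    return dr[:, None] ** 2 + dc[None, :] ** 2


def best_matching_unit(weights, x):
    """Grid position of the neuron with the smallest Euclidean distance to x."""
    d2 = ((weights - x) ** 2).sum(axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d2), d2.shape))


def train_som(data, shape=(20, 20), radius0=13.0, lr0=0.5, epochs=20, seed=0):
    """Train a toroidal SOM; schedules and initialization are assumptions."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.random((*shape, data.shape[1]))
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            radius = radius0 * (1.0 - frac) + 0.5 * frac  # shrink towards 0.5
            lr = lr0 * (1.0 - frac)
            winner = best_matching_unit(weights, x)
            h = np.exp(-torus_dist2(shape, winner) / (2.0 * radius ** 2))
            weights += lr * h[:, :, None] * (x - weights)
            step += 1
    return weights
```

The wrap-around in `torus_dist2` is what makes the first and last rows (and columns) of the map direct neighbors, so a cluster that appears split across opposite edges of the printed map is in fact contiguous.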
Metalloproteinase active pockets (dark gray) form three separate groups, whereas the majority of the other active pockets (light gray) cluster in the center of the map (Figure 2). Visual inspection of the pocket surfaces suggests four large groups which can, with some simplification, be regarded as four areas on the map separated by diagonal lines. The upper left corner of the map, consisting mainly of inactive pockets, is dominated by small, shallow surface depressions. The large, diagonal band of active pockets contains medium-sized, valley-shaped surfaces. The size and complexity of these cavities increase towards the lower right corner of the map. The two groups of metalloproteinase active pockets in the lower right quadrant of the map are small, representing relatively deep and narrow active sites. These observations are partly illustrated in Figure 3, which shows simplified lateral views of the pockets represented by row number 5 on the map shown in Figure 2 (arrow in Figure 2). One should keep in mind that the map actually forms a torus, i.e. neuron [20,5] is directly adjacent to neuron [1,5]. As a result of topology preservation, adjacent neurons contain similar protein cavity structures. Neurons [6,5] and [14,5] are empty, reflecting comparatively large structural differences between the neurons separated by these voids. The distribution of cavity shapes along the neurons shown in Figure 3 shows that the SOM was able to perform a reasonable mapping of the cavity space defined by our correlation vector representation.

In Figure 4a, the distribution of the different types of active pockets is displayed, and the corresponding enzyme classes are marked by letters (cf. legend of Figure 4). It is immediately obvious that most classes of enzymes form individual clusters. There are two notable exceptions to this observation.
Superoxide dismutases have extremely shallow active sites, which leads to small active site surface patches that cannot be well distinguished from the bulk of inactive pockets by means of the correlation vectors. Therefore, several members of this group are scattered among inactive pockets (e.g. neurons [7,15] and [8,15]). β-Lactamases are the second exception. These enzymes possess a loop region above the active site that can adopt variable conformations (in some cases no density is observed for this part of the structure) (Philippon et al., 1998). This is reflected in greatly varying accessibility values for the active site surfaces (e.g. neurons [11,4] and [12,9]).

Fig. 3. Lateral views of protein cavity structures projected onto a row of adjacent neurons of the self-organizing feature map. Neuron positions on the map are given in brackets (cf. Figures 2 or 4). Note that neuron [20,5] is directly adjacent to neuron [1,5]. Neurons [6,5] and [14,5] do not contain structures (empty clusters).

Carbonic anhydrases form the largest of the contiguous clusters of active pockets (Figure 4a). Interestingly, this cluster is divided into two subgroups. An all-against-all sequence comparison with BLAST2 (Altschul and Gish, 1996) and subsequent clustering using the Jarvis-Patrick algorithm (Jarvis and Patrick, 1973) revealed that the members within each subgroup possess more than 97% pair-wise sequence identity, while any pair of sequences from the two groups has less than 60% identity. The outlier in neuron [8,16], 1thj, has only 55% sequence identity to the large group and adopts a different fold than members of the two large clusters. 1thj forms a single-stranded, left-handed β-helix (or β-solenoid). The other carbonic anhydrases, in contrast, have an α/β roll architecture. Evidently, for the class of carbonic anhydrases, differences in sequence are paralleled by differences in active site features.

Fig. 4. Distribution of metalloproteinase classes on a self-organizing map (cf. Fig. 2). (a) Training data projection; (b) Test data projection. A, adenosine deaminase; B, β-lactamase; C, carbonic anhydrase; D, L-fucose-1-phosphate aldolase; E, alkaline phosphatase; F, purple acid phosphatase; G, other Zn2+-containing active pockets; H, Cu/Zn superoxide dismutase; I, astacine; L, adamalysins; M, matrixins; O, other metalloproteinases; P, procarboxypeptidase; S, serralysins; T, thermolysin and neutral protease; X, carboxypeptidases; dotted, inactive pocket; cross-hatched, multiple cavity types clustered.

Alkaline phosphatases, metzincins, thermolysins and carboxypeptidases also form large groups of active pockets (Figure 4a). Clustering of enzymes of the same type is not perfect in all cases, e.g. there are two outliers in neurons [8,1] and [10,1], which should be part of the thermolysin and the alkaline phosphatase group, respectively. Visual inspection reveals that these pocket surfaces cannot be distinguished from the remaining members of the respective group; therefore, their location on the map must be attributed to errors during the map's self-organization process. It is well known that conventional Kohonen-type SOMs are prone to certain topology distortions inherent to the training algorithm (Kohonen, 1982; Graepel and Obermayer, 1998). Furthermore, these observations might be due to premature convergence of the training process (Bienfait and Gasteiger, 1997a,b). Despite these disadvantages, SOMs are generally considered to be well suited for visualization of high-dimensional spaces (Kirew et al., 1998; Schneider and Wrede, 1998; Schneider et al., 1998). Recently, some modifications of the training algorithm and additional methods have been suggested that might help to overcome the problems mentioned (Graepel and Obermayer, 1998; Wang et al., 1998).
The arrangement of metalloproteinases on the map warrants a more detailed analysis. Endo- and exopeptidases are well separated (Figure 4a). The group of carboxypeptidase A active pockets is located at a large distance from the other metalloproteinase pockets, and is also set apart from a distant relative, the muramoyl pentapeptide carboxypeptidase (1lbu, neuron [15,11]), as well as from their inactive pro-enzymes (denoted by 'P' in Figure 4a). The large family of metzincin enzymes is divided into three groups (Stöcker et al., 1995; Borkakoti, 1998). This seems to be a consequence of their active site shape: the matrix metalloproteinase cluster in the lower left corner of the map is characterized by active pockets containing a stretched-out and rather shallow cavity, and an S1-pocket of moderate size (MMP-1, MMP-9). In contrast, the second cluster located in the lower right corner of the map contains pockets dominated by deep S1 subsites (MMP-8, adamalysins). Some of the pocket surfaces of both groups are depicted in Figure 3 (neurons [17,5], [18,5], [19,5]). The separate neuron [14,4] in Figure 4a contains pockets with tunnel-shaped S1-pockets (MMP-3 structures and one neutrophil collagenase, 1mnc, with Arg222 rotated away from the bottom of the S1 pocket). The small group of serralysins is slightly set apart from the other metzincin family proteins because their active pockets possess more complex, distorted surfaces (Bode et al., 1996).

Up to this point, the analysis of the SOMs has shown that the automatic projection of surface-derived correlation vectors leads to an intuitively reasonable arrangement of protein pockets. It is apparent, however, that the specific position of a pocket on the map depends on the various empirical cut-off values used in our cavity analysis. We were therefore interested to see whether the trained SOM (Figures 2 and 4a) permits correct predictions for enzymes not contained in the training set.
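One simple way in which such a prediction could be read off a trained map automatically is sketched below; the text does not state that the authors proceeded this way. Each neuron collects the labels of the training cavities it wins, and a new cavity receives the majority label of its winning neuron, or no label if that neuron is empty. The function names and the label strings are hypothetical, and `weights` stands for any trained map stored as a NumPy array whose last axis is the descriptor dimension.

```python
from collections import Counter, defaultdict

import numpy as np


def winner(weights, x):
    """Flat index of the neuron whose weight vector is closest (Euclidean) to x."""
    flat = weights.reshape(-1, weights.shape[-1])
    return int(np.argmin(((flat - np.asarray(x, dtype=float)) ** 2).sum(axis=1)))


def label_neurons(weights, train_vectors, train_labels):
    """Count, for every neuron, the labels of the training vectors it wins."""
    votes = defaultdict(Counter)
    for x, label in zip(train_vectors, train_labels):
        votes[winner(weights, x)][label] += 1
    return votes


def predict(weights, votes, x):
    """Majority label of the winning neuron, or None if that neuron is empty."""
    counts = votes.get(winner(weights, x))
    return counts.most_common(1)[0][0] if counts else None
```

A neuron that wins cavities of several types corresponds to the cross-hatched squares in Figures 2 and 4. In this sketch the majority label would be returned for it, although reporting the full vote count may be more informative.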
Surface pockets and the corresponding correlation vectors were calculated for a set of 18 zinc enzyme structures from the PDB not contained in the training set. Their projection on the trained SOM is shown in Figure 4b. With one exception, carbonic anhydrase (PDB codes 1kop, 1koq) located in neuron [11,14], all active pockets and all inactive pockets were correctly classified. Furthermore, the active pockets were placed within clusters of the correct type of enzyme. This means that our method is sufficiently robust for accurately predicting the enzyme type, provided that members of the enzyme class were covered by the training set.

In the present work we have successfully applied a novel method for automatic recognition of cavities on the surface of protein structures. The applicability of nonlinear projection by conventional Kohonen-type neural networks for data visualization was substantiated. This method complements other automated procedures for locating binding pockets based on triangulation techniques (Liang et al., 1998). To further improve the accuracy of the SOM projections, modified or other nonlinear mapping algorithms might be useful (Bienfait and Gasteiger, 1997a,b; Schneider and Wrede, 1998; Schneider et al., 1998). Furthermore, we were able to demonstrate that correlation vectors encoding the distribution of generalized atom types and the shape of surface pockets are suited for (i) classification of active and inactive sites, and (ii) accurate prediction of the enzymatic class of test set proteins. This was verified taking the surface cavities of a set of zinc-containing metalloproteinases as an example. We are convinced that this and similar techniques bear a significant potential for automated protein structure analysis and drug design (Verdonk et al., 1999).

Hans-Joachim Böhm, Daniel Bur and Petra Schneider are thanked for helpful discussions and comments on the manuscript.


Martin Stahl, Chiara Taroni, Gisbert Schneider. Mapping of protein surface cavities and prediction of enzyme class by a self-organizing neural network, Protein Engineering Design and Selection, 2000, 83-88, DOI: 10.1093/protein/13.2.83