Data mining techniques to study the disulfide-bonding state in proteins: signal peptide is a strong descriptor

PDF: https://bioinformatics.oxfordjournals.org/content/20/16/2509.full.pdf

Dominique Tessier, Benjamin Bardiaux, Colette Larré and Yves Popineau

Unité de Recherche sur les Protéines Végétales et leurs Interactions, INRA, Rue de la Géraudière, BP 71627, 44316 Nantes Cedex 3, France

In the eucaryotic cell, the formation of disulfide bonds generally takes place inside the endoplasmic reticulum, which provides a unique folding environment. The DisulfideDB database gathers information about this biological process together with structural, evolutionary and neighborhood information on cysteines in proteins. Mining this information with an association rule discovery program makes it possible to extract strong rules for predicting the disulfide-bonding state of cysteines.

Contact:

Supplementary information: The web supplement to this paper, including the UML diagram of the database and some procedures used with the association rule discovery tool, may be found at http://www.nantes.inra.fr/centre/unitesrecherche/urpvi/bioinformatique/publi.html.

INTRODUCTION

Disulfide bridges, which are the most frequent covalent crosslinks found in proteins, play an important role in protein structure, and several methods for predicting the disulfide-bonding state of cysteines in proteins have been developed since the early 1990s. The first generation of methods was based on the assumption that the local sequence alone determines the disulfide-bonding state of a cysteine. These methods used neural networks to recognize this bonding state (Muskal et al., 1990) or a statistical analysis of the amino acid frequencies in the sequence environment of the cysteine (Fiser et al., 1992). Two later methods added evolutionary information to the local environment, using either a neural network (Fariselli et al., 1999) or statistical analysis (Fiser and Simon, 2000). More recently, new methods have exploited the observation that all the cysteines in a protein are generally in the same state.
One of these methods is based on a combination of a neural network and a hidden Markov model (Martelli et al., 2002), another on an SVM-based predictor (Frasconi et al., 2002), and a third on a combination of logistic functions learned with subsets of proteins considered to be homogeneous in terms of their overall amino acid content (Mucchielli-Giorgi et al., 2002). The best predictions of the disulfide-bonding state of cysteines have success rates close to 88%.

In this study, we add new descriptors related to the biological process of protein formation to those studied previously. The formation of disulfide bonds in proteins requires a sufficiently oxidizing environment, and disulfide bonds generally fail to form in the cell cytosol, where a high concentration of reducing agents converts S-S bonds back to cysteine-SH groups. Thus, in the living eucaryotic cell, the formation of disulfide bonds takes place inside the protected cellular environment of the endoplasmic reticulum (ER), which provides all the membranes and protein components of the Golgi, the ER itself, lysosome, endosome, plant vacuoles, secretory vesicles and the plasma membrane of the cell. The transport of proteins across the ER membrane relies on the presence of a signal peptide. In water-soluble proteins, this signal peptide is located at the N-terminal part of the sequence and is cleaved off. The translocation process for proteins meant to remain in the membrane is more complex, since the signal peptide can be internal to the sequence. In addition, some proteins are imported into the ER after their synthesis has been completed (Alberts et al., 2002). This knowledge of the biological process prompted us to introduce new descriptors: the presence of a signal peptide and, when it is known, the subcellular location of each protein.
All the descriptors we used are stored in a freely available PostgreSQL (http://www.postgresql.org) database that permits fast, secure and flexible access to data. In order to use a general, systematic method to analyze correlations among these heterogeneous data, we applied different data mining techniques to the resulting dataset, in particular the association rule discovery algorithm Apriori (Agrawal et al., 1993) as implemented in the WEKA package (http://www.cs.waikato.ac.nz/ml/weka/). When discovering association rules, the aim was to reveal relationships between data and to compute the precision of each relationship in the database. A conditional rule is of the form: If A and B then C. The usual precision measures are support and confidence. The support represents the frequency of co-occurrence of all the items appearing in the rule, and the confidence represents the accuracy of the rule, computed by dividing the support value by the frequency of co-occurrence of the items in the left part of the rule. After describing the database, we report some rules obtained with this technique on the disulfide-bonding state of cysteines.

CONSTITUTION OF THE DATABASE DisulfideDB

The core of the database is based on a representative selection of the Protein Data Bank (PDB) (Berman et al., 2000) that can be accessed at http://homepages.fhgiessen.de/hg12640/pdbselect. We used the version updated in December 2002 with a sequence identity threshold of 25%, the pdb_select_25 dataset (Hobohm and Sander, 1994). This selection does not contain homologous sequences. Since one of our objectives was to identify descriptors that distinguish free cysteines from bonded ones, we only retained chains from eucaryotic cells with at least one disulfide bond annotation in the PDB file. We thus limited our study to proteins that are in a favorable environment for disulfide bond formation, in organisms with an ER organelle. The number of chains and cysteines in the DisulfideDB are shown in Table 1.
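The support and confidence measures defined in the introduction for a rule "If A and B then C" can be illustrated with a minimal sketch. This is not the Apriori algorithm itself, nor the WEKA implementation used in the study; it only computes the two measures for a given rule. The records, the attribute names (`signal_peptide`, `location`, `state`) and the helper functions `support` and `confidence` are hypothetical and invented for illustration; they are not taken from DisulfideDB.

```python
from typing import FrozenSet, Iterable, List

# Toy records invented for illustration only; they are NOT DisulfideDB data.
# Each record is the set of attribute=value items describing one cysteine.
TOY_RECORDS: List[FrozenSet[str]] = [
    frozenset({"signal_peptide=yes", "location=extracellular", "state=bonded"}),
    frozenset({"signal_peptide=yes", "location=extracellular", "state=bonded"}),
    frozenset({"signal_peptide=yes", "location=membrane", "state=free"}),
    frozenset({"signal_peptide=no", "location=cytoplasm", "state=free"}),
]


def support(records: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = frozenset(items)
    if not records:
        return 0.0
    return sum(1 for record in records if wanted <= record) / len(records)


def confidence(
    records: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Support of the whole rule divided by the support of its left part."""
    left = frozenset(antecedent)
    left_support = support(records, left)
    if left_support == 0.0:
        # The left part never occurs, so the rule cannot be evaluated.
        return 0.0
    return support(records, left | frozenset(consequent)) / left_support


if __name__ == "__main__":
    lhs = ["signal_peptide=yes"]
    rhs = ["state=bonded"]
    print("support:", support(TOY_RECORDS, lhs + rhs))
    print("confidence:", confidence(TOY_RECORDS, lhs, rhs))
```

In this sketch the support is expressed as a fraction of all records rather than as a raw count; either convention gives the same confidence, since the ratio is unchanged.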
The table structure of the DisulfideDB is designed to represent different levels of data description. Local information around each cysteine comprises its position, its solvent accessibility extracted from DSSP (Kabsch and Sander, 1983) and its percentage of conservation within its family extracted from HSSP (Dodge et al., 1998). The local descriptors are completed by the list of sequential neighbors contained in a window of 11 amino acids centered on the cysteine, and by the list of spatial neighbors whose Cα atoms lie within 7 Å of that of the cysteine, calculated from the PDB files. The table SSBOND connects the two cysteines of each disulfide bridge and records the type of the bridge, either intramolecular or intermolecular. The next level of information is the description of the protein topology, with the list of its secondary structures extracted from the PDB files and the description of hydrophobic regions calculated on a window of 21 residues with the Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982). Finally, information such as the presence of a signal peptide, the subcellular location, and the mapping between the PDB names and protein sequence names was collected. We obtained subcellular locations from the SWISS-PROT data (...truncated)