Data mining techniques to study the disulfide-bonding state in proteins: signal peptide is a strong descriptor
Dominique Tessier, Benjamin Bardiaux, Colette Larré and Yves Popineau
Unité de Recherche sur les Protéines Végétales et leurs Interactions, INRA, Rue de la Géraudière, BP 71627, 44316 Nantes Cedex 3, France
In the eucaryotic cell, the formation of disulfide bonds generally takes place inside the endoplasmic reticulum, which provides a unique folding environment. The DisulfideDB database gathers information about this biological process together with structural, evolutionary and neighborhood information on cysteines in proteins. Mining this information with an association rule discovery program makes it possible to extract some strong rules for the prediction of the disulfide-bonding state of cysteines.
Supplementary information: The web supplement to this paper, including the UML diagram of the database and some procedures used with the association rule discovery tool, may be found at http://www.nantes.inra.fr/centre/unitesrecherche/urpvi/bioinformatique/publi.html.
INTRODUCTION
Disulfide bridges, which are the most frequent covalent
crosslinks found in proteins, play an important role in protein
structure, and several methods for prediction of the
disulfide-bonding state of cysteines in proteins have been developed
since the early 1990s.
The first generation of methods was based on the assumption
that the local sequence alone determines the disulfide-bonding
state of a cysteine. These methods used neural networks
to recognize this bonding state (Muskal et al., 1990) or
a statistical analysis of the amino acid frequencies in the
sequence environment of the cysteine (Fiser et al., 1992). Two
other methods later contributed evolutionary information to
the local environment using neural network implementation
(Fariselli et al., 1999) or statistical analysis (Fiser and Simon,
2000). More recently, new methods integrated the fact that all
the cysteines in a protein are generally in the same state. One of
these methods is based on a combination of a neural network
and a hidden Markov model (Martelli et al., 2002), another on
an SVM-based predictor (Frasconi et al., 2002), and a third one
is based on a combination of logistic functions learned with
subsets of proteins considered to be homogeneous in terms
of their overall amino acid content (Mucchielli-Giorgi et al.,
2002). The best predictions of the disulfide-bonding state of
cysteines have success rates close to 88%.
In this study, we add new descriptors related to the biological
process of protein formation to those studied previously. In
fact, the formation of disulfide bonds in proteins requires a
sufficiently oxidizing environment and disulfide bonds
generally fail to form in the cell cytosol, where a high concentration
of reducing agents converts S-S bonds back to cysteine -SH
groups. Thus, in the living eucaryotic cell, the formation of
disulfide bonds takes place inside the protected cellular
environment of the endoplasmic reticulum (ER), which provides all
the membranes and protein components of the Golgi, the ER
itself, lysosome, endosome, plant vacuoles, secretory vesicles
and the plasma membrane of the cell.
The transport of proteins across the ER membrane relies
on the presence of a signal peptide. In water-soluble proteins,
this signal peptide is located at the N-terminal part of the
sequence and is cleaved off. The translocation process for
proteins meant to remain in the membrane is more complex since
the signal peptide can be internal to the sequence. In addition,
some proteins are imported into the ER after their synthesis
has been completed (Alberts et al., 2002).
This information, based on the biological process, prompted
us to introduce new descriptors: the indication of the presence
of a signal peptide and the subcellular location of each protein
when it is known.
All the descriptors we used are stored in a freely
available PostgreSQL (http://www.postgresql.org) database that
permits fast, secure and flexible access to data. To analyze
correlations among these heterogeneous data in a general and
systematic way, we applied different data mining
techniques to the dataset, in particular the association rule discovery
algorithm Apriori (Agrawal et al., 1993) with the WEKA
package (http://www.cs.waikato.ac.nz/ml/weka/).
When discovering association rules, the aim
is to reveal relationships between data and to compute the
precision of each relationship in the database. A conditional
rule is of the form: If A and B then C. The usual precision
measures are support and confidence. The support is the
frequency with which all the items appearing in the
rule co-occur in the database, and the confidence is the accuracy of the rule,
computed by dividing the support by the frequency of
co-occurrence of the items in the left part of the rule.
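As a minimal sketch of these two measures, the following Python fragment counts how often the left part and the whole rule hold over a list of records. The attribute names (signal_peptide, location, state) and the toy records are hypothetical and serve only to illustrate the arithmetic; they are not taken from DisulfideDB, and the sketch does not reproduce the WEKA implementation.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def matches(record: Record, conditions: Dict[str, str]) -> bool:
    """True when the record satisfies every attribute=value condition."""
    return all(record.get(key) == value for key, value in conditions.items())


def support_and_confidence(
    records: List[Record],
    antecedent: Dict[str, str],
    consequent: Dict[str, str],
) -> Tuple[float, float]:
    """Support and confidence of the rule 'If antecedent then consequent'."""
    n_total = len(records)
    n_left = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    # Support: frequency of co-occurrence of all items of the rule.
    support = n_both / n_total if n_total else 0.0
    # Confidence: support divided by the frequency of the left part,
    # which reduces to n_both / n_left.
    confidence = n_both / n_left if n_left else 0.0
    return support, confidence


# Hypothetical toy records (one per cysteine), for illustration only.
toy_records = [
    {"signal_peptide": "yes", "location": "secreted", "state": "bonded"},
    {"signal_peptide": "yes", "location": "secreted", "state": "bonded"},
    {"signal_peptide": "yes", "location": "secreted", "state": "free"},
    {"signal_peptide": "no", "location": "cytoplasm", "state": "free"},
]

support, confidence = support_and_confidence(
    toy_records,
    antecedent={"signal_peptide": "yes", "location": "secreted"},
    consequent={"state": "bonded"},
)
```

In this toy case the rule holds for two of the four records and its left part for three, so the support is 2/4 and the confidence 2/3.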
After the description of the database, we report some rules
obtained on the disulfide-bonding state of cysteines, using this
technique.
CONSTITUTION OF THE DATABASE
DisulfideDB
The core of the database is based on a representative
selection of the Protein Data Bank (PDB) (Berman et al.,
2000), the pdb_select_25 dataset (Hobohm and Sander, 1994),
which uses a threshold of 25% and was updated in December 2002.
It can be accessed at
http://homepages.fhgiessen.de/hg12640/pdbselect.
This selection does not contain homologous
sequences. Since one of our objectives was to identify
descriptors that discriminate free cysteines from bonded ones, we
only retained chains from eucaryotic cells with at least one
disulfide bond annotation in the PDB file. In this way, we limited
our study to proteins that are in a favorable environment for
disulfide bond formation in organisms with an ER organelle.
The number of chains and cysteines in the DisulfideDB are
shown in Table 1.
The table structure of the DisulfideDB is designed to
represent different levels of data description.
Local information around each cysteine comprises its
position, its solvent accessibility extracted from DSSP (Kabsch
and Sander, 1983) and its percentage of conservation within its
family extracted from HSSP (Dodge et al., 1998). The list of
sequential neighbors contained in a window of 11 amino
acids centered on the cysteine, and the list of spatial
neighbors included within a distance between C lower than 7 Å,
calculated from the PDB files, complete the local descriptors.
The table SSBOND links the two cysteines of each disulfide
bridge and records the type of the bridge, either
intramolecular or intermolecular.
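One way such a record could be derived from the SSBOND annotation of a PDB file is sketched below. It reads the fixed-column chain identifiers and residue numbers of an SSBOND line and, as a simplifying assumption, labels a bridge intramolecular when both cysteines carry the same chain identifier (symmetry operators are ignored). The class and function names are hypothetical, and this is not necessarily how the DisulfideDB table was filled.

```python
from typing import NamedTuple


class DisulfideBridge(NamedTuple):
    chain1: str
    resnum1: int
    chain2: str
    resnum2: int
    bridge_type: str


def parse_ssbond(line: str) -> DisulfideBridge:
    """Parse one fixed-column SSBOND record of a PDB file."""
    if not line.startswith("SSBOND"):
        raise ValueError("not an SSBOND record")
    chain1 = line[15]            # column 16: chain of the first cysteine
    resnum1 = int(line[17:21])   # columns 18-21: its residue number
    chain2 = line[29]            # column 30: chain of the second cysteine
    resnum2 = int(line[31:35])   # columns 32-35: its residue number
    # Assumption: same chain identifier means intramolecular.
    bridge_type = "intramolecular" if chain1 == chain2 else "intermolecular"
    return DisulfideBridge(chain1, resnum1, chain2, resnum2, bridge_type)


def format_ssbond(
    serial: int, chain1: str, resnum1: int, chain2: str, resnum2: int
) -> str:
    """Build the leading columns of an SSBOND record (illustration helper)."""
    return (
        f"SSBOND{serial:>4} CYS {chain1}{resnum1:>5}"
        f"    CYS {chain2}{resnum2:>5}"
    )


# Hypothetical records, for illustration only.
intra = parse_ssbond(format_ssbond(1, "A", 6, "A", 127))
inter = parse_ssbond(format_ssbond(2, "A", 30, "B", 115))
```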
The next level of information is the description of the protein
topology with the list of its secondary structures extracted
from the PDB files and the description of hydrophobic regions
calculated on a window of 21 residues with the Kyte-Doolittle
hydropathy scale (Kyte and Doolittle, 1982).
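A sliding-window hydropathy profile of this kind can be sketched as below, using the published Kyte-Doolittle values and a window of 21 residues. The function name is hypothetical, and the plain unweighted average over each full window is an assumption; the text does not specify how the hydrophobic regions were delimited from the profile.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 21) -> List[float]:
    """Mean hydropathy of every full window along the sequence.

    Returns one value per window position; sequences shorter than the
    window yield an empty profile.
    """
    if len(sequence) < window:
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


# Hypothetical sequence: a leucine-rich stretch followed by lysines.
toy_profile = hydropathy_profile("L" * 21 + "K" * 4)
```

Windows with a high mean value would then be candidates for hydrophobic regions; any threshold applied to the profile is left open here.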
Finally, information such as the presence of a signal
peptide, the subcellular location and the mapping between
PDB names and protein sequence names was collected.
We obtained subcellular locations from the SWISS-PROT
data (...truncated)