MolTalk – a programming library for protein structures and structure analysis
BMC Bioinformatics
Software
Alexander V Diemand*1 and Holger Scheib2
1 University of Geneva and Swiss Institute of Bioinformatics, Centre Medicale Universitaire, 1, rue Michel-Servet, 1211 Geneva 4, Switzerland
2 University of Lausanne and Swiss Institute of Bioinformatics, 155, chemin de Boveresses, 1066 Epalinges s/Lausanne, Switzerland
Background: Two largely unsolved but increasingly urgent problems for modern biologists are a) to analyse protein structures quickly and easily and b) to comprehensively mine the wealth of information that the Protein Data Bank (PDB) distributes along with the 3D co-ordinates. Tools that address these problems need to be highly flexible and powerful, yet freely available and easy to learn.
Results: We present MolTalk, an elaborate programming language, which consists of the programming library libmoltalk, implemented in Objective-C, and the Smalltalk-based interpreter MolTalk. MolTalk combines the advantages of easy-to-learn, programmable procedural scripting with the flexibility and power of a full programming language. We give an overview of the currently available applications of MolTalk and describe one of them, PDBChainSaw, in more detail. PDBChainSaw is a MolTalk-based parser and information extraction utility for PDB files. PDBChainSaw is synchronised with the weekly updates of the PDB, and the updates are available for free download from the MolTalk project page http://www.moltalk.org, following the link to PDBChainSaw. For each chain in a protein structure, PDBChainSaw extracts the sequence from its coordinates and provides additional information from the PDB file header section, such as the scientific name of the organism, the compound name, and the EC code.
Conclusion: MolTalk provides a rich set of methods to analyse and even modify experimentally determined or modelled protein structures. These methods vary in complexity and are thus suitable for beginners and advanced programmers alike. We envision MolTalk to be most valuable in the following applications:
1) To analyse protein structures repetitively on a large scale, e.g. to benchmark protein structure prediction methods or to evaluate structural models. The quality of the resulting 3D models can be assessed by, e.g., calculating a Ramachandran-Sasisekharan plot.
2) To quickly retrieve information for (a limited number of) macro-molecular structures, such as H-bonds, salt bridges, and contacts between amino acids and ligands or at the interface between two chains.
3) To programme more complex structural bioinformatics software and to implement demanding algorithms through its portability to Objective-C, e.g. iMolTalk.
4) To be used as a front end to databases, e.g. PDBChainSaw.
Background
The major demand that the life sciences place on
bioinformatics today is to combine the often heterogeneous
information available and make it easily accessible to a broad
range of users. In the past, these efforts concentrated on
coping with the overwhelming amount of data that
has entered, and still enters, nucleotide and protein sequence
databases [1,2]. Today, other information sources, such as
protein structures, are increasingly coming under the spotlight
of a broader scientific community.
In contrast to the sequence world, only one central data
resource exists for protein structures, the Protein Data
Bank (PDB) [3]. Despite the undisputed advantage of
having all structural data available from one source in a
common file format, protein structures impose a new level of
complexity. They carry information about where in space
the adjacent residues of a protein sequence are located.
Furthermore, protein structures provide insights into the
spatial environment of an amino acid, which is different
from its sequence neighbourhood, as well as into its
interactions with other residues or heterogeneous ligands. This
wealth of information contains answers to questions as
diverse as how proteins function and which compounds
may interact with a given protein. However, these answers
often remain inaccessible to a broader scientific
community.
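The difference between the spatial environment and the sequence neighbourhood of a residue can be illustrated with a minimal Python sketch. It is not part of MolTalk; the function name, the use of C-alpha coordinates, the 8 Å cutoff and the minimum sequence separation of three residues are all assumptions chosen for illustration.

```python
import math

def spatial_neighbours(ca_coords, index, cutoff=8.0, min_separation=3):
    """Return the indices of residues that are close in space but not in sequence.

    ca_coords      -- list of (x, y, z) C-alpha coordinates, one per residue,
                      in sequence order (assumed input format)
    index          -- position of the query residue in that list
    cutoff         -- distance threshold in Angstrom (illustrative value)
    min_separation -- minimum distance along the sequence (illustrative value)
    """
    centre = ca_coords[index]
    return [
        j
        for j, point in enumerate(ca_coords)
        if abs(j - index) >= min_separation
        and math.dist(centre, point) <= cutoff
    ]
```

For a chain that folds back on itself, such a query returns residues from the opposite strand, which no inspection of the sequence alone would reveal.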
To overcome this information gap, we developed
MolTalk. MolTalk consists of a programming library
implemented in Objective-C [4], which maps PDB structure files to
object space, and of a scripting language based on
Smalltalk [5]. Moreover, MolTalk provides numerous
methods that enable both the novice and the expert
structural bioinformatician to rapidly develop software
tailored to their individual needs and to gain
novel insights from protein structure analyses. As an
application of MolTalk we describe PDBChainSaw, a
mirroring and data extraction routine for PDB files.
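What mapping a PDB file to object space can look like is sketched below in Python. This is not the libmoltalk API: the class names, the structure/chain/residue/atom hierarchy and the function parse_pdb are hypothetical and serve only as an illustration. The sketch reads the fixed-column ATOM records of the PDB format, ignores HETATM records and alternate locations, and derives a one-letter sequence per chain from the residues found in the coordinates.

```python
from dataclasses import dataclass, field

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

@dataclass
class Atom:
    name: str
    x: float
    y: float
    z: float

@dataclass
class Residue:
    name: str
    number: int
    atoms: dict = field(default_factory=dict)     # atom name -> Atom

@dataclass
class Chain:
    code: str
    residues: dict = field(default_factory=dict)  # residue key -> Residue

    def sequence(self):
        """One-letter sequence derived from the residues seen in the coordinates."""
        return "".join(THREE_TO_ONE.get(r.name, "X") for r in self.residues.values())

@dataclass
class Structure:
    chains: dict = field(default_factory=dict)    # chain code -> Chain

def parse_pdb(text):
    """Map the ATOM records of a PDB-format string onto the objects above."""
    structure = Structure()
    for line in text.splitlines():
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        code = line[21]                            # column 22: chain identifier
        chain = structure.chains.setdefault(code, Chain(code))
        key = line[22:27]                          # columns 23-27: number + insertion code
        residue = chain.residues.setdefault(
            key, Residue(line[17:20].strip(), int(line[22:26]))
        )
        name = line[12:16].strip()                 # columns 13-16: atom name
        residue.atoms[name] = Atom(
            name, float(line[30:38]), float(line[38:46]), float(line[46:54])
        )
    return structure
```

Once the file is held as objects, a query such as structure.chains["A"].sequence() replaces any further text processing of the file.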
Implementation
MolTalk is composed of two functional parts: (1) the
programming library libmoltalk and (2) MolTalk, the
Smalltalk interpreter. The libmoltalk library implements classes
(Figure 1) in Objective-C [4] whereas the interpreter
MolTalk is based on StepTalk [5], a Smalltalk interpreter for
GNUstep [6]. The interpreter interacts with all classes
defined in libmoltalk and is used as a front end to this
library.
The classes implemented in libmoltalk are organised into
groups, namely "structural", "mathematics", and
"others". Their complexity and flexibility vary, as indicated by
the labels "Basic" and "Xtra" (Table 1). "Basic" classes can
be used even by novice users without special training,
whereas classes labelled "Xtra" are potentially more difficult
to use but often allow a higher degree of flexibility in
software development (for details, please refer to the manual
pages at http://www.moltalk.org/Manual.html).
Each class consists of a set of methods, which in turn are
labelled either "Basic" or "Xtra". Independently of their
class, methods can be organised into (1) "basic features",
(2) "extended features", (3) "mathematical functions",
and (4) "others". "Basic features" enable mapping into
object space and querying. "Extended features" can be
further sub-divided into "operations" and "manipulations".
"Operations" include, for example, superimposition, structural
alignment, and transformation. With
"manipulations", chains, residues or atoms can be added
to or removed from a structure. "Mathematical functions"
allow the calculation of vectors and matrices to perform
spatial transformations. The features summarised in
"others" regulate input and output. Table 1 lists the
potentially most important methods and classes of the
group "Structure".
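A spatial transformation of the kind the "mathematical functions" support can be written as x' = R x + t, with a 3x3 rotation matrix R and a translation vector t. The following Python sketch shows the arithmetic only. It does not reproduce the MolTalk classes; the function names and the choice of a rotation about the z-axis are assumptions made for illustration.

```python
import math

def rotation_z(angle_deg):
    """Return the 3x3 matrix for a rotation about the z-axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ]

def transform(point, rotation, translation):
    """Apply x' = R x + t to a single (x, y, z) point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def transform_all(points, rotation, translation):
    """Apply the same rigid-body transformation to every point in a list."""
    return [transform(p, rotation, translation) for p in points]
```

Because R is a pure rotation, distances between atoms are preserved, which is what allows one structure to be moved onto another during superimposition without being distorted.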
Results and Discussion
PDBChainSaw
Extracting and deriving knowledge from PDB files
remains a non-standard procedure to date. Therefore, we
developed MolTalk to provide and facilitate access to this
valuable information. As an example of a possible use of
MolTalk, we present PDBChainSaw, a relational database
of protein structure chains, which is used in the ModSNP
project to model (...truncated)