New developments on the cheminformatics open workflow environment CDK-Taverna
Andreas Truszkowski (2), Kalai Vanii Jayaseelan (0), Stefan Neumann (1), Egon L Willighagen (3), Achim Zielesny (2), Christoph Steinbeck (0)
(0) Chemoinformatics and Metabolism, European Bioinformatics Institute (EBI), Cambridge, UK
(1) GNWI - Gesellschaft fuer naturwissenschaftliche Informatik mbH, Oer-Erkenschwick, Germany
(2) Institute for Bioinformatics and Cheminformatics, University of Applied Sciences of Gelsenkirchen, Recklinghausen, Germany
(3) Division of Molecular Toxicology, Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
Background: The computational processing and analysis of small molecules is at the heart of cheminformatics and structural bioinformatics and their application in e.g. metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it into a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.
Results: The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker architecture (version 2.0). 64-bit computing and multi-core usage by parallel threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies like workarounds for iterative data reading have been removed. The reaction enumeration features related to combinatorial chemistry are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally, the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.
Conclusions: CDK-Taverna 2.0 as an open-source cheminformatics workflow solution has matured into a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in e.g. today's systems biology scenarios.
Background
Current problems in the biosciences typically involve
several domains of research. They require a scientist to
work with different and diverse sets of data. The
reconstruction of a metabolic network from sequencing data,
for example, employs many of the data types found
along the axis of the central dogma, including
reconstruction of genome sequences, gene prediction,
determination of encoded protein families, and from there to
the substrates of enzymes, which then form the
metabolic network. In order to work with such a processing
pipeline, a scientist has to copy/paste and often
transform the data between several bioinformatics web
portals by hand. The manual approach involves repetitive
tasks and cannot be considered effective or scalable.
The processing and analysis of small molecules in particular
comprises tasks like filtering, transformation,
curation or migration of chemical data, information retrieval
with substructures, reactions, or pharmacophores as
well as the analysis of molecular data with statistics,
clustering or machine learning to support chemical
diversity requirements or to generate quantitative
structure-activity/property relationship (QSAR/QSPR)
models. These processing and analysis procedures themselves
are of increasing importance for research areas like
metabolomics or drug discovery. The power and
flexibility of the corresponding computational tools become
essential success factors for the whole research process.
The workflow paradigm addresses the above issues
with the supply of sets of elementary workers (activities)
that can be flexibly assembled in a graphical manner to
allow complex procedures to be performed in an
effective manner - without the need of specific code
development or software programming skills. Scientific
workflows allow the combination of a wide spectrum of
algorithms and resources in a single workspace [1-3].
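The idea of assembling elementary workers into a larger procedure can be sketched in a few lines of Java. This is only an illustration of the paradigm under stated assumptions: the names WorkflowSketch, Worker, DROP_BLANK and CHAR_COUNT are hypothetical and are not part of Taverna or CDK-Taverna, and the character count merely stands in for a real molecular descriptor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration: none of these types belong to Taverna or CDK-Taverna.
final class WorkflowSketch {

    // A worker maps a list of inputs to a list of outputs.
    interface Worker<I, O> extends Function<List<I>, List<O>> {}

    // Curation step: trim entries and drop null or blank ones.
    static final Worker<String, String> DROP_BLANK = in -> {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            if (s != null && !s.trim().isEmpty()) {
                out.add(s.trim());
            }
        }
        return out;
    };

    // Toy "descriptor" step: the character count of each entry.
    // A real worker would compute a chemical property instead.
    static final Worker<String, Integer> CHAR_COUNT = in -> {
        List<Integer> out = new ArrayList<>();
        for (String s : in) {
            out.add(s.length());
        }
        return out;
    };

    // The "workflow": the output of the first worker feeds the second.
    static List<Integer> run(List<String> input) {
        return DROP_BLANK.andThen(CHAR_COUNT).apply(input);
    }
}
```

In a graphical workbench the call to andThen corresponds to drawing a link between an output port and an input port, so that the chaining itself requires no code.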
Earlier problems with iterations over large data sets [4]
are completely resolved in CDK-Taverna version 2.0 due to new
implementations in Taverna. Taverna 2 allows control
structures such as while loops or if-then-else constructs.
Termination criteria for loops may now be evaluated by
listening to a state port [5]. In addition, the user
interface of the Taverna 2 workbench has clearly improved:
the design and manipulation of workflows in a
graphical workflow editor is now supported. Features like
copy/paste and undo/redo simplify workflow creation
and maintenance [6].
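The loop-with-state-port pattern described above can be illustrated as follows. This is a minimal sketch, not Taverna code: BatchReader, nextBatch and isFinished are hypothetical names, and the boolean flag only plays the role that a state port plays in a Taverna 2 workflow. The reader emits one batch per invocation and the loop repeats until the flag signals that the source is exhausted, so a large data set never has to be handled in one piece.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of a loop whose termination is driven by a state flag.
final class StatePortLoopSketch {

    // Reader worker: each call returns the next batch and updates its state.
    static final class BatchReader {
        private final Iterator<String> source;
        private final int batchSize;
        private boolean finished = false; // stands in for the "state port"

        BatchReader(Iterator<String> source, int batchSize) {
            this.source = source;
            this.batchSize = batchSize;
        }

        List<String> nextBatch() {
            List<String> batch = new ArrayList<>();
            while (batch.size() < batchSize && source.hasNext()) {
                batch.add(source.next());
            }
            finished = !source.hasNext();
            return batch;
        }

        boolean isFinished() {
            return finished;
        }
    }

    // Runs the reader until its state flag reports completion and
    // returns the size of every batch that was processed.
    static List<Integer> batchSizes(List<String> records, int batchSize) {
        BatchReader reader = new BatchReader(records.iterator(), batchSize);
        List<Integer> sizes = new ArrayList<>();
        do {
            sizes.add(reader.nextBatch().size());
        } while (!reader.isFinished());
        return sizes;
    }
}
```

The do-while form mirrors the behaviour that the looped activity runs at least once before its state is inspected; whether Taverna evaluates the condition in exactly this order is an assumption of the sketch.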
The CDK-Taverna project aims at building a free
open-source cheminformatics pipelining solution
through combination of different open-source projects
such as Taverna [7], the Chemistry Development Kit
(CDK) [8,9], or the Waikato Environment for
Knowledge Analysis (WEKA) [10]. A first integrated version
1.0 of CDK-Taverna was recently released to the public
[4]. The development of version 2.0 was motivated by the
aim of extending the usability and power of CDK-Taverna
for different molecular research purposes.
Implementation
The CDK-Taverna 2.0 plug-in makes use of the
Taverna plug-in manager for its installation. The
manager fetches all necessary information about the
plug-in from an XML file which is located at
http://www.tsconcepts.de/cdk-taverna2/plugin/. The information
provided therein contains the name of the plug-in, its
version, the repository location and the required
Taverna version. Once the URL has been submitted, the
plug-in manager downloads all necessary
dependencies automatically from the web. After a subsequent
restart the plug-in is enabled and the workers are
visible in the services. The plug-in uses Taverna version
2.2.1 [6], CDK version 1.3.8 [11] and WEKA version
3.6.4 [12]. Like its predecessor it uses the Maven 2
build system [13] as well as the Taverna workbench
for automated dependency management.
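To make the role of the descriptor file more concrete, the following sketch reads the four pieces of information named above (plug-in name, version, repository location and required Taverna version) from an XML string using the standard Java DOM parser. The element names, the sample document and the class PluginDescriptorSketch are assumptions made for illustration; the actual layout of the file served by the plug-in site is not given here and may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical illustration: element names and sample values are invented.
final class PluginDescriptorSketch {

    // Made-up sample descriptor; only the Taverna version is taken from the text.
    static final String SAMPLE =
          "<plugin>"
        + "<name>Example plug-in</name>"
        + "<version>0.0.0-example</version>"
        + "<repository>http://example.org/repository/</repository>"
        + "<tavernaVersion>2.2.1</tavernaVersion>"
        + "</plugin>";

    // Parses the descriptor and returns the four fields in document order.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> fields = new LinkedHashMap<>();
            for (String tag : new String[] {"name", "version", "repository", "tavernaVersion"}) {
                // A missing element raises an exception that is reported below.
                fields.put(tag, doc.getElementsByTagName(tag).item(0).getTextContent().trim());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable plug-in descriptor", e);
        }
    }
}
```

A caller could compare the value of the assumed tavernaVersion element with the running workbench version before any dependency is downloaded.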
CDK-Taverna 2.0 worker implementation
The CDK-Taverna 2.0 plug-in is designed to be easily
extensible: the implementation allows new
workers to be created by simply inheriting from the single abstract
class org.openscience.cdk.applications.taverna.AbstractCDKActivity
(which is the analogue of the CDKLocalWorker interface of
CDK-Taverna version 1.0). The class is located in the
cdktaverna-2-activity module. It provides all
necessary data for the underlying worker regi (...truncated)