New developments on the cheminformatics open workflow environment CDK-Taverna

https://link.springer.com/content/pdf/10.1186%2F1758-2946-3-54.pdf

Andreas Truszkowski (2), Kalai Vanii Jayaseelan (0), Stefan Neumann (1), Egon L Willighagen (3), Achim Zielesny (2), Christoph Steinbeck (0)

0 Chemoinformatics and Metabolism, European Bioinformatics Institute (EBI), Cambridge, UK
1 GNWI - Gesellschaft fuer naturwissenschaftliche Informatik mbH, Oer-Erkenschwick, Germany
2 Institute for Bioinformatics and Cheminformatics, University of Applied Sciences of Gelsenkirchen, Recklinghausen, Germany
3 Division of Molecular Toxicology, Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden

Background: The computational processing and analysis of small molecules is at the heart of cheminformatics and structural bioinformatics and their application in, e.g., metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it as a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.

Results: The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker architecture (version 2.0). 64-bit computing and multi-core usage by parallel threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies such as workarounds for iterative data reading have been removed. The reaction enumeration features related to combinatorial chemistry are considerably enhanced.
Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally, the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.

Conclusions: CDK-Taverna 2.0 has matured into a freely available and increasingly powerful open-source cheminformatics workflow solution for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in, e.g., today's systems biology scenarios.

Background

Current problems in the biosciences typically involve several domains of research. They require a scientist to work with different and diverse sets of data. The reconstruction of a metabolic network from sequencing data, for example, employs many of the data types found along the axis of the central dogma, including reconstruction of genome sequences, gene prediction, determination of encoded protein families, and from there to the substrates of enzymes, which then form the metabolic network. In order to work with such a processing pipeline, a scientist has to copy/paste and often transform the data between several bioinformatics web portals by hand. The manual approach involves repetitive tasks and cannot be considered effective or scalable.
In particular, the processing and analysis of small molecules comprise tasks such as filtering, transformation, curation or migration of chemical data, information retrieval with substructures, reactions, or pharmacophores, as well as the analysis of molecular data with statistics, clustering or machine learning to support chemical diversity requirements or to generate quantitative structure activity/property relationships (QSAR/QSPR models). These processing and analysis procedures themselves are of increasing importance for research areas like metabolomics or drug discovery. The power and flexibility of the corresponding computational tools become essential success factors for the whole research process.

The workflow paradigm addresses the above issues by supplying sets of elementary workers (activities) that can be flexibly assembled in a graphical manner, so that complex procedures can be performed effectively without the need for specific code development or software programming skills. Scientific workflows allow the combination of a wide spectrum of algorithms and resources in a single workspace [1-3].

Earlier problems with iterations over large data sets [4] are completely resolved in version 2.0 due to new implementations in Taverna. Taverna 2 allows control structures such as while loops or if-then-else constructs. Termination criteria for loops may now be evaluated by listening to a state port [5]. In addition, the user interface of the Taverna 2 workbench has clearly improved: the design and manipulation of workflows in a graphical workflow editor are now supported. Features like copy/paste and undo/redo simplify workflow creation and maintenance [6].

The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna [7], the Chemistry Development Kit (CDK) [8,9], or the Waikato Environment for Knowledge Analysis (WEKA) [10].
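The two ideas in this Background section, assembling elementary workers into a pipeline and repeating a step until a state check signals termination, can be illustrated in plain Java. The following is a minimal sketch under stated assumptions, not Taverna or CDK-Taverna code: the class WorkflowSketch and the methods assemble, loopWhile and cleanStructures are hypothetical names introduced only for this illustration, and in Taverna itself workflows are assembled graphically rather than in code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration only: none of these names exist in Taverna or CDK-Taverna.
final class WorkflowSketch {

    // Chain elementary workers so that each one consumes the previous worker's output.
    static <T> Function<T, T> assemble(List<Function<T, T>> workers) {
        Function<T, T> pipeline = Function.identity();
        for (Function<T, T> worker : workers) {
            pipeline = pipeline.andThen(worker);
        }
        return pipeline;
    }

    // Repeat a step while the state check reports remaining work (bounded for safety).
    static <T> T loopWhile(T start, Function<T, T> step, Predicate<T> workRemains, int maxIterations) {
        T state = start;
        for (int i = 0; i < maxIterations && workRemains.test(state); i++) {
            state = step.apply(state);
        }
        return state;
    }

    // Example workflow: strip whitespace and drop blank entries, then drop duplicates.
    static List<String> cleanStructures(List<String> input) {
        Function<List<String>, List<String>> dropBlank = in -> {
            List<String> out = new ArrayList<>();
            for (String s : in) {
                String trimmed = s.trim();
                if (!trimmed.isEmpty()) {
                    out.add(trimmed);
                }
            }
            return out;
        };
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Function<List<String>, List<String>> dropDuplicates =
                in -> new ArrayList<>(new LinkedHashSet<>(in));
        List<Function<List<String>, List<String>>> workers =
                Arrays.asList(dropBlank, dropDuplicates);
        return assemble(workers).apply(input);
    }

    public static void main(String[] args) {
        System.out.println(cleanStructures(Arrays.asList("CCO", " ", "CCO", "c1ccccc1")));
        // Advance a read position in batches of 2 until 5 records are covered.
        System.out.println(WorkflowSketch.<Integer>loopWhile(0, p -> p + 2, p -> p < 5, 100));
    }
}
```

Keeping each worker as a function from input to output means that workers can be reordered or swapped without touching the others, which is the property the graphical assembly relies on. The iteration bound in loopWhile is a safety choice of this sketch and is not taken from the article.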
A first integrated version 1.0 of CDK-Taverna was recently released to the public [4]. The development of version 2.0 was motivated by the aim of extending the usability and power of CDK-Taverna for different molecular research purposes.

Implementation

The CDK-Taverna 2.0 plug-in makes use of the Taverna plug-in manager for its installation. The manager fetches all necessary information about the plug-in from an XML file which is located at http://www.tsconcepts.de/cdk-taverna2/plugin/. The information provided therein contains the name of the plug-in, its version, the repository location and the required Taverna version. Once the URL has been submitted to the plug-in manager, the manager downloads all necessary dependencies automatically from the web. After a subsequent restart the plug-in is enabled and the workers are visible in the services. The plug-in uses Taverna version 2.2.1 [6], CDK version 1.3.8 [11] and WEKA version 3.6.4 [12]. Like its predecessor it uses the Maven 2 build system [13] as well as the Taverna workbench for automated dependency management.

CDK-Taverna 2.0 worker implementation

The CDK-Taverna 2.0 plug-in is designed to be easily extensible: the implementation allows new workers to be created by simply inheriting from the single abstract class org.openscience.cdk.applications.taverna.AbstractCDKActivity (which is the analogue of the CDKLocalWorker interface of CDK-Taverna version 1.0). The class is located in the cdktaverna-2-activity module. It provides all necessary data for the underlying worker regi (...truncated)