XiP: a computational environment to create, extend and share workflows

BIOINFORMATICS APPLICATIONS NOTE
Systems biology
Vol. 29 no. 1 2013, pages 137–139
doi:10.1093/bioinformatics/bts630
Advance Access publication October 25, 2012

Masao Nagasaki1,*, André Fujita2, Yayoi Sekiya3, Ayumu Saito3, Emi Ikeda3, Chen Li3 and Satoru Miyano3

Associate Editor: Martin Bishop

ABSTRACT

XiP (eXtensible integrative Pipeline) is a flexible, editable and modular environment with a user-friendly interface that does not require advanced programming skills to run, construct and edit workflows. XiP allows the construction of workflows by linking components written in both R and Java, the analysis of high-throughput data in grid engine systems and the development of customized pipelines that can be encapsulated in a package and distributed. XiP already comes with several ready-to-use pipeline flows for the most common genomic and transcriptomic analyses and 300 computational components.

Availability: XiP is open source, freely available under the Lesser General Public License (LGPL) and can be downloaded from http://xip.hgc.jp.

Received on May 10, 2012; revised on October 10, 2012; accepted on October 17, 2012

1 INTRODUCTION

Large-scale sequencing and microarray technologies are high-throughput methodologies that generate large volumes of genomic and transcriptomic data, which must be processed in a multi-step fashion. This processing is usually carried out by several distinct programs that are interconnected in a specific order, forming a workflow process, that is, a pipeline (Durham et al., 2004; Fujita et al., 2007). For example, a simple workflow to investigate genes potentially related to cancer might begin with microarray image analysis, followed by normalization, statistical tests to identify genes differentially expressed between normal and tumor tissues, and finally a multiple-test P-value correction. Shah et al.
(2004) have described that pipelines must satisfy at least three characteristics: (i) flexibility: the software can be used to analyze different data sets that may require different analysis tools; (ii) integrability: the system should provide a framework that facilitates the integration of analysis results from different tools; and (iii) extensibility: the system needs to allow the inclusion of new tools in a modular fashion. In addition to these characteristics, which are necessary for any pipeline, we believe that portability to grid engines and interoperability with pre-existing systems are also important in this new era of high-throughput data generation. Portability to grid engines makes it possible to run heavy routines on supercomputers (hundreds of cores) with little effort, while interoperability allows the use of workflows constructed on different platforms. To facilitate the construction of workflows, we present XiP (eXtensible integrative Pipeline), a free [under the Lesser General Public License (LGPL)] and easy-to-use environment designed to integrate state-of-the-art computational methods and to satisfy researchers’ needs in multi-collaborative projects.

*To whom correspondence should be addressed.

2 IMPLEMENTATION

XiP was developed entirely in Java and runs on the client’s machine via Java Web Start technology. In other words, XiP runs on the majority of operating systems, requiring only that the Java Runtime Environment (JRE version 1.6) be installed on the client’s machine beforehand. If the JRE is not installed, the installation package asks for permission to install it. Although XiP was originally designed to run via the Web, it can also be installed on local machines. XiP already comes with 300 components, where each component represents one computational algorithm (e.g. Support Vector Machine, k-means, t-test, etc.).
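As an illustration only, the idea of components that each wrap one algorithm and are linked in a fixed order can be sketched in Python. This is not XiP's actual interface (XiP components are written in Java or R); every function name, the toy expression values and the choice of Welch's t statistic as the scoring step are assumptions made for this sketch.

```python
import math
from statistics import mean, variance
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical types for this sketch; they do not mirror XiP's own classes.
Expression = Dict[str, Tuple[List[float], List[float]]]  # gene -> (normal, tumor)
Component = Callable[[Any], Any]


def log2_transform(data: Expression) -> Expression:
    """Normalization stand-in: log2-transform every intensity."""
    return {
        gene: ([math.log2(x) for x in normal], [math.log2(x) for x in tumor])
        for gene, (normal, tumor) in data.items()
    }


def welch_t(data: Expression) -> Dict[str, float]:
    """Welch's t statistic (tumor minus normal) for each gene."""
    scores = {}
    for gene, (normal, tumor) in data.items():
        se = math.sqrt(variance(normal) / len(normal) + variance(tumor) / len(tumor))
        scores[gene] = (mean(tumor) - mean(normal)) / se
    return scores


def mean_difference(data: Expression) -> Dict[str, float]:
    """Alternative scoring component: difference of group means."""
    return {gene: mean(tumor) - mean(normal) for gene, (normal, tumor) in data.items()}


def rank_by_magnitude(scores: Dict[str, float]) -> List[str]:
    """Order genes from largest to smallest absolute score."""
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


def run_pipeline(components: Sequence[Component], data: Any) -> Any:
    """Apply the components in order, feeding each output to the next."""
    for component in components:
        data = component(data)
    return data


# Toy intensities, invented for the example.
raw = {
    "geneA": ([100.0, 110.0, 90.0], [400.0, 420.0, 380.0]),
    "geneB": ([200.0, 210.0, 190.0], [205.0, 195.0, 200.0]),
}

pipeline = [log2_transform, welch_t, rank_by_magnitude]
ranking = run_pipeline(pipeline, raw)

# Adapting the workflow replaces one component; the others are reused.
adapted = [log2_transform, mean_difference, rank_by_magnitude]
adapted_ranking = run_pipeline(adapted, raw)
```

Because each step only consumes the previous step's output, replacing `welch_t` with `mean_difference` changes the analysis without touching the normalization or ranking steps.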
XiP also recognizes components written in R (R Development Core Team, 2011), one of the most popular statistical programming languages in bioinformatics. For data input, XiP accepts any basic Java and R data structures, as well as the Cell System Markup Language (CSML) (Nagasaki et al., 2010), Cell System Ontology (CSO) (Jeong et al., 2007), Cell System Markup Language Data Base (CSMLDB) and CSODB formats. The complete list of the 300 components that come with XiP, tutorials, documentation and some example pipelines are available at the XiP project webpage (http://xip.hgc.jp).

3 RESULTS AND DISCUSSIONS

With the advances in the generation of high-throughput data and the development of large-scale projects involving dozens of labs around the world, computational pipelines become crucial and indispensable, especially when the same protocol must be

1Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, Japan, 2Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Brazil and 3Human Genome Center, Institute of Medical Science, University of Tokyo, Japan

© The Author 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

(1) Flexibility: The specific requirements of a research project make it difficult to reuse a pipeline designed for one data set to analyze another. As a result, two different pipelines must be constructed, even though they share several common components. However, it is not necessary to reconstruct the entire pipeline, only the parts that differ. As the pipelines constructed by XiP are modular, i.e.
the pipelines are composed of an ordered sequence of components, one must replace only the components that differ to adapt the pipeline to a new data set.

(2) Integrability: Components written in both the R and Java programming languages run on XiP. Internally, XiP translates R data structures into Java structures, thus allowing the connection of packages available at the R webpage (http://www.r-project.org) and the BioJava project (Holland et al., 2008).

(3) Extensibility: R and Java functions developed by different groups can automatically be translated to a XiP component and included in the platform. Therefore, XiP can be customized and extended with additional components according to the user’s needs.

(4) Portability to grid engine: The analysis of the large amounts of data generated by new technological approaches in molecular biology requires high-performance computational resources. The XiP platform allows the construction of pipelines that use grid engines to parallelize computational jobs. To run a parallel job, the user must set up a cluster (server) with several cores and log in to this remote server. The integration with grid engines makes XiP suitable for individual researchers with modest (...truncated)