XiP: a computational environment to create, extend and share workflows
BIOINFORMATICS
APPLICATIONS NOTE
Systems biology
Vol. 29 no. 1 2013, pages 137–139
doi:10.1093/bioinformatics/bts630
Advance Access publication October 25, 2012
Masao Nagasaki1,*, André Fujita2, Yayoi Sekiya3, Ayumu Saito3, Emi Ikeda3, Chen Li3 and
Satoru Miyano3
Associate Editor: Martin Bishop
ABSTRACT
XiP (eXtensible integrative Pipeline) is a flexible, editable and modular
environment with a user-friendly interface that does not require
previous advanced programming skills to run, construct and edit
workflows. XiP allows the construction of workflows by linking components written in both R and Java, the analysis of high-throughput
data in grid engine systems and also the development of customized
pipelines that can be encapsulated in a package and distributed. XiP
already comes with several ready-to-use pipeline flows for the most
common genomic and transcriptomic analyses and 300 computational components.
Availability: XiP is open source, freely available under the Lesser
General Public License (LGPL) and can be downloaded from
http://xip.hgc.jp.
Contact:
Received on May 10, 2012; revised on October 10, 2012; accepted on
October 17, 2012
1 INTRODUCTION
Large-scale sequencing and microarray technologies are
high-throughput methodologies that generate huge amounts of genomic
and transcriptomic data that must be processed in a multi-step
fashion. This processing is usually carried out by several distinct programs
that are interconnected in a specific order, forming a workflow
process known as a pipeline (Durham et al., 2004; Fujita et al.,
2007). For example, a simple workflow to investigate genes potentially
related to cancer might begin with microarray image analysis, followed by
normalization, statistical tests to identify differentially expressed genes
between normal and tumor tissues and, finally, a multiple-test P-value correction.
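The last step of the example workflow can be sketched as follows. The text does not name a correction method, so the Benjamini-Hochberg procedure is used here only as one common choice. The gene names and raw P-values are hypothetical, and the function is an illustration in Python, not a XiP component.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values, sorted from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values from a per-gene test (illustrative only).
raw = {"geneA": 0.001, "geneB": 0.008, "geneC": 0.039, "geneD": 0.041, "geneE": 0.60}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Genes kept at a false discovery rate threshold of 0.05 (assumed threshold).
significant = [gene for gene, q in adjusted.items() if q <= 0.05]
```

In a real workflow the raw P-values would come from the preceding statistical test step rather than from a literal dictionary.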
Shah et al. (2004) argued that pipelines must satisfy at
least three characteristics: (i) flexibility: the software can be used to
analyze different data sets that may require different analysis
tools; (ii) integrability: the system should provide a framework
that facilitates the integration of analysis results from different
tools; and (iii) extensibility: the system needs to allow for the
inclusion of new tools in a modular fashion.
*To whom correspondence should be addressed.
In addition to these characteristics, which are necessary for any
pipeline, we believe that portability to grid engines and
interoperability with pre-existing systems are also important
in this new era of high-throughput data generation. Portability
to grid engines makes it possible to run heavy routines on
supercomputers (hundreds of cores) in an easy manner, while
interoperability allows the use of workflows constructed under
different platforms.
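The idea of spreading a heavy routine over many cores can be illustrated with a small sketch. It uses a local thread pool from the Python standard library purely as a stand-in for a grid engine. The chunk size, the worker count, the gene data and the `score_chunk` routine are all hypothetical, and none of this reflects how XiP submits jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(chunk):
    """Placeholder for a heavy routine: the mean expression of each gene."""
    return [(gene, sum(values) / len(values)) for gene, values in chunk]


# Hypothetical input: ten genes with two measurements each.
genes = [(f"gene{i}", [float(i), float(i + 2)]) for i in range(10)]

# Each chunk is an independent job; the pool plays the role of the cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the submitted chunks.
    partial_results = list(pool.map(score_chunk, chunked(genes, 3)))

# Merge the per-chunk results back into a single table.
scores = dict(pair for part in partial_results for pair in part)
```

The pattern is the same on a real cluster: split the input into independent jobs, run them in parallel, then merge the partial results.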
To facilitate the construction of workflows, we present XiP
(eXtensible integrative Pipeline), a free [under the Lesser
General Public License (LGPL)] and easy-to-use environment
designed to integrate state-of-the-art computational methods
and to satisfy researchers’ needs in multi-collaborative projects.
2 IMPLEMENTATION
XiP was developed entirely in Java and runs on the client’s machine
via the Java Web Start technology. In other words, XiP
runs on most operating systems, requiring only that the
Java Runtime Environment (JRE version 1.6) be pre-installed
on the client’s machine. If JRE is not installed, the installation
package asks for permission to install it. Although
XiP was originally designed to run via the Web, it can also be
installed on local machines.
XiP already comes with 300 components, each of which represents
one computational algorithm (e.g. Support
Vector Machine, k-means, t-test, etc.). XiP also recognizes components
written in R (R Development Core Team, 2011), one of
the most popular statistical programming languages in
Bioinformatics.
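The notion of a workflow as an ordered chain of components, each wrapping one algorithm, can be sketched in a few lines. This is a language-neutral illustration written in Python. The registry, the `component` decorator, the component names and the input values are hypothetical and are not XiP's component API.

```python
import math
from statistics import median

# Hypothetical registry: component name -> function from list of floats to list of floats.
COMPONENTS = {}


def component(name):
    """Register a function as a named component."""
    def register(func):
        COMPONENTS[name] = func
        return func
    return register


@component("log2")
def log2_transform(values):
    return [math.log2(v) for v in values]


@component("median_center")
def median_center(values):
    mid = median(values)
    return [v - mid for v in values]


@component("mean_center")
def mean_center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def run_pipeline(step_names, data):
    """Apply the named components in order, feeding each output to the next step."""
    for name in step_names:
        data = COMPONENTS[name](data)
    return data


intensities = [1.0, 2.0, 4.0, 8.0, 64.0]  # hypothetical raw intensities

# Two pipelines that share the first component and differ only in the second.
result_a = run_pipeline(["log2", "median_center"], intensities)
result_b = run_pipeline(["log2", "mean_center"], intensities)
```

Adapting a pipeline to a new data set then amounts to replacing one name in the list of steps, and adding a new algorithm amounts to registering one more function.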
For data input, XiP accepts any Java and R basic data structures, Cell System Markup Language (CSML) (Nagasaki et al.,
2010), Cell System Ontology (CSO) (Jeong et al., 2007), Cell
System Markup Language Data Base (CSMLDB) and CSODB
formats.
The complete list of components that come with XiP (300
components), tutorials, documentation and some example pipelines
are available at the XiP project webpage (http://xip.hgc.jp).
3 RESULTS AND DISCUSSIONS
With the advances in the generation of high-throughput data and
the development of large-scale projects, which involve dozens of
labs around the world, computational pipelines become crucial
and indispensable, especially when the same protocol must be
© The Author 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits
unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, Japan,
2Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Brazil and
3Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
(1) Flexibility: The specific requirements of a research project
make it difficult to use a pipeline designed for a particular
data set for analysis of another data set. As a result, two
different pipelines must be constructed, both sharing several common components. However, notice that it is not
necessary to reconstruct the entire pipeline, but only the
different parts. As the pipelines constructed by XiP are
modular, i.e. the pipelines are composed of an ordered
sequence of components, one must replace only the different components to adapt the pipeline to a new data set.
(2) Integrability: Components written in both the R and Java
programming languages run on XiP. Internally, XiP translates
the R data structures into Java structures, thus allowing
the connection of packages available at the R webpage
(http://www.r-project.org) and the BioJava project
(Holland et al., 2008).
(3) Extensibility: R and Java functions developed by different
groups can automatically be translated into a XiP component
and included in the platform. Therefore, XiP can be
customized and extended with additional components
depending on the user’s needs.
(4) Portability to grid engine: The analysis of large amounts of
data generated by the new technological approaches in
molecular biology requires high-performance computational resources. The XiP platform allows the construction
of pipelines that use grid engines to parallelize computational jobs. To run a parallel job, the user must set up a
cluster (server) with several cores and log in to this remote
server. The integration with grid engines makes XiP suitable for individual researchers with modest (...truncated)