Scientific workflow systems: Pipeline Pilot and KNIME
Wendy A. Warr
W. A. Warr, Wendy Warr & Associates, Holmes Chapel, Crewe, Cheshire CW4 7HZ,
UK
Pipeline Pilot and KNIME are now the market leaders in personal productivity in cheminformatics.
In Pipeline Pilot, users can graphically compose protocols,
using hundreds of different configurable components for
operations such as data retrieval, manipulation,
computational filtering, and display [11]. KNIME has a graphical
user interface for combining nodes [12]. Collections of
nodes are known as extensions. KNIME is based on the
Eclipse [13] open source platform and on Java. Java is also part of
the foundation of Pipeline Pilot: programmers can
create new components with the Java components API or
write new clients against the Java SDK. In addition,
Pipeline Pilot has its own scripting language (for
nonprogrammers); Pipeline Pilot has much more cheminformatics
technology built in, and its scripting is more concise. There were
initially very few chemistry nodes for use with KNIME,
and adding a new one required Java programming, but
many more nodes are now being added.
KNIME uses a workflow methodology in which task 1 is
completed then the data are handed off to task 2 which is
completed before the data are handed on to task 3 and so
on. In pipelining (as in Pipeline Pilot), task 1 is completed
on compound 1 and the data are passed to task 2. Task 1
can then start on the next compound. In short, the data
stream from task 1 to task 2. The process can scale without impact
on memory, and efficiency is gained if a downstream
operation can be commenced on some records while an
upstream operation is still working on others. The
table-by-table processing of KNIME offers benefits such as multiple
iterations over the same data (important for many data
mining algorithms); the ability always to view intermediate
results on the connections between nodes even after the
workflow has been executed; and the ability to restart the
workflow at any intermediate node. The penalty is the need
to store the data somewhere, but it is easier to cache the
data at the end of each task. In data pipelining, a cache of
all the data can be added as a "finish here and resume"
component.
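The difference in execution order between the two models can be illustrated with a minimal Python sketch. It is not taken from either product: the function names (task1, task2, run_table_by_table, run_pipelined) and the toy integer "records" are assumptions made purely for illustration, and the pipelined version is single-threaded, so it shows only the record-by-record hand-off and the absence of stored intermediate tables, not true concurrent execution.

```python
from typing import Callable, Iterable, Iterator, List

# Records the order in which tasks touch records (illustration only).
log: List[str] = []


def task1(x: int) -> int:
    """Hypothetical first task, e.g. computing a property."""
    log.append(f"task1({x})")
    return x * 2


def task2(x: int) -> int:
    """Hypothetical second task, e.g. a follow-up calculation."""
    log.append(f"task2({x})")
    return x + 1


def run_table_by_table(
    records: Iterable[int], tasks: List[Callable[[int], int]]
) -> List[List[int]]:
    """Each task finishes on the whole table before the next task starts.

    Every intermediate table is kept, so results can be inspected later
    and the run can be resumed from any stage, at the cost of storage.
    """
    tables = [list(records)]
    for task in tasks:
        tables.append([task(r) for r in tables[-1]])
    return tables


def run_pipelined(
    records: Iterable[int], tasks: List[Callable[[int], int]]
) -> Iterator[int]:
    """Each record passes through every task before the next record is read.

    map() is lazy, so no intermediate table is ever materialized.
    """
    stream: Iterator[int] = iter(records)
    for task in tasks:
        stream = map(task, stream)
    return stream


if __name__ == "__main__":
    tables = run_table_by_table([1, 2, 3], [task1, task2])
    print(log)        # all task1 calls, then all task2 calls
    print(tables[1])  # intermediate result after task1 is still available

    log.clear()
    print(list(run_pipelined([1, 2, 3], [task1, task2])))
    print(log)        # task1 and task2 calls alternate, record by record
```

In the table-by-table run, the stored intermediate table (tables[1] here) is what makes it possible to view results between tasks or to rerun only the second task. In the pipelined run, nothing is retained between tasks unless an explicit caching step is added.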
KNIME came into the market from a data mining
background while Pipeline Pilot came from
cheminformatics. In practice, Pipeline Pilot and KNIME are
complementary. In some markets they do not compete at all
(Pipeline Pilot is not aimed at non-scientific applications)
and Pipeline Pilot has a separate role to play within
the Accelrys software portfolio. As a gross generalization, users
say that Pipeline Pilot is very expensive but easy to use,
while KNIME is free (or less expensive), and less easy to
use. KNIME would counter by saying that ease of use is
a subjective criterion, and familiarity with another system
may have a bearing on it. Accelrys would argue that on a
total cost of ownership basis (including factors such as IT
costs, developer costs, and support levels), rather than
initial purchase cost alone, the differential between
Pipeline Pilot and KNIME is not as great as it first appears.
Pipeline Pilot is very memory efficient; KNIME is not as
scalable (although that issue is being addressed, as we shall
see later). Some might say that Pipeline Pilot is
professional while KNIME suffers from its non-commercial
background, but others actually prefer the open source
nature and community spirit of KNIME.
At the outset in 2004, KNIME was not aimed specifically at
cheminformatics but it was initially taken up by the
cheminformatics community. Nowadays fifty percent of users are
in other disciplines; KNIME is a business intelligence or
predictive analytics product. The Economist uses KNIME
in customer relationship management. A major telecom uses
it in social networks and text mining. Private banks in Zurich
and the Grand Casino in Lucerne use it. The Pasteur Institute
is adding sequencing extensions. Small biotechs and pharma
are using KNIME. In all there are 9,000 registered users or
organizations and over 500,000 copies have been
downloaded [14].
KNIME is not sold: it is free. The business model [15]
involves licensing enterprise components allowing users to
exchange workflows and build Web portals. KNIME
already has a free reporting engine, for example, but for
corporate-wide use, the KNIME server adds value with the
WebPortal. Users who do not pay for KNIME can call Web
Services but enterprise users can make better use of Web
Services. The company was formed in 2006 and moved to
Zurich in 2008. It is small but profitable: it has 15 staff,
most of them involved in technical development. There are
currently three openings for new staff to relieve developers
of commercial pressures, but KNIME does not need a big
sales and marketing resource: companies call up KNIME
spontaneously. Often they are already using the free
software.
KNIME has a non-exclusive technology partnership
with PerkinElmer Informatics which supports the
enterprise KNIME solution and is a global distributor. (There
are five other local distributors.) Fifteen KNIME partners
(e.g., Schrödinger, Tripos, Infocom (for ChemAxon),
BioSolveIT, Chemical Computing Group, Cresset,
Dotmatics, Molecular Discovery, and Molegro) have added
cheminformatics tools. There is a KNIME-Spotfire bridge
that allows users to call KNIME workflows from within
Spotfire. From 2010, a community aspect [16] has been
growing. CDK [17], Indigo [18] and RDKit [19] nodes
have been added, the RDKit ones through Novartis.
KNIME has a broad spectrum of contributors from
software vendors, academia and pharma. Collaborations
between pharmaceutical companies are becoming more
commonplace: OpenPHACTS [20] and the Pistoia Alliance
[21] are examples. There is an informal KNIME
precompetitive pharma group that includes teams within
AstraZeneca, Boehringer Ingelheim, Evotec, Lilly,
Novartis, Pfizer, Sanofi-Aventis, Syngenta and Vernalis.
Through Erl Wood Informatics, the Lilly group has made
30 nodes open source. These include format converters,
fingerprinting, docking, viewers, R-group analysis,
matched pairs, scoring and ranking, multi-objective
optimization, reaction vectors [22] and activity cliffs.
Many proprietary nodes are available in-house at Lilly.
Using KNIME, Lilly can present a common interface for
BioSolveIT and Cresset software; cheminformatics and
data mining tools can all be mixed in one desktop
environment. Companies use KNIME in different ways. Lilly
uses it to deliver applications to chemists while other
companies use it more in computational chemistry.
Novartis has developed a collection of KNIME nodes for
working with internally developed algorithms and services.
Beyond computational chemistry, KNIME is used within
the Novartis Research IT department for tasks like tracking
and reporting on the utilization of high-performance
computing resources. At Boehringer Ingelheim both
Pipeline Pilot and KNIME are used. Automated calculation engines
are partly deployed via KNIME; nodes developed in-house
are used to i (...truncated)