Scientific workflow systems: Pipeline Pilot and KNIME

https://link.springer.com/content/pdf/10.1007%2Fs10822-012-9577-7.pdf

Wendy A. Warr

W. A. Warr, Wendy Warr & Associates, Holmes Chapel, Crewe, Cheshire CW4 7HZ, UK

... Pipeline Pilot are now the market leaders in personal productivity in cheminformatics.

In Pipeline Pilot, users can graphically compose protocols, using hundreds of different configurable components for operations such as data retrieval, manipulation, computational filtering, and display [11]. KNIME has a graphical user interface for combining nodes [12]. Collections of nodes are known as extensions. KNIME is based on the Eclipse [13] open source platform and on Java. Java is part of the foundation of Pipeline Pilot, and programmers can create new components with the Java components API or write new clients against the Java SDK. In addition, Pipeline Pilot has its own scripting language (for nonprogrammers); it has much more cheminformatics technology built in, and scripting is more concise. There were initially very few chemistry nodes for use with KNIME, and adding a new one required Java programming, but many more nodes are now being added.

KNIME uses a workflow methodology in which task 1 is completed, then the data are handed off to task 2, which is completed before the data are handed on to task 3, and so on. In pipelining (as in Pipeline Pilot), task 1 is completed on compound 1 and the data are passed to task 2. Task 1 can then start on the next compound. In short, the data stream from task 1 to task 2. The process can scale without impact on memory, and efficiency is gained if a downstream operation can be commenced on some records while an upstream operation is still working on others.

The table-by-table processing of KNIME offers benefits such as multiple iterations over the same data (important for many data mining algorithms); the ability always to view intermediate results on the connections between nodes, even after the workflow has been executed; and the ability to restart the workflow at any intermediate node.
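
The contrast between table-by-table execution and record-by-record pipelining can be sketched in a few lines of Python. This is only an illustration of the two execution orders described above, not code from either product: the function names `run_table_by_table`, `run_streaming`, and `make_task`, and the dictionary record format, are assumptions made for the example. The streaming version is single-threaded, so it shows the ordering and memory behavior of pipelining but not the concurrency between upstream and downstream tasks.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
Task = Callable[[Record], Record]


def make_task(name: str, log: List[Tuple[str, object]]) -> Task:
    """Build a hypothetical task that records when it runs on which record."""
    def task(rec: Record) -> Record:
        log.append((name, rec["id"]))
        return {**rec, name: True}  # return a new record, leave the input intact
    return task


def run_table_by_table(records: Iterable[Record], tasks: List[Task]) -> List[List[Record]]:
    """Finish each task on the whole table before starting the next task.

    Every intermediate table is kept, so it can be inspected afterwards or
    used as a restart point, at the cost of storing it.
    """
    tables: List[List[Record]] = [list(records)]
    for task in tasks:
        tables.append([task(rec) for rec in tables[-1]])
    return tables


def run_streaming(records: Iterable[Record], tasks: List[Task]) -> Iterator[Record]:
    """Pass each record through every task before reading the next record.

    Only one record is in flight at a time, so memory use does not grow with
    the number of records, but no intermediate table is retained.
    """
    for rec in records:
        for task in tasks:
            rec = task(rec)
        yield rec


if __name__ == "__main__":
    compounds = [{"id": 1}, {"id": 2}]
    events: List[Tuple[str, object]] = []
    steps = [make_task("task1", events), make_task("task2", events)]

    run_table_by_table(compounds, steps)
    print("table-by-table order:", events)

    events.clear()
    list(run_streaming(compounds, steps))
    print("streaming order:", events)
```

In the table-by-table run, `task1` is logged for every compound before `task2` starts, and the returned list holds one table per step. In the streaming run, each compound passes through both tasks before the next compound is read, and only the final records are produced.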
The penalty of table-by-table processing is the need to store the data somewhere, but it is easier to cache the data at the end of each task. In data pipelining, a cache of all the data can be added as a "finish here and resume" component.

KNIME came into the market from a data mining background while Pipeline Pilot came from cheminformatics. In practice, Pipeline Pilot and KNIME are complementary. In some markets they do not compete at all (Pipeline Pilot is not aimed at non-scientific applications), and Pipeline Pilot has a separate role to play within the Accelrys software portfolio.

As a gross generalization, users say that Pipeline Pilot is very expensive but easy to use, while KNIME is free (or less expensive) and less easy to use. KNIME would counter by saying that ease of use is a subjective criterion, and familiarity with another system may have a bearing on it. Accelrys would argue that on a total cost of ownership basis (including factors such as IT costs, developer costs, and support levels), rather than initial purchase cost alone, the differential between Pipeline Pilot and KNIME is not as great as it first appears. Pipeline Pilot is very memory efficient; KNIME is not as scalable (although that issue is being addressed, as we shall see later). Some might say that Pipeline Pilot is professional while KNIME suffers from its non-commercial background, but others actually prefer the open source nature and community spirit of KNIME.

At the outset in 2004, KNIME was not aimed specifically at cheminformatics, but it was initially taken up by the cheminformatics community. Nowadays fifty percent of users are in other disciplines; KNIME is a business intelligence or predictive analytics product. The Economist uses KNIME in customer relationship management. A major telecom uses it in social networks and text mining. Private banks in Zurich and the Grand Casino in Lucerne use it. The Pasteur Institute is adding sequencing extensions. Small biotechs and pharma are using KNIME.
In all there are 9,000 registered users or organizations and over 500,000 copies have been downloaded [14].

KNIME is not sold: it is free. The business model [15] involves licensing enterprise components allowing users to exchange workflows and build Web portals. KNIME already has a free reporting engine, for example, but for corporate-wide use, the KNIME server adds value with the WebPortal. Users who do not pay for KNIME can call Web Services but enterprise users can make better use of Web Services.

The company was formed in 2006 and moved to Zurich in 2008. It is small but profitable: it has 15 staff, most of them involved in technical development. There are currently three openings for new staff to relieve developers of commercial pressures, but KNIME does not need a big sales and marketing resource: companies call up KNIME spontaneously. Often they are already using the free software.

KNIME has a non-exclusive technology partnership with PerkinElmer Informatics, which supports the enterprise KNIME solution and is a global distributor. (There are five other local distributors.) Fifteen KNIME partners (e.g., Schrödinger, Tripos, Infocom (for ChemAxon), BioSolveIT, Chemical Computing Group, Cresset, Dotmatics, Molecular Discovery, and Molegro) have added cheminformatics tools. There is a KNIME-Spotfire bridge that allows users to call KNIME workflows from within Spotfire.

From 2010, a community aspect [16] has been growing. CDK [17], Indigo [18] and RDKit [19] nodes have been added, the RDKit ones through Novartis. KNIME has a broad spectrum of contributors from software vendors, academia and pharma. Collaborations between pharmaceutical companies are becoming more commonplace: OpenPHACTS [20] and the Pistoia Alliance [21] are examples. There is an informal KNIME precompetitive pharma group that includes teams within AstraZeneca, Boehringer Ingelheim, Evotec, Lilly, Novartis, Pfizer, Sanofi-Aventis, Syngenta and Vernalis.
Through Erl Wood Informatics, the Lilly group has made 30 nodes open source. These include format converters, fingerprinting, docking, viewers, R-group analysis, matched pairs, scoring and ranking, multi-objective optimization, reaction vectors [22] and activity cliffs. Many proprietary nodes are available in-house at Lilly. Using KNIME, Lilly can present a common interface for BioSolveIT and Cresset software; cheminformatics and data mining tools can all be mixed in one desktop environment.

Companies use KNIME in different ways. Lilly uses it to deliver applications to chemists while other companies use it more in computational chemistry. Novartis has developed a collection of KNIME nodes for working with internally developed algorithms and services. Beyond computational chemistry, KNIME is used within the Novartis Research IT department for tasks like tracking and reporting on the utilization of high-performance computing resources. At Boehringer Ingelheim both Pipeline Pilot and KNIME are used. Automated calculation engines are partly deployed via KNIME; nodes developed in-house are used to i (...truncated)