The Bio* toolkits — a brief overview

Briefings in Bioinformatics, Sep 2002

Bioinformatics research is often difficult to do with commercial software. The Open Source BioPerl, BioPython and BioJava projects provide toolkits with multiple functionality that make it easier to create customised pipelines or analysis. This review briefly compares the quirks of the underlying languages and the functionality, documentation, utility and relative advantages of the Bio* counterparts, particularly from the point of view of the beginning biologist programmer.

This article is directed to the beginning bioinformaticist or biologist who is thinking of learning a programming language to help with their work. If you are familiar with Perl, Python or Java, your decision is probably already made, based on your current preferred language. However, if you have not already passed that developmental checkpoint, this overview may help you decide which one to pursue.
Bioinformatics is a young science, and while there are a number of commercial applications aimed at researchers in biology, these are often not sufficient for the level of data analysis required in bioinformatics research. It was partly frustration with commercial suites that drove the founding of the Bio* groups. (The Bio* name uses the regular expression operator * to denote all characters; it is shorthand for BioPerl, BioJava, BioPython, etc.) The Bio* group (formally the Open Bioinformatics Foundation [1]) was formed by a group of self-described Perl hackers who got together in 1995 to pool resources for writing bioinformatics software. The group saw that much fine-grained functionality was extremely useful and that, if the program source code could be shared, it could easily be worked into functional programs. The same idea gave birth to the BioPython and BioJava groups in 1999, and the BioCORBA and BioDAS groups have been added since. The BioRuby [2], BioLisp [3] and Bioinformatics.org [4] groups share a similar vision and are worth investigating for useful perspective and resources, but are officially unaffiliated.
Before you dive into a long-term commitment to a language and its Bio* derivative, it is useful to see how it is perceived by the various stakeholders. Table 1 shows a quick and simple survey based on scanning the Usenet newsgroups, Google and Amazon.com as a crude measure of each language's popularity and support. Over the last 10-month period, the membership of each group has grown by about 50 per cent. By far the largest number of posts is in the BioPerl group; the BioJava group is gaining steam, and the number in the BioPython group tends to be quite a bit lower, reflecting its lower membership (see Table 2).
[Tables 1 and 2 are not reproduced in this preview. Surviving column and row labels: Language, Amazon, Perl, Java; BioPerl-l, BioJava-l, BioPython; 28th August 2001, 21st June 2002, Total posts.]
All the Bio* projects described here use the eponymous base language and endow it with Bio features via additional modules or libraries. A brief overview of the base languages follows.
Perl, Python and Java are all interpreted, which means that they are slower than a compiled language such as C or C++ (typically three-quarters to one-tenth as fast, depending on the type of logic being implemented), but they are hardly slouches. If speed is an issue, all of them can be made to link to compiled libraries: Java via the Java Native Interface, Python via its C interface and Perl via its XS routines. The latter two can also make use of SWIG [5], a more portable way of interfacing polyglot code. Two examples of how an interpreted language can be used in high-performance computing are PyMol [6], a molecular visualisation and modelling application, and the Perl Data Language [7], which uses libraries of compiled code and an object-oriented (OO) approach to allow very fast computation on N-dimensional arrays.
In exchange for the speed of compiled code, all these languages are considerably easier to program in. Mercifully, none of them requires that you manually track and manipulate memory allocation, and none requires (or even permits) use of the much-hated memory pointer.
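As a hedged illustration of linking an interpreted language to compiled code, the following Python sketch calls sqrt() from the platform's compiled C maths library. It uses the standard-library ctypes module, which postdates this article and stands in here for the C interface and SWIG routes named above. The helper name load_c_sqrt is hypothetical, and the sketch assumes the C maths library can be located on the host; where it cannot, the helper simply returns None.

```python
import ctypes
import ctypes.util
import math


def load_c_sqrt():
    """Return the C library's sqrt() as a Python callable, or None if unavailable.

    Hypothetical helper for illustration; not part of any Bio* toolkit.
    """
    # Ask the platform where its C maths library ("m") lives.
    path = ctypes.util.find_library("m")
    if path is None:
        return None
    try:
        libm = ctypes.CDLL(path)
    except OSError:
        return None
    # Declare the C signature: double sqrt(double).
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double
    return libm.sqrt


c_sqrt = load_c_sqrt()
if c_sqrt is not None:
    # Compare the compiled routine with Python's own math.sqrt().
    print(c_sqrt(2.0), math.sqrt(2.0))
```

The point of the sketch is only that the calling code stays in the interpreted language while the arithmetic runs in compiled code; a real extension module would typically wrap far more substantial routines.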
[Table values displaced here by extraction; their row and column assignment is not recoverable: 9,440,000; 23,300,000; 3,590,000; 1,630,000; 2,370,000; 5,730,000; August 1996; September 1999; September 1999.]
As well, many programming features or niceties that you would have to program yourself in C are provided for you, such as associative arrays, numeric interconversions, easy input/output handling, string manipulations and large numbers of oft-used programming expressions. All have very good support for network functionality, and all support regular expression (regex) pattern matching [8], although regex support is integrated throughout Perl's structure but must be explicitly requested in Java and Python.
All provide extensive libraries to connect to many relational databases. Perl's database-independent module (DBI) allows nearly identical access to most relational database management systems (RDBMSs). Python has a similar approach, using database-specific drivers that present an identical application programming interface (API) to the programmer, and while it is less well developed than Perl's, it supports most of the popular commercial and open source RDBMSs. Java DataBase Connectivity (JDBC) is now a standard part of the language and provides functionality and database support nearly identical to those of the ODBC drivers that Windows uses to provide RDBMS connectivity.
All three languages are multiplatform: they run similarly on the most current versions of Unix, Linux, Windows and the Mac. In addition, Perl and Python qualify for the open source definition: they are freely available in source code and, while hundreds contribute to their continued development, for each a single person wrote the first few implementations and remains the lead technical godfather of the project. Java, while made freely available, is owned and defined by Sun Microsystems, whose technical committees decide what goes into Java and when.
Java and Python have good support for creating graphical user interfaces (GUIs).
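The driver-based database access described above for Python can be sketched as follows. This is a minimal illustration, assuming the standard-library sqlite3 driver and an in-memory database; the table name seqs, its columns and the function sequences_longer_than are hypothetical and chosen only for the example. The connect/cursor/execute/fetchall calls are the part that drivers following Python's common database API share; the "?" placeholder style is specific to sqlite3 and may differ in other drivers.

```python
import sqlite3


def sequences_longer_than(conn, min_length):
    """Return (name, length) rows for sequences of at least min_length residues.

    `conn` may be a connection from any driver exposing the same cursor API;
    only the placeholder style in the SQL may need adjusting.
    """
    cur = conn.cursor()
    # "?" is sqlite3's placeholder style; other drivers may use "%s" or ":name".
    cur.execute(
        "SELECT name, LENGTH(residues) FROM seqs "
        "WHERE LENGTH(residues) >= ? ORDER BY name",
        (min_length,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


# Hypothetical example data in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
setup = conn.cursor()
setup.execute("CREATE TABLE seqs (name TEXT PRIMARY KEY, residues TEXT)")
setup.executemany(
    "INSERT INTO seqs VALUES (?, ?)",
    [("seq_a", "ATGGCC"), ("seq_b", "ATG"), ("seq_c", "ATGGCCTTAA")],
)
conn.commit()
setup.close()

long_seqs = sequences_longer_than(conn, 6)
print(long_seqs)
```

Because the query function only touches the connection and cursor objects, switching to another RDBMS would in principle mean changing the connect call and, where necessary, the placeholder style.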
Java uses its native Swing libraries; Python uses a variety of multiplatform widget sets including Tk [9] (bundled with Python), wxWindows [10] and Qt [11]. While Perl can be used to create GUIs (most easily with Tk), GUI creation is not nearly as well supported as it is in Java or Python, which is a failing.
Perl does, however, have a non-trivial advantage over Python and Java in that it can be automatically upgraded and enhanced using the CPAN module (for Comprehensive Perl Archive Network), included in the default installation. This allows a user to request an additional module and have it retrieved and checked for dependencies, have those dependencies resolved automatically, and have the entire tree of dependencies downloaded, tested and installed, all in a single line of code. All this is available without requiring the user to know where the files are archived. For example, to install the BioPerl module and all of its documentation (once the easily configured CPAN module has been set up), this is all you need to do:
    $ perl -MCPAN -e 'install Bio::Perl'
The BioPerl installation will prompt you about additional Perl libraries it needs for some methods; m (...truncated)


This is a preview of a remote PDF: https://bib.oxfordjournals.org/content/3/3/296.full.pdf
Article home page: http://bib.oxfordjournals.org/content/3/3/296.abstract

Harry Mangalam. The Bio* toolkits — a brief overview, Briefings in Bioinformatics, 2002, pp. 296-302, 3/3, DOI: 10.1093/bib/3.3.296