The Bio* toolkits — a brief overview
Bioinformatics research is often difficult to do with commercial software. The Open Source BioPerl, BioPython and BioJava projects provide toolkits with a range of functionality that makes it easier to create customised pipelines or analyses. This review briefly compares the quirks of the underlying languages and the functionality, documentation, utility and relative advantages of the Bio* counterparts, particularly from the point of view of the beginning biologist programmer.
This article is directed to the beginning
bioinformaticist or biologist thinking of
learning a programming language to help
with their work. If you are familiar with
Perl, Python or Java, your decision is
probably already made, based on your
current preferred language. However, if
you have not already passed that
developmental checkpoint, this overview
may help you decide which one to
pursue.
Bioinformatics is a young science and
while there are a number of commercial
applications aimed at researchers in
biology, these are often not sufficient for
the level of data analysis required in
bioinformatics research. It was partly the
frustration with commercial suites that
drove the founding of the Bio* groups.
(The Bio* name uses the regular
expression operator * to denote all
characters, shorthand for BioPerl, BioJava,
BioPython, etc.)
The Bio* group (formally the Open
Bioinformatics Foundation1) was formed
by a group of self-described Perl hackers
who got together in 1995 to pool
resources for writing bioinformatics
software. The group saw that there was
much fine-grained functionality that was
extremely useful and if the program
source code could be shared, it could be
easily worked into functional programs.
The same idea gave birth to the
BioPython and BioJava groups in 1999
and the BioCORBA and BioDAS groups have
been added since. The BioRuby,2
BioLisp3 and Bioinformatics.org4 groups
share a similar vision and are worth
investigating for useful perspective and
resources, but are officially unaffiliated.
Before you dive into a long-term
commitment to a language, and its
Bio* derivative, it is useful to see how it is
perceived by the various stakeholders.
Table 1 shows a quick and simple survey
based on scanning the Usenet
newsgroups, Google and Amazon.com as
a crude measure of each language's popularity
and support.
Over the last 10 months, the
membership of each group has grown by
about 50 per cent. By far the largest
number of posts is in the BioPerl group;
the BioJava group is gaining steam, and
the BioPython group tends to be quite a
bit lower, reflecting its lower membership
(see Table 2).
All the Bio* projects described here use
the eponymous base language and endow
it with Bio features via additional
modules or libraries. A brief overview of
the base languages follows.
[Table 1 and Table 2 fragments: Language, Amazon, Perl, Java; BioPerl-l, BioJava-l, BioPython; 28th August 2001, 21st June 2002, Total posts]
Perl, Python and Java are all
interpreted, which means that they are
slower than a compiled language such as
C or C++ (typically three-quarters to
one-tenth as fast, depending on the type
of logic being implemented), but they are
hardly slouches. If speed is an issue, all of
them can be made to link to compiled
libraries via the Java Native Interface,
Python's C interface and Perl's XS
routines. The latter two can also make use
of SWIG,5 a more portable way of
interfacing polyglot code. Two examples
of how an interpreted language can be
used in high-performance computing are
PyMol,6 a molecular visualisation and
modelling application, and the Perl Data
Language7 which uses libraries of
compiled code and an object oriented
(OO) approach to allow very fast
computation on N-dimensional arrays.
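As a minimal sketch of the linking idea described above, the following Python fragment calls a function that lives in the compiled C standard library. It uses the standard-library ctypes module, which is a choice made for this example and is not the C extension interface or SWIG named in the text. It also assumes a Unix-like system on which the C library can be found; the helper name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util

# Assumption: a Unix-like system. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the running process,
# which include the C standard library on such systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(sequence: bytes) -> int:
    """Return the length of a byte string as computed by C's strlen."""
    return libc.strlen(sequence)


if __name__ == "__main__":
    print(c_strlen(b"GATTACA"))
```

The interpreted code only marshals the arguments; the counting itself runs as compiled code, which is the division of labour the paragraph above describes.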
[Table 1 and Table 2 fragments: 9,440,000; 23,300,000; 3,590,000; 1,630,000; 2,370,000; 5,730,000; August 1996; September 1999; September 1999]
In exchange for giving up the speed of
compiled code, all these languages are
considerably easier to program in.
Mercifully, none of them requires that
you manually track and manipulate
memory allocation and none requires (or
even permits) use of the much-hated
memory pointer. As well, many
programming features or niceties that in
C you have to program yourself are
provided for you, such as associative
arrays, numeric interconversions, easy
input/output handling, string
manipulations and large numbers of
oft-used programming expressions.
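To make the conveniences listed above concrete, here is a small Python sketch that uses an associative array (a dict), built-in string handling and numeric conversion to count bases in a sequence. The function names and the example sequence are hypothetical and chosen only for illustration.

```python
def count_bases(sequence: str) -> dict:
    """Count each character of a sequence using an associative array."""
    counts = {}
    for base in sequence.upper():
        # dict.get supplies a default of 0 for a base not seen before
        counts[base] = counts.get(base, 0) + 1
    return counts


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C characters (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    counts = count_bases(sequence)
    return (counts.get("G", 0) + counts.get("C", 0)) / len(sequence)


if __name__ == "__main__":
    seq = "gattaca"
    print(count_bases(seq))
    print("GC fraction:", round(gc_fraction(seq), 3))
    # Numeric interconversion: text read from a file becomes a number
    print(int("42") + float("0.5"))
```

None of this requires allocating or freeing memory by hand; the dict grows as new keys appear.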
All have very good support for
network functionality and all support
regular expression (regex) pattern
matching,8 although regex support is
integrated throughout Perl's syntax but
must be explicitly imported as a library in
Java and Python. All provide extensive
libraries to
connect to many relational databases.
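The remark about regex support having to be requested explicitly can be sketched in Python as follows: the re module must be imported before any pattern is used. The example searches for the EcoRI recognition site GAATTC; the function name find_sites and the test sequence are hypothetical.

```python
import re

# Compile once and reuse; GAATTC is the EcoRI recognition site
ECORI_SITE = re.compile(r"GAATTC")


def find_sites(sequence: str, pattern=ECORI_SITE) -> list:
    """Return the 0-based start positions of every non-overlapping match."""
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_sites("ttGAATTCaaagaattcgg"))
```

In Perl the same search would need no import, because the match operator is part of the language itself.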
Perl's database-independent interface (DBI) module
allows nearly identical access to most
relational database management systems
(RDBMSs). Python has a similar
approach, using database-specific drivers
that present an identical application
programming interface (API) to the
programmer, and while it is less well
developed than Perl's, it supports most of
the popular commercial and open source
RDBMSs. Java DataBase
Connectivity (JDBC) is now a standard
part of the language and provides
functionality and database support nearly
identical to those of the ODBC drivers
that Windows uses to provide RDBMS
connectivity.
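The database-specific-driver approach described above for Python can be illustrated with a short sketch. It assumes the sqlite3 driver from the Python standard library, which the text does not mention, and an in-memory database; the table name seqs, its columns and the function load_and_query are hypothetical.

```python
import sqlite3


def load_and_query(records, min_length):
    """Store (name, residues) pairs in an in-memory database and return
    the names of sequences at least min_length long, sorted by name."""
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE seqs (name TEXT PRIMARY KEY, residues TEXT)")
        # Parameter placeholders keep the data separate from the SQL text
        cur.executemany("INSERT INTO seqs VALUES (?, ?)", records)
        conn.commit()
        cur.execute(
            "SELECT name FROM seqs WHERE length(residues) >= ? ORDER BY name",
            (min_length,),
        )
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    data = [("seqA", "GATTACA"), ("seqB", "ACGT"), ("seqC", "ACGTACGTAC")]
    print(load_and_query(data, 7))
```

The connect, cursor, execute and fetchall calls follow the common Python database API, so moving to another RDBMS should mostly mean changing the driver import and the connection arguments. The ? placeholder style is specific to some drivers, and others use a different marker.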
All three languages are multiplatform:
they run similarly on the most current
versions of Unix, Linux, Windows and
the Mac. In addition, Perl and Python
qualify for the open source definition:
they are freely available in source code
and while hundreds contribute to their
continued development, a single person
wrote the first few implementations and
remains the lead technical Godfather of
the project. Java, while made freely
available, is owned and defined by Sun
Microsystems, whose technical
committees decide what goes into Java
and when.
Java and Python have good support for
creating graphical user interfaces (GUIs).
Java uses its native Swing libraries; Python
uses a variety of multiplatform widget
sets including Tk9 (bundled with Python),
wxWindows10 and Qt.11 While Perl can
be used to create GUIs (most easily with
Tk), GUI development is not nearly as well
supported as it is in Java or Python.
Perl does, however, have a non-trivial
advantage over Python and Java in that it
can be automatically upgraded and
enhanced using the CPAN module (for
Comprehensive Perl Archive Network),
included in the default installation. This
allows a user to request an additional
module to be retrieved, checked for
dependencies, have those dependencies
resolved automatically, and the entire tree
of dependencies automatically
downloaded, tested and installed, all in a
single line of code. All this is available
without requiring the user to know where
the files are archived. For example, to
install the BioPerl module and all of its
documentation (once the CPAN module
has been configured, which is easy), this
is all you need to do:
$ perl -MCPAN -e 'install Bio::Perl'
The BioPerl installation will prompt you
about additional Perl libraries it needs for
some methods; m (...truncated)