Fast parsers for Entrez Gene
Mingyi Liu and Andrei Grigoriev
GPC Biotech AG, Fraunhoferstrasse 20, 82152 Martinsried,
Germany
Summary: NCBI completed the transition of its main genome annotation database from Locuslink to Entrez Gene in Spring 2005. However, to date few parsers exist for the Entrez Gene annotation file. Owing to the widespread use of Locuslink and the popularity of the Perl programming language in bioinformatics, a publicly available high-performance Entrez Gene parser in Perl is urgently needed. We present four such parsers that were developed using several parsing approaches (Parse::RecDescent, Parse::Yapp, Perl-byacc and Perl 5 regular expressions) and provide the first in-depth comparison of these sophisticated Perl tools. Our fastest parser processes the entire human Entrez Gene annotation file in under 12 min on one Intel Xeon 2.4 GHz CPU and can be of help to the bioinformatics community during and after the transition from Locuslink to Entrez Gene.
Availability: Source code is available under the Perl and GNU public license at http://sourceforge.net/projects/egparser/
The Author 2005. Published by Oxford University Press. All rights reserved.
INTRODUCTION
The National Center for Biotechnology Information (NCBI)
completed the transition from Locuslink (Pruitt and Maglott, 2001) to
Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene)
(Maglott et al., 2005) in Spring 2005. Thus, there is an urgent
need for parsers for the ASN.1-formatted
(http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html) Entrez Gene annotation file.
However, despite the immense popularity of the Perl programming
language among bioinformatics researchers, there is currently no
publicly available Perl parser for either the Entrez Gene annotation file or
ASN.1 text files in general. The NCBI Entrez Gene parser has a rather
steep learning curve and is available only in the C/C++-based
toolbox (http://ncbi.nih.gov/IEB/ToolBox/index.cgi/). The latest
gene2xml tool from NCBI gives Perl users an indirect way to
process Entrez Gene data, because it can convert the binary ASN.1-formatted
Entrez Gene files to XML. However, the Entrez Gene XML
format (http://www.ncbi.nlm.nih.gov/dtd/) is rather complex and
difficult to use. Its storage and processing also consume significant
computational resources. In contrast, an object-oriented Perl parser
of the ASN.1-formatted Entrez Gene files would be efficient, easy to
use and could interface well with the vast number of public
bioinformatics tools, e.g. Bioperl (http://bioperl.org/) (Stajich et al., 2002)
and EnsEMBL (http://www.ensembl.org) (Birney et al., 2004), and
any in-house tools developed in Perl. Compared with C/C++-based
tools, pure Perl parsers run on every operating system for which a
Perl interpreter is available and require no porting effort. We also
placed a very strong emphasis on performance optimizations when
creating our Perl parsers using four different approaches, which
resulted in significantly better performance than XML parsers on
XML-formatted Entrez Gene files. We describe and compare the
characteristics and performance of our parsers here.
To retrieve information from Entrez Gene, we need parsers that can
build an easy-to-use data structure from an Entrez Gene record,
from which a calling program can retrieve the specific data item(s)
of interest. Tools that parse text using a context-free grammar, or
that simulate this process, are well suited to the task.
We considered four Perl tools that provide powerful text processing
capabilities ranging from parsing complex and arbitrary data files to
performing natural language processing. We present here not only the
outlines of our parsers using these tools, but also a comparison of the
suitability of each of these tools for practical bioinformatics projects.
Usage and availability of our parsers
Three of the four Entrez Gene parsers we created
utilize context-free grammars. We used an LL-grammar with
Parse::RecDescent (http://search.cpan.org/dist/Parse-RecDescent/),
and specified the same LR-grammar and very similar lexer functions
for Parse::Yapp (http://search.cpan.org/fdesar/Parse-Yapp-1.05/)
and Perl-byacc (http://www.cpan.org/src/misc/). The fourth parser,
based on Perl regular expressions (regex, http://www.perl.com/doc/manual/html/pod/perlre.html),
was implemented using recursive function calls.
During parsing, the parsers will immediately abort when any
offending element is encountered, effectively guaranteeing the accuracy
of the results. The regex-based parser also provides validation and
error reporting capabilities. All four parsers are object-oriented Perl
modules with an instantiation function and a parse function that has an
option to trim the generated data structure. Programs that use any of
our parsers simply need to include the module, instantiate a parser
object and pass an Entrez Gene record into the parse function of
the object, which then returns the data structure generated. The
details of the grammars we used, the parsers and a sample Perl
program that tests them are available from the SourceForge web site
(http://sourceforge.net/projects/egparser/).
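The regex-plus-recursion approach described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the authors' code: the names `parse_record`, `parse_block` and `ParseError`, the token pattern, and the toy record are all assumptions made for illustration, and only a small subset of ASN.1 text syntax (braces, commas, quoted strings, integers and identifiers) is handled. It shows two ideas from the text: a regular expression splits the record into tokens, and mutually recursive functions build a nested data structure, aborting on the first offending element.

```python
import re

class ParseError(ValueError):
    """Raised as soon as the input deviates from the expected syntax."""

HEADER = re.compile(r"\s*([A-Za-z][\w-]*)\s*::=")
TOKEN = re.compile(r'''\s*(?:
      (?P<open>\{)
    | (?P<close>\})
    | (?P<comma>,)
    | "(?P<string>(?:[^"]|"")*)"      # embedded quotes are written doubled
    | (?P<number>-?\d+)
    | (?P<name>[A-Za-z][\w-]*)
    )''', re.VERBOSE)

def tokenize(text):
    """Split the record body into (kind, text) tokens; abort on anything else."""
    tokens, pos, end = [], 0, len(text.rstrip())
    while pos < end:
        m = TOKEN.match(text, pos)
        if m is None:
            bad = text[pos:].lstrip()[:1]
            raise ParseError(f"unexpected character {bad!r} near offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def peek(tokens, i):
    return tokens[i] if i < len(tokens) else ("eof", None)

def parse_value(tokens, i):
    """Parse one value starting at token i; return (value, next index)."""
    kind, text = peek(tokens, i)
    if kind == "open":
        return parse_block(tokens, i + 1)      # recursion into a nested block
    if kind == "string":
        return text.replace('""', '"'), i + 1
    if kind == "number":
        return int(text), i + 1
    if kind == "name":
        return text, i + 1
    raise ParseError(f"expected a value at token {i}, found {kind}")

def parse_block(tokens, i):
    """Parse the inside of '{ ... }': a dict of named fields or a list of values."""
    fields, items = {}, []
    while peek(tokens, i)[0] != "close":
        kind, text = peek(tokens, i)
        if kind == "name" and peek(tokens, i + 1)[0] not in ("comma", "close", "eof"):
            if text in fields:
                raise ParseError(f"duplicate field {text!r}")
            value, i = parse_value(tokens, i + 1)
            fields[text] = value
        else:
            value, i = parse_value(tokens, i)
            items.append(value)
        nxt = peek(tokens, i)[0]
        if nxt == "comma":
            i += 1
        elif nxt != "close":
            raise ParseError(f"expected ',' or '}}' at token {i}, found {nxt}")
    if fields and items:
        raise ParseError("block mixes named fields and bare values")
    return (fields or items), i + 1

def parse_record(text):
    """Parse 'TypeName ::= { ... }' into {TypeName: nested structure}."""
    m = HEADER.match(text)
    if m is None:
        raise ParseError("missing 'TypeName ::=' header")
    tokens = tokenize(text[m.end():])
    if peek(tokens, 0)[0] != "open":
        raise ParseError("record body must start with '{'")
    value, i = parse_block(tokens, 1)
    if i != len(tokens):
        raise ParseError("trailing input after the record")
    return {m.group(1): value}

# Toy, hand-written record in Entrez Gene style (not real NCBI data).
record = '''Entrezgene ::= {
  track-info { geneid 1, status live },
  type protein-coding,
  gene { locus "TOY1", syn { "TOY1A", "TOY1B" } }
}'''
parsed = parse_record(record)
print(parsed["Entrezgene"]["gene"]["locus"])
```

A calling program then retrieves items by walking the nested dictionaries and lists, much as the text describes for the data structure returned by the parse function. The sketch tokenizes the whole record before parsing for clarity; a performance-oriented implementation would likely avoid building the intermediate token list.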
Performance comparison
The speeds of the four parsers are all acceptable when parsing
small- to moderate-sized Entrez Gene records. The parsers created with
Parse::Yapp, Perl-byacc and regex exhibited O(N) behavior, where
N is the record size, while the Parse::RecDescent-based parser was
approximately O(N^3) in time, based on curve fitting. In fact, it takes the
Parse::RecDescent-based parser nearly 20 min to parse a long Entrez
Gene record (Entrez Gene ID 4539, 846 KB) on one Intel Xeon 2.4
GHz CPU. In sharp contrast, the same entry takes only 0.51 s to be
parsed using our regex-based parser, and about 2.8 and 5.2 s using
the Parse::Yapp and Perl-byacc based parsers, respectively. In all, it
took only 11.5 min for our regex-based parser to process the entire
human Entrez Gene file (145 466 records) on one Intel Xeon 2.4 GHz
CPU. The mouse and rat genomes took 9 and 3.5 min, respectively.
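To put these timings in proportion, the short Python calculation below derives relative slowdowns and throughput from the figures quoted above. It is only arithmetic on the reported numbers, not a new measurement: the variable names are illustrative, and "nearly 20 min" is rounded up to 1200 s, so the Parse::RecDescent ratio is an upper estimate. On the single 846 KB record, the Parse::Yapp and Perl-byacc parsers come out about 5.5 and 10 times slower than the regex parser, and Parse::RecDescent roughly 2350 times slower. Over the whole human file the regex parser averages about 211 records per second.

```python
# Timings quoted in the text for Entrez Gene ID 4539 (846 KB), in seconds.
# "Nearly 20 min" is rounded up to 1200 s, so that ratio is an upper estimate.
RECORD_KB = 846
seconds = {
    "regex": 0.51,
    "Parse::Yapp": 2.8,
    "Perl-byacc": 5.2,
    "Parse::RecDescent": 20 * 60,
}

# Time relative to the regex parser, and throughput on this one record.
slowdown = {name: t / seconds["regex"] for name, t in seconds.items()}
throughput_kb_s = {name: RECORD_KB / t for name, t in seconds.items()}

# Whole human file: 145 466 records in 11.5 min with the regex parser.
records_per_second = 145466 / (11.5 * 60)

for name in seconds:
    print(f"{name:18s} {slowdown[name]:8.1f}x  {throughput_kb_s[name]:8.1f} KB/s")
print(f"regex parser, human file: {records_per_second:.0f} records/s")
```

Because these ratios come from a single large record, they should not be read as constant factors: with roughly cubic scaling, the gap between Parse::RecDescent and the linear-time parsers narrows on small records and widens on larger ones.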
Feature comparison of the Perl tools
With advances in modern hardware and clusters, software
performance is not necessarily the primary concern for researchers.
Frequently the evaluation of software tools is influenced heavily by
their ease of use, flexibility and debugging capability, among other
aspects. With the experience gained from using these modules, we
provide a short evaluation of each tool below.
Parse::RecDescent This module is the most convenient to use: it needs
no lexer function, is very easy to debug and optimize, and provides
superior flexibility (it allows parameter passing among rules, regex
terminals, changing the grammar at runtime and context-sensitive
grammars, to name just a few features). However, it demands more
optimization of the grammar and, even so, still performs terribly
when parsing large input strings.
Parse::Yapp and Perl-byacc These two tool (...truncated)