Fast parsers for Entrez Gene

PDF: https://bioinformatics.oxfordjournals.org/content/21/14/3189.full.pdf

Mingyi Liu and Andrei Grigoriev
GPC Biotech AG, Fraunhoferstrasse 20, 82152 Martinsried, Germany

Summary: NCBI completed the transition of its main genome annotation database from Locuslink to Entrez Gene in Spring 2005. However, to date few parsers exist for the Entrez Gene annotation file. Owing to the widespread use of Locuslink and the popularity of the Perl programming language in bioinformatics, a publicly available high-performance Entrez Gene parser in Perl is urgently needed. We present four such parsers that were developed using several parsing approaches (Parse::RecDescent, Parse::Yapp, Perl-byacc and Perl 5 regular expressions) and provide the first in-depth comparison of these sophisticated Perl tools. Our fastest parser processes the entire human Entrez Gene annotation file in under 12 min on one Intel Xeon 2.4 GHz CPU and can be of help to the bioinformatics community during and after the transition from Locuslink to Entrez Gene.

Availability: Source code is available under the Perl and GNU public license at http://sourceforge.net/projects/egparser/

Contact:

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email:

INTRODUCTION

The National Center for Biotechnology Information (NCBI) completed the transition from Locuslink (Pruitt and Maglott, 2001) to Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) (Maglott et al., 2005) in Spring 2005. Thus, there is an urgent need for parsers for the ASN.1-formatted (http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html) Entrez Gene annotation file. However, despite the immense popularity of the Perl programming language among bioinformatics researchers, there is currently no publicly available Perl parser for either the Entrez Gene annotation file or ASN.1 text files in general.
The NCBI Entrez Gene parser has a rather steep learning curve and is available only in the C/C++-based toolbox (http://ncbi.nih.gov/IEB/ToolBox/index.cgi/). The very latest gene2xml tool from NCBI gives Perl users an indirect way to process Entrez Gene data, as it can convert the binary ASN-formatted Entrez Gene files to XML format. However, the Entrez Gene XML format (http://www.ncbi.nlm.nih.gov/dtd/) is rather complex and difficult to use, and storing and processing it consumes significant computational resources. In contrast, an object-oriented Perl parser of the ASN-formatted Entrez Gene files would be efficient and easy to use, and could interface well with the vast number of public bioinformatics tools, e.g. Bioperl (http://bioperl.org/) (Stajich et al., 2002) and EnsEMBL (http://www.ensembl.org) (Birney et al., 2004), and with any in-house tools developed in Perl. Compared with C/C++-based tools, pure Perl parsers run on every operating system for which Perl is available and require no porting effort. We also placed a very strong emphasis on performance optimization when creating our Perl parsers using four different approaches, which resulted in significantly better performance than XML parsers achieve on XML-formatted Entrez Gene files. We describe and compare the characteristics and performance of our parsers here.

To retrieve information from Entrez Gene, we need parsers that can build an easy-to-use data structure from an Entrez Gene record, from which a calling program can retrieve the specific data item(s) of interest. Tools that can parse text using a context-free grammar, or simulate that process, are well suited to this task. We considered four Perl tools that provide powerful text processing capabilities, ranging from parsing complex and arbitrary data files to performing natural language processing.
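
To make the goal above concrete, the following is a minimal sketch of turning a nested, brace-delimited text record into a nested data structure by recursive descent over regex-matched tokens. It is written in Python purely for illustration and is not the authors' Perl code. The record is a hand-written toy in the general style of ASN.1 text, not real Entrez Gene data, and the names `tokenize`, `parse_value`, `parse_block` and `parse_record` are hypothetical.

```python
import re

# One token: a quoted string, a brace or comma, or a bare word (name, number, enum).
TOKEN = re.compile(
    r'\s*(?:(?P<str>"(?:[^"]|"")*")|(?P<punct>[{},])|(?P<word>[^\s{},"]+))'
)


def tokenize(text):
    """Split the record into (kind, text) tokens, ending with an 'eof' sentinel."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("eof", ""))
    return tokens


def parse_value(tokens, i):
    """Parse one value starting at index i; return (value, next index)."""
    kind, tok = tokens[i]
    if (kind, tok) == ("punct", "{"):
        return parse_block(tokens, i + 1)
    if kind == "str":
        return tok[1:-1].replace('""', '"'), i + 1
    if kind == "word":
        return (int(tok) if tok.isdigit() else tok), i + 1
    raise ValueError(f"unexpected token {tok!r} at position {i}")


def parse_block(tokens, i):
    """Parse comma-separated items up to the matching '}' (recursive)."""
    if tokens[i] == ("punct", "}"):
        return [], i + 1
    named, anonymous = {}, []
    while True:
        kind, tok = tokens[i]
        # A word not directly followed by ',' or '}' is a field name.
        if kind == "word" and tokens[i + 1] not in (("punct", ","), ("punct", "}")):
            named[tok], i = parse_value(tokens, i + 1)
        else:
            value, i = parse_value(tokens, i)
            anonymous.append(value)
        sep = tokens[i]
        i += 1
        if sep == ("punct", "}"):
            return (named if named else anonymous), i
        if sep != ("punct", ","):
            raise ValueError(f"expected ',' or '}}' but found {sep[1]!r}")


def parse_record(text):
    """Parse 'TypeName ::= { ... }' and abort on any malformed element."""
    tokens = tokenize(text)
    if tokens[0][0] != "word" or tokens[1] != ("word", "::="):
        raise ValueError("expected a 'TypeName ::=' header")
    value, i = parse_value(tokens, 2)
    if tokens[i][0] != "eof":
        raise ValueError("unexpected trailing input")
    return value


# Hand-written toy record (an assumption for illustration, not real data).
RECORD = '''
Entrezgene ::= {
  track-info {
    geneid 1,
    status live
  },
  type protein-coding,
  gene {
    locus "EXAMPLE1",
    syn { "EX1", "EXA" }
  }
}
'''

record = parse_record(RECORD)
print(record["gene"]["locus"])
```

A calling program can then pick out individual items with ordinary dictionary and list access, for example `record["track-info"]["geneid"]`. Raising an error at the first unexpected token mirrors the abort-on-error behaviour described for the parsers, although this sketch makes no attempt to cover the full ASN.1 text syntax.
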
We present here not only the outlines of our parsers using these tools, but also a comparison of the suitability of each of these tools for practical bioinformatics projects.

Usage and availability of our parsers

Three of the four Entrez Gene parsers we created utilize context-free grammars. We used an LL-grammar with Parse::RecDescent (http://search.cpan.org/dist/Parse-RecDescent/), and specified the same LR-grammar and very similar lexer functions for Parse::Yapp (http://search.cpan.org/fdesar/Parse-Yapp-1.05/) and Perl-byacc (http://www.cpan.org/src/misc/). The regular expression (regex, http://www.perl.com/doc/manual/html/pod/perlre.html) based parser was implemented using recursive function calls. During parsing, the parsers abort immediately when any offending element is encountered, effectively guaranteeing the accuracy of the results. The regex-based parser also provides validation and error reporting capabilities. All four parsers are object-oriented Perl modules with an instantiation function and a parse function that has an option to trim the generated data structure. Programs that use any of our parsers simply need to include the module, instantiate a parser object and pass an Entrez Gene record into the parse function of the object, which then returns the generated data structure. The details of the grammars we used, the parsers and a sample Perl program testing them are available from the sourceforge web site (http://sourceforge.net/projects/egparser/).

Performance comparison

The speeds of the four parsers are all acceptable when parsing small-to-moderate-sized Entrez Gene records. The parsers created with Parse::Yapp, Perl-byacc and regex exhibited O(N) behavior, where N is the record size, while the Parse::RecDescent-based parser was about O(N^3) in time based on curve fitting. In fact, it takes the Parse::RecDescent-based parser nearly 20 min to parse a long Entrez Gene record (Entrez Gene ID 4539, 846 KB) on one Intel Xeon 2.4 GHz CPU.
In sharp contrast, the same entry takes only 0.51 s to parse using our regex-based parser, and about 2.8 and 5.2 s using the Parse::Yapp and Perl-byacc based parsers, respectively. In all, it took only 11.5 min for our regex-based parser to process the entire human Entrez Gene file (145 466 records) on one Intel Xeon 2.4 GHz CPU. The mouse and rat genomes took 9 and 3.5 min, respectively.

Feature comparison of the Perl tools

With advances in modern hardware and clusters, software performance is not necessarily the primary concern for researchers. Frequently the evaluation of software tools is influenced heavily by their ease of use, flexibility and debugging capability, among other aspects. With the experience gained from using these modules, we provide a short evaluation of each tool below.

Parse::RecDescent

This module is the most convenient to use: there is no need to supply a lexer function, it is very easy to debug and optimize, and it provides superior flexibility (it allows parameter passing among rules, regex terminals, changing the grammar during runtime and context-sensitive grammars, to name just a few). However, it demands more optimization of the grammar, and even so, it still performs very poorly when parsing large input strings.

Parse::Yapp and Perl-byacc

These two tool (...truncated)