FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences
Waldmann et al. BMC Research Notes 2014, 7:365
http://www.biomedcentral.com/1756-0500/7/365
TECHNICAL NOTE
Open Access
Jost Waldmann1,2†, Jan Gerken1,2†, Wolfgang Hankeln3†, Timmy Schweer1 and Frank Oliver Glöckner1,2*
Abstract
Background: Advances in sequencing technologies challenge the efficient importing and validation of FASTA
formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis
of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are
limited.
Findings: FastaValidator represents a platform-independent, standardized, light-weight software library written in
the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to
parse large amounts of sequence data quickly and accurately. For end-users, FastaValidator includes an interactive
out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput
validation in software pipelines.
Conclusions: The accuracy and performance of the FastaValidator library qualify it for large data sets such as those
commonly produced by massively parallel (NGS) technologies. It offers scientists a fast, accurate and standardized
method for parsing and validating FASTA formatted sequence data.
Keywords: FASTA, Data validation, High-throughput
*Correspondence:
† Equal contributors
1 Microbial Genomics and Bioinformatics Research Group, Max Planck Institute
for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany
2 Jacobs University Bremen gGmbH, Campusring 1, 28759 Bremen, Germany
Full list of author information is available at the end of the article
Findings
Background
The introduction of the first DNA sequencing methods
[1] established the discipline of bioinformatics with
sequences as the primary source of data. With the advent
of massively parallel “Next Generation Sequencing (NGS)”
technologies [2], the speed of sequence production has
now reached petabytes per year. The FASTA format
was introduced alongside the first algorithms and
tools for biological sequence analysis [3,4]. It defines
how sequences are formatted and exchanged in a simple human-readable layout. Today, the FASTA format is
the de facto standard to exchange sequence data between
bioinformatic tools. Several common frameworks exist
that offer FASTA sequence import and validation [5]. Concerning their functionality, many of these frameworks are
rather complex and not designed for high-volume FASTA
parsing and validation. Another common approach is the
implementation of custom solutions. Often these have
problems recognizing system-specific line endings (Unix,
Microsoft, Apple), invalid characters, or even semantically incorrect data. This can cause serious problems in
data processing, up to and including invalid results. Furthermore, the
focus of bioinformatics has shifted towards (web-based)
pipelines that perform a range of consecutive tasks to analyze sequence data. Therefore, easy integration of FASTA
import and validation functionality into larger software
pipelines or workflows is becoming a common request. To
address issues of parsing, validation, integration, scalability and performance, we present the light-weight, open-source FastaValidator library written in Java, which parses
and validates sequences in FASTA format. The implementation in the platform-independent Java programming
language supports broad usage and easy integration into
bioinformatic software and pipelines. The performance of
the library in comparison to state-of-the-art frameworks
has been evaluated and the ease of integration into web
projects has been demonstrated.
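The pitfalls named above for custom solutions are easy to reproduce. The following minimal sketch is not part of FastaValidator; the class name `LineEndingSketch`, its methods and the chosen alphabet (IUPAC nucleotide codes plus dash and dot) are assumptions made for illustration. It splits input on all three line-ending conventions and reports the position of the first character outside the assumed alphabet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of FastaValidator. */
final class LineEndingSketch {

    // Assumed alphabet for this sketch: IUPAC nucleotide codes plus '-' and '.'.
    private static final String DNA = "ACGTRYSWKMBDHVN-.";

    /** Splits text on Microsoft (\r\n), classic Apple (\r) and Unix (\n) line endings. */
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<String>();
        // "\r\n" is listed first so that it is consumed as a single line break.
        for (String line : text.split("\r\n|\r|\n")) {
            lines.add(line);
        }
        return lines;
    }

    /** Returns the index of the first character outside the assumed alphabet, or -1. */
    static int firstInvalid(String sequenceLine) {
        for (int i = 0; i < sequenceLine.length(); i++) {
            char c = Character.toUpperCase(sequenceLine.charAt(i));
            if (DNA.indexOf(c) < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // One record with three different line endings and one invalid character.
        String mixed = ">seq1\r\nACGT\rNNAC\nAC!T";
        for (String line : splitLines(mixed)) {
            if (!line.startsWith(">")) {
                System.out.println(line + " -> " + firstInvalid(line));
            }
        }
    }
}
```

A parser that splits on `\n` alone would leave a stray carriage return at the end of Microsoft-style lines and would read classic Apple files as one long line, which is the kind of silent error described above.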
© 2014 Waldmann et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication
waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise
stated.
Implementation
The FastaValidator library implements the IUPAC specifications [6-8] extended by characters necessary to parse
aligned sequences (space, dash, dot, asterisk). Based
on these specifications four parsing modes are implemented: (1) A universal mode that parses and validates
any (multi)FASTA file comprising the nucleotide and
amino acid alphabets. (2) A DNA mode, which parses and
validates only DNA nucleotide sequences. (3) An RNA
mode, which parses and validates only RNA nucleotide
sequences. (4) A Protein mode, which parses and validates only amino acid sequences. To implement the FastaValidator library for high performance, well-established
techniques from compiler construction have been used. A
lexical analyzer (lexer) to parse and syntactically validate
the FASTA format was generated using the JFlex scanner generator. The lexer first transforms all characters of
a given FASTA file into syntactically correct tokens. The
parsing mode defines the allowed characters accepted by
the lexer. In a second step the correct semantic order of
these tokens is validated (e.g. the header must be followed
by a comment or sequence). If a FASTA file contains only
correct tokens in the right order, it is valid. For every token
(end of file (EOF), header, comment or sequence line)
an event is generated and lines can be transformed into
user-defined data structures. To compile FastaValidator from the source code, Java 1.5 or higher, JFlex 1.4.3 or
higher (http://www.jflex.de) and Ant 1.8 or higher (http://ant.apache.org) are required.
Performance tests
Automated evaluation tests were carried out on a standard desktop PC (Intel Core i5, 3 GHz, 16 GB RAM)
running the 64-bit server version of Ubuntu Linux 12.04.
For performance comparison all tests were run with
BioJava 3.0.7 (http://biojava.org), Biopython 1.63 (http://biopython.org) and BioPerl 1.6.9 (http://www.bioperl.org). The underlying test environments were OpenJDK
1.7.0_25 for FastaValidator and BioJava, Python 2.7.3 and
PyPy 2.2.1 for Biopython and Perl 5.14.2 for BioPerl.
Six different data sets were used as input data: (A)
all protein sequences of Escherichia coli K-12 [9], (B)
the complete genome of Escherichia coli K-12 [9], (C)
all protein sequences of the SWISSPROT database as
[Figure: six panels (A-F), each with a time axis in seconds. One recoverable panel title: UniProtKB/Swiss-Prot 2013_12 (amino acid, 541,954 entries, 248 MB).]
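The two-step design described under Implementation (turn each line into a token, then check the order of the tokens) can be sketched in plain Java. This is a hand-written illustration, not the JFlex-generated lexer of FastaValidator. The names `FastaSketch`, `Mode`, `TokenType` and `Handler` are hypothetical, the letter sets are one reading of the IUPAC codes, and the rules for `;` comment lines and blank lines are assumptions of this sketch.

```java
/** Hand-written sketch; FastaValidator itself uses a JFlex-generated lexer. */
final class FastaSketch {

    enum TokenType { HEADER, COMMENT, SEQUENCE, EOF }

    /** Parsing modes; the letter sets are assumptions based on the IUPAC codes. */
    enum Mode {
        DNA("ACGTRYSWKMBDHVN"),
        RNA("ACGURYSWKMBDHVN"),
        PROTEIN("ACDEFGHIKLMNPQRSTVWYBZX"),
        UNIVERSAL("ACGTURYSWKMBDHVN" + "ACDEFGHIKLMNPQRSTVWYBZX");

        final String allowed;

        Mode(String letters) {
            // Alignment characters named in the text: space, dash, dot, asterisk.
            this.allowed = letters + " -.*";
        }
    }

    /** Receives one event per token (hypothetical callback interface). */
    interface Handler {
        void onToken(TokenType type, String line);
    }

    /** Step 1: classify a non-empty line and reject characters the mode does not allow. */
    static TokenType lex(String line, Mode mode, int lineNo) {
        if (line.charAt(0) == '>') return TokenType.HEADER;
        if (line.charAt(0) == ';') return TokenType.COMMENT;
        for (int i = 0; i < line.length(); i++) {
            char c = Character.toUpperCase(line.charAt(i));
            if (mode.allowed.indexOf(c) < 0) {
                throw new IllegalArgumentException(
                        "line " + lineNo + ": invalid character '" + line.charAt(i) + "'");
            }
        }
        return TokenType.SEQUENCE;
    }

    /** Step 2: check that a token may follow the previous one (null = start of input). */
    static void checkOrder(TokenType previous, TokenType current, int lineNo) {
        boolean ok;
        switch (current) {
            case HEADER:
                ok = previous == null || previous == TokenType.SEQUENCE;
                break;
            case COMMENT:
                ok = previous == TokenType.HEADER || previous == TokenType.COMMENT;
                break;
            default:
                ok = previous != null; // sequence data needs a preceding header
        }
        if (!ok) {
            throw new IllegalArgumentException(
                    "line " + lineNo + ": " + current + " may not follow " + previous);
        }
    }

    /** Runs both steps over the input and emits one event per token, ending with EOF. */
    static void validate(String text, Mode mode, Handler handler) {
        TokenType previous = null;
        int lineNo = 0;
        for (String line : text.split("\r\n|\r|\n")) { // Microsoft, classic Apple, Unix
            lineNo++;
            if (line.isEmpty()) continue; // assumption: blank lines are skipped
            TokenType type = lex(line, mode, lineNo);
            checkOrder(previous, type, lineNo);
            handler.onToken(type, line);
            previous = type;
        }
        if (previous != TokenType.SEQUENCE) {
            throw new IllegalArgumentException("input ends without sequence data");
        }
        handler.onToken(TokenType.EOF, "");
    }
}
```

A caller could pass a handler such as `(type, line) -> System.out.println(type + ": " + line)` to observe the events, or collect the lines into its own data structures. Separating the character check from the order check mirrors the split between syntactic and semantic validation described in the text.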
Escherichia coli K−12 ge (...truncated)