An advanced web query interface for biological databases
Mario Latendresse
Peter D. Karp
Bioinformatics Research Group, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects; that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skilled in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet is user friendly enough to be used by biologists. It supports queries that contain multiple conditions and that connect multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the BioCyc.org site that contains more than 500 Pathway/Genome DBs.
Introduction
Biological databases (DBs) now number in the hundreds,
and are widely viewed as an essential part of post-genomic
molecular biology. However, significant barriers limit
biologists' access to biological DBs. Existing easy-to-use
web-based query interfaces to biological DBs severely limit the
complexity of queries that the user can formulate. Users
who want to formulate complicated queries must learn
both a DB query language such as SQL, and a computer
programming language such as C or Java in which to
embed those queries and process the results. Learning
such languages is time consuming at best, and often
presents an insurmountable hurdle for the biologist. One
reason is that the semantics of SQL is based on concepts
not commonly taught to scientists in the course of their
university education (e.g. joins).
We present a flexible web interface through which
biologists and bioinformaticists can author precise queries to
biological DBs. The queries that can be written with this
interface, called the Structured Advanced Query Page
(SAQP), can be as precise as what can be expected from a
computer programmer using expressive DB query
languages like SQL. But the web interface can be used without
programming expertise, and use of the SAQP avoids the
types of errors that typically occur when writing computer
programs, greatly reducing the barriers to writing precise
queries.
A precise query is formulated in such a way that it
returns what the user wants without superfluous results. To
enable precision in a query, an appropriate set of relational
operators, as well as direct access to the underlying data,
must be provided. A precise query is unlike an imprecise
query that returns many results that must be disregarded
by the user. A precise query can be simple or complex:
retrieving all the proteins of a DB takes a simple query,
whereas retrieving all proteins that are products of genes
located in specific parts of the genome takes a complex
query involving at least two classes of objects and
several constraints.
A complex query might involve more than one DB class,
might span several DBs and might specify many constraints.
Examples include 'find all metabolic pathways containing more
than four reactions for which all enzymes in the pathway
are monomeric' and 'find all biochemical reactions that
convert a carbohydrate to a phosphorylated carbohydrate,
and where the molecular weight of the carbohydrate is less
than 100'.
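To make the first example concrete, the following is a minimal Python sketch of how such a query reads as a list comprehension, the construct on which BioVelo is based. It is not BioVelo syntax and not the Pathway Tools schema: the classes `Pathway`, `Reaction` and `Enzyme`, the attribute `subunit_count` and the toy data are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Enzyme:
    name: str
    subunit_count: int  # hypothetical attribute: 1 means monomeric


@dataclass
class Reaction:
    name: str
    enzymes: list = field(default_factory=list)


@dataclass
class Pathway:
    name: str
    reactions: list = field(default_factory=list)


def monomeric_pathways(pathways, min_reactions=4):
    """Names of pathways with more than `min_reactions` reactions
    in which every enzyme of every reaction is monomeric."""
    return [
        p.name
        for p in pathways
        if len(p.reactions) > min_reactions
        and all(e.subunit_count == 1 for r in p.reactions for e in r.enzymes)
    ]


# Toy data, invented for illustration only.
mono = Enzyme("enzA", 1)
dimer = Enzyme("enzB", 2)
pathways = [
    Pathway("long-mono", [Reaction(f"r{i}", [mono]) for i in range(5)]),
    Pathway(
        "long-mixed",
        [Reaction(f"r{i}", [mono]) for i in range(4)] + [Reaction("r4", [dimer])],
    ),
    Pathway("short-mono", [Reaction("r0", [mono])]),
]

print(monomeric_pathways(pathways))
```

The query touches three classes of objects, yet the comprehension never states a join: related objects are reached by following attributes from one object to the next.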
A simple query typically involves searching one DB and
one class, using one or two constraints. Examples include a
query that searches for genes by name, or that searches for
biochemical reactions by EC number, or for chemical
compounds by molecular weight.
The flexibility, ease of use and precision of the SAQP are
due to (i) readable, interactive, expandable form
constructs for specifying search constraints and object
attributes; (ii) the inclusion of variables in searches, which
allow search components to refer to one another; and (iii)
the inclusion of multiple search components within one
query. The notion of readability of precise database
queries is also presented in the doctoral dissertation of
M. Bada (1).
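Points (ii) and (iii) can be sketched in Python, whose list comprehensions are a rough analogue of the list-comprehension basis of BioVelo. The sketch below is not BioVelo syntax; the classes `Gene` and `Protein`, their attributes and the sample data are hypothetical. Each `for` clause plays the role of one search component bound to a variable, and the second component's constraint refers back to the variable of the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    start: int  # hypothetical attribute: left-end position in base pairs
    end: int    # hypothetical attribute: right-end position in base pairs


@dataclass(frozen=True)
class Protein:
    name: str
    gene_id: str  # identifier of the gene that encodes this protein


def proteins_in_region(genes, proteins, lo, hi):
    """(gene id, protein name) pairs for genes lying entirely within [lo, hi]."""
    return [
        (g.gene_id, p.name)
        for g in genes                 # first search component, variable g
        if lo <= g.start and g.end <= hi
        for p in proteins              # second search component, variable p
        if p.gene_id == g.gene_id      # constraint on p refers back to g
    ]


# Toy data, invented for illustration only.
genes = [Gene("g1", 100, 900), Gene("g2", 5000, 6200), Gene("g3", 1200, 1800)]
proteins = [Protein("P1", "g1"), Protein("P2", "g2"), Protein("P3", "g3")]

print(proteins_in_region(genes, proteins, 0, 2000))
```

Because the link between the two components is written as an ordinary condition on named variables, the reader does not need the relational notion of a join to follow the query.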
The SAQP web interface is based on BioVelo, a DB query
language newly developed as part of this project that is
more expressive than SQL but has a succinct syntax and
simpler semantics. In fact, the layout of the
graphical interface of the SAQP is based on the syntax of
BioVelo. We consider BioVelo's syntax terse enough that we
also provide a user interface for direct entry of BioVelo
queries. This interface, called the Free Form Advanced
Query Page (FFAQP), is accessible from the SAQP in one
click.
Figure 1 illustrates the general architecture of our DB
query system. Two Web page interfaces are provided: the
FFAQP and the SAQP. In this article, we focus on the SAQP.
The development of the SAQP was motivated by the
need to provide biologists with the ability to query the
large collection of Pathway/Genome DBs (PGDBs) being
developed by users of the Pathway Tools software, including
the more than 500 PGDBs within SRI's BioCyc collection (2,3)
and the more than 200 PGDBs developed by groups outside
SRI, such as YeastCyc and MouseCyc (4) (see BioCyc.org
for a partial list). Pathway Tools PGDBs are managed using
a Frame Knowledge Representation System called Ocelot
(5). Ocelot is essentially a Common Lisp-based
object-oriented database management system (DBMS) that uses
a relational DBMS as a persistent back end. However, the
relational aspect of Ocelot is invisible to the Ocelot user.
[Figure 1 labels: Free Form Advanced Query Page; Structured Advanced Query Page; Web Browser; Pathway Tools Web Server]
Figure 1. Two web page interfaces for constructing BioVelo
queries interact with the Pathway Tools web server that
communicates with a BioVelo query processor. Ocelot is an
object-oriented DB system that can use a relational DB back
end. The FFAQP and the SAQP are accessible at the web site
BioCyc.org/query.shtml.
In summary, BioVelo serves as a database query
language, currently built on Ocelot, and the SAQP is a
user-friendly interface built on BioVelo. The BioVelo
and SAQP implementations are applicable to any DB built
using Ocelot, and thus generalize beyond Pathway Tools
DBs such as the BioCyc DBs. Actually, the overall approach
used for the SAQP is applicable to other relatio (...truncated)