BioMart – biological queries made easy
BMC Genomics
Damian Smedley 2
Syed Haider 2
Benoit Ballester 2
Richard Holland 2
Darin London 1
Gudmundur Thorisson 0
Arek Kasprzyk 3
0 Department of Genetics, University of Leicester, University Road, Leicester, LE1 7RH, UK
1 Institute for Genome Sciences & Policy (IGSP), Duke University CIEMAS, 101 Science Drive, DUMC Box 3382, Durham, NC 27708, USA
2 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
3 Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, M5G 0A3, Canada
Background: Biologists need to perform complex queries, often across a variety of databases. Typically, each data resource provides its own advanced query interface, which the biologist must learn before they can begin to query it. Frequently, more than one data source is required, and for high-throughput analysis, cutting and pasting results between websites is very time consuming. Therefore, many groups rely on local bioinformatics support to process queries by accessing the resources' programmatic interfaces, if they exist. This is not an efficient solution in terms of cost and time. Instead, it would be better if the biologist only had to learn one generic interface. BioMart provides such a solution.
Results: BioMart enables scientists to perform advanced querying of biological data sources through a single web interface. The power of the system comes from integrated querying of data sources regardless of their geographical locations. Once these queries have been defined, they may be automated with BioMart's "scripting at the click of a button" functionality. BioMart's capabilities are extended by integration with several widely used software packages such as BioConductor, DAS, Galaxy, Cytoscape and Taverna. In this paper, we describe all aspects of BioMart from a user's perspective and demonstrate how it can be used to solve real biological use cases such as SNP selection for candidate gene screening or annotation of microarray results.
Conclusion: BioMart is an easy-to-use, generic and scalable system and has therefore become an integral part of large data resources including Ensembl, UniProt, HapMap, Wormbase, Gramene, Dictybase, PRIDE, MSD and Reactome. BioMart is freely accessible at http://www.biomart.org.
Background
In this post-genomics era, data of increasing volume and
complexity is being deposited into databases around the
world. Biologists need to ask complex queries of this data
to test and drive their research hypotheses. Typically, each
data source provides an advanced query interface on its
website to satisfy this requirement. However, each site has
its own solution and, consequently, the user faces a learning
curve before they can start interacting with the data. A
further problem is that researchers often need to
query more than one data source, which means mastering
more than one interface and cutting and pasting
results between the sites. If the analysis involves
high-throughput data, this approach does not usually scale. To
overcome this problem, many groups rely on
bioinformaticians who can write scripts to interact with the
varying programmatic interfaces of the different data sources.
These bioinformaticians in turn often have to learn a different web
service or application programming interface (API) for
each resource. A preferable solution would be to have
generic software that a biologist can use on top of any data
source. BioMart [1] is such a solution.
BioMart is an open source data management system that
comes with a range of query interfaces that allow users to
group and refine data based upon many different criteria.
In addition, the software features a built-in query
optimiser for fast data retrieval. A BioMart installation can
provide domain-specific querying of a single data source
or function as a one-stop shop (web portal) to a wide
range of BioMarts as our central portal [2] does. All
BioMart websites have the same look and feel (only
varying in colour scheme and branding), which has obvious
advantages to users moving between different resources.
However, the power of the system comes from integrated
querying of the different BioMarts. If datasets share
common identifiers (such as Ensembl gene IDs or
UniProt IDs) or even mappings to a common genome
assembly, these can be used to link BioMarts together in
integrated queries. Additionally, these datasets do not
have to be located on the same server or even at the same
geographical location. This distributed solution has many
advantages, not least that each site can
use its own domain expertise to deploy its
BioMart.
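The idea of linking datasets through a shared identifier can be illustrated with a minimal sketch. This is not BioMart code: the two in-memory datasets, their field names and all identifier values are hypothetical, and a simple hash join stands in for whatever a real BioMart deployment does when it combines datasets held on different servers.

```python
from collections import defaultdict

# Hypothetical rows from two independent datasets (illustrative values only).
gene_dataset = [
    {"gene_id": "GENE0001", "symbol": "abc1", "chromosome": "1"},
    {"gene_id": "GENE0002", "symbol": "xyz2", "chromosome": "2"},
    {"gene_id": "GENE0003", "symbol": "def3", "chromosome": "2"},
]
pathway_dataset = [
    {"gene_id": "GENE0001", "pathway": "pathway A"},
    {"gene_id": "GENE0003", "pathway": "pathway B"},
    {"gene_id": "GENE0003", "pathway": "pathway C"},
]


def link_datasets(left, right, key):
    """Join two lists of dict rows on a shared identifier field (hash join)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    linked = []
    for row in left:
        # Rows without a partner in the second dataset are dropped.
        for match in index.get(row[key], []):
            linked.append({**row, **match})
    return linked


def linked_query(chromosome):
    """Filter the first dataset, then pull an attribute from the linked one."""
    selected = [g for g in gene_dataset if g["chromosome"] == chromosome]
    joined = link_datasets(selected, pathway_dataset, "gene_id")
    return [(row["symbol"], row["pathway"]) for row in joined]


print(linked_query("2"))
```

The filter is applied to one dataset and the attribute is taken from the other, which is the pattern an integrated query follows; the only requirement is that both sides expose the same identifier.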
BioMart also has the advantage of being integrated with
external software packages such as BioConductor [3], the
Distributed Annotation System (DAS) [4], Galaxy [5],
Cytoscape [6] and Taverna [7]. This enables users to perform
integrated queries with non-BioMart data sources as well
as detailed analysis of the results. BioMart is also part of
the GMOD (Generic Model Organism Database) [8] suite
of tools for building a model organism site.
Originally developed for the Ensembl genome browser [9]
as the EnsMart data warehouse [10], BioMart has now
become a fully generic data integration solution.
Although applicable to any type of data, BioMart is
particularly suited for advanced searching of the complex
descriptive data typically found in biological datasets.
Numerous BioMarts have now been installed by external
groups, in large part because of the system's automated deployment
tools and cross-platform compatibility. These include
model organism databases such as Gramene [11],
Dictybase [12], Wormbase [13] and RGD (Rat Genome
Database) [14] as well as the HapMap variation [15], Pancreatic
Expression Database [16], Reactome pathway [17] and
PRIDE proteomics [18] databases (see Table 1 for the full
list). A wide variety of analyses and tasks are possible with
the publicly available BioMarts, ranging from SNP (single
nucleotide polymorphism) selection for candidate gene
screening, microarray annotation and cross-species analysis
through to recovery of disease links, sequence variations
and expression patterns.
The range of interfaces is designed with both biologists
and bioinformaticians in mind. The simplest way of
querying BioMart is via the web interface called MartView
(available either on our central portal [2] or through the links on our
main page [1] to the individual sites). Programmatic
access is available via a Perl API or BioMart's web services
(MartServices). An important and novel feature of
BioMart is th (...truncated)