GeneCards Version 3: the human gene integrator
Marilyn Safran [1,2], Irina Dalah [2], Justin Alexander [2], Naomi Rosen [2], Tsippi Iny Stein [2], Michael Shmoish [0,2], Noam Nativ [2], Iris Bahir [2], Tirza Doniger [2], Hagit Krug [2], Alexandra Sirota-Madi [2,4], Tsviya Olender [2], Yaron Golan [3], Gil Stelzer [2], Arye Harel [2], Doron Lancet [2]
[0] Bioinformatics Knowledge Unit, Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering, Technion - Israel Institute of Technology, Haifa, Israel
[1] Department of Biological Services, Weizmann Institute of Science, Rehovot, Israel
[2] Department of Molecular Genetics
[3] Xennex Inc, Cambridge, MA, USA
[4] The Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73 000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards' unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes, and GeneDecks, which identifies similar genes with shared annotations and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs, and features lists of gene-related drugs and compounds. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene's functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite.
Database URL: www.genecards.org
Introduction
With the recent accumulation of data from worldwide
genome projects, the individual scientist faces the
time-consuming and laborious task of sifting through the expanding
labyrinth of gene information. This can be partly alleviated
by the use of sophisticated integrated and searchable
databases. For many years, GeneCards (www.genecards.org)
(1–3) has provided such a remedy, with carefully selected,
comprehensive information about human genes, mined
and integrated from over 80 data sources. By bringing
together gene information from large public sources such as
HGNC (4), NCBI (5), ENSEMBL (6) and UniProtKB (7), as well
as many other smaller resources (8), GeneCards has
provided concise genome, proteome, transcriptome, disease
and function data on all known and predicted human
genes. It has successfully overcome barriers of data
format heterogeneity using standard nomenclature,
especially gene symbols approved by the HUGO Gene
Nomenclature Committee (4). The information is organized in a card format for
each gene, in distinct functional sections and including a
variety of features such as textual summaries and links to
other genome-wide and specialized databases. GeneCards
has evolved significantly since initially described (1,9,10),
and its progress is documented in a number of past
publications (2,3,11–15). In this article, we introduce the new
GeneCards Version 3 (V3) and describe its features in
detail. We place special emphasis on the novel set-centric
capabilities (beyond and in conjunction with the new
GeneCards search engine), which address a variety of
applications, including microarray data analysis, cross-database
annotation mapping and gene-disorder associations for
drug targeting.
Readers who are new to GeneCards might want to read
the Applications section below first, familiarize themselves
with previous articles (1–3), and then read the rest of this
article, possibly skipping the Methods section.
GeneCards version 3
The new home page
The new GeneCards V3 home page, shown in Figure 1,
hosts the new search facility, provides links to a sample
gene and its various sections on the card via labeled oval
buttons, and enables one to view a variety of differently
categorized and annotated genes, from pre-defined links
as well as by interacting with a random-gene generator,
customizable by category and/or GeneCards Inferred
Functionality Score (GIFtS). The GIFtS algorithm (11) uses
the wealth of GeneCards annotations to produce
annotation scores aimed at predicting the degree of a gene's
functionality. Since the degree of known functionality is
correlated with the amount of research done on a
particular gene or its product, these annotation scores are
presented as inferred functionality measures. The extended
GIFtS tool, linked to from the home page, facilitates
browsing the human genome by searching for the annotation
level of a specified gene, retrieving a list of genes within
a specified range of GIFtS values, obtaining random genes
with a specific GIFtS value, and experimenting with the
GIFtS weighting algorithm for a variety of annotation
categories. The left-hand side of the home page retains the
logos and links to the GeneCards suite's sites: GeneDecks,
GeneALaCart, GeneLoc, GeneNote, GeneAnnot and
GeneTide.
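The idea of an annotation-based score, and of retrieving genes within a score range, can be illustrated with a minimal sketch. This is not the published GIFtS algorithm (11): it simply assumes that a score is the weighted share of annotation categories for which a gene has at least one entry, scaled to 0-100. The category names, the gene symbols and the function names `annotation_score` and `genes_in_range` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical annotation categories (assumption; not the actual GeneCards source list).
CATEGORIES = ["aliases", "expression", "pathways", "disorders", "publications"]


def annotation_score(
    annotations: Dict[str, List[str]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return a 0-100 score: the weighted share of categories with any annotation."""
    weights = weights or {category: 1.0 for category in CATEGORIES}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    # A category counts as present when the gene has a non-empty list for it.
    present = sum(w for category, w in weights.items() if annotations.get(category))
    return 100.0 * present / total


def genes_in_range(
    cards: Dict[str, Dict[str, List[str]]],
    low: float,
    high: float,
    weights: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return the sorted symbols of genes whose score lies within [low, high]."""
    return sorted(
        symbol
        for symbol, annotations in cards.items()
        if low <= annotation_score(annotations, weights) <= high
    )


# Toy data with made-up gene symbols and annotations.
toy_cards = {
    "GENE_A": {"aliases": ["A1"], "expression": ["liver"], "pathways": ["p1"],
               "disorders": ["d1"], "publications": ["ref1"]},
    "GENE_B": {"aliases": ["B1"], "expression": ["brain"]},
    "GENE_C": {},
}

well_annotated = genes_in_range(toy_cards, 50, 100)
```

Passing a different `weights` dictionary mimics experimenting with the weighting of annotation categories: raising the weight of one category raises the score of every gene annotated in it.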
The new search engine
The new V3 search engine is extremely fast and is
capable of matching complex field-specific queries of the
entire database in milliseconds. For example, a search for a
very common keyword like 'cancer' returns 8000 results in
3 ms. In contrast, V2 could not handle such a query, or even
a more focused one such as 'melanoma' (too many results
to be efficiently displayed); a considerably more restricted
search in V2 such as 'schizophrenia' yielded 1100 results
and took 80 s. Efficient V3 performance is achieved by
breaking the search process into distinct phases, and also
by returning results in limited pages of data. The two
primary stages of each search are: (i) to first quickly identify
the list of genes that have information matching the search
term, and (ii) upon demand, discover the detailed relevant
context and annotation details of those hits, and highlight
them in minicards (Figure 2). The Methods section details
the design of the new search engine.
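The two-stage, paged search described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the GeneCards implementation: it assumes an in-memory inverted index from single words to gene symbols for stage (i), and computes the highlighted context only for the genes on the requested page for stage (ii). The class name `TwoPhaseSearch`, its methods and the sample cards are hypothetical, and field-specific queries are not covered.

```python
import re
from collections import defaultdict
from typing import Dict, List


class TwoPhaseSearch:
    """Toy two-phase search: fast gene lookup first, per-hit context on demand."""

    def __init__(self, cards: Dict[str, Dict[str, str]]) -> None:
        # cards maps a gene symbol to {section name: free text}.
        self.cards = cards
        self.index = defaultdict(set)
        for symbol, sections in cards.items():
            for text in sections.values():
                for token in re.findall(r"[a-z0-9]+", text.lower()):
                    self.index[token].add(symbol)

    def find_genes(self, term: str) -> List[str]:
        """Phase 1: return the symbols of all genes whose text contains the term."""
        return sorted(self.index.get(term.lower(), set()))

    @staticmethod
    def page(hits: List[str], page_number: int, size: int = 2) -> List[str]:
        """Return one limited page of hits (page_number starts at 0)."""
        start = page_number * size
        return hits[start:start + size]

    def minicard(self, symbol: str, term: str) -> Dict[str, str]:
        """Phase 2: return matching sections of one gene with the term bracketed."""
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        return {
            section: pattern.sub(lambda m: "[" + m.group(0) + "]", text)
            for section, text in self.cards[symbol].items()
            if pattern.search(text)
        }


# Toy cards with made-up gene symbols and text.
toy_cards = {
    "GENE_A": {"summary": "Linked to cancer progression", "function": "Kinase"},
    "GENE_B": {"summary": "Olfactory receptor"},
    "GENE_C": {"disorders": "Breast cancer"},
}

engine = TwoPhaseSearch(toy_cards)
hits = engine.find_genes("Cancer")
first_page = engine.page(hits, 0, size=1)
details = [engine.minicard(symbol, "cancer") for symbol in first_page]
```

Because `minicard` runs only for the genes on the page being displayed, the cost of extracting and highlighting context does not grow with the total number of hits, which is the point of separating the two stages.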
The upgraded GeneCards webcard
The card presented for (...truncated)