GeneLoc: exon-based integration of human genome maps
Vol. 19 Suppl. 1 2003, pages i222–i224
DOI: 10.1093/bioinformatics/btg1030
BIOINFORMATICS
Naomi Rosen, Vered Chalifa-Caspi, Orit Shmueli, Avital Adato,
Michal Lapidot, Julie Stampnitzky, Marilyn Safran ∗ and Doron
Lancet
Weizmann Institute of Science, Rehovot, Israel
Received on January 6, 2003; accepted on February 20, 2003
INTRODUCTION
Whole genome databases on the World Wide Web include:
NCBI, with access to RefSeq, LocusLink, and the Human
Genome MapViewer (Wheeler et al., 2002); the Ensembl
database project, which annotates the genome, integrating
data from other sources with its own predictions (Hubbard
et al., 2002); and the Human Genome Browser at UCSC,
which provides a graphical viewer of the genome with
‘tracks,’ each of which shows different information about
the area in question (Kent et al., 2002). In parallel,
we have developed the Unified Database for Human
Genome Mapping (UDB) (Chalifa-Caspi et al., 1997;
Safran et al., 2003), which sorts genomic objects by
chromosomal location to create an integrated genome map
on a megabase scale with genes, markers, and ESTs.
Although whole-genome mapping resources currently
use NCBI’s assembly and all contain large gene lists,
no two lists are identical. While all of the sources use
prediction programs (Burge and Karlin, 1997; Kulp et al.,
1997; Yeh et al., 2001), different programs and parameters
can produce varying results. Moreover, as in the case of
Ensembl, many model genes with only weak support are
omitted (Hubbard et al., 2002). In contrast, in an effort
∗ To whom correspondence should be addressed.
to provide a comprehensive gene list, NCBI’s LocusLink
contains thousands of model genes, categorized by level
and type of support. Even known genes appearing in every
database may have different names in each database. A
biologist must move among databases to determine which
genes are the same and which might be the novel gene
being sought. UCSC’s Genome Browser website maps genes
from several sources on the same scale, but the maps are
not integrated, making it difficult to relate genes from
different sources. As Jongeneel (2000) stated, ‘there is an
urgent need for a human gene index that can be used to
identify transcripts unambiguously.’ The author contends
that this index should have, among others, the following
qualities: comprehensiveness, uniqueness, and stability.
We therefore developed GeneLoc, a software adjunct
to GeneCards (Rebhan et al., 1998; Safran et al., 2002)
and UDB that unifies gene collections, eliminates redundancies, and assigns a meaningful location-based identifier to each gene in the index. GeneLoc currently works
with gene sets from NCBI and Ensembl. It aims to compare genes in these collections and decide which should
be unified as one entry and which are discrete. Since the
gene annotations use the same assembly and coordinate
scheme, GeneLoc effects this gene integration by comparing the genes’ genomic locations. The resulting GeneLoc
‘gene territory’ reflects the range of the unified genes, taking into account all possible exons.
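The ‘gene territory’ idea can be sketched in a few lines. This is only an illustration under stated assumptions: each source is assumed to report a gene as a list of (start, end) exon coordinates on one strand of a shared assembly, and the function name, tuple layout, and sample coordinates are hypothetical rather than taken from GeneLoc itself.

```python
from typing import Iterable, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive, start <= end (assumed layout)


def gene_territory(genes: Iterable[List[Exon]]) -> Tuple[int, int]:
    """Return the (start, end) span covering every exon of the given genes."""
    exons = [exon for gene in genes for exon in gene]
    if not exons:
        raise ValueError("at least one exon is required")
    return min(s for s, _ in exons), max(e for _, e in exons)


# Hypothetical coordinates for one gene as annotated by two sources.
source_a = [(1000, 1200), (1500, 1700)]
source_b = [(900, 1150), (1500, 1700), (2000, 2100)]
territory = gene_territory([source_a, source_b])
print(territory)  # (900, 2100)
```

Taking the minimum start and maximum end over all exons, rather than over one source's gene boundaries, means that an exon reported by only one source still extends the territory.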
ALGORITHM
GeneLoc first obtains genomic information from each
source, including position of each gene and exon, names
and ids, and gene validation status. It then builds two
separate maps. Exon Map, created by comparing positions
of nearby exons, includes all possible exons of each
gene from all sources, regardless of their alternative
combinations in mRNA splice variants. Each group of
overlapping exons, and each single non-overlapping exon,
gets one ‘exon group’ number. Gene Map, which includes
all genes from all sources with their details, is made
by comparing all neighboring and overlapping genes and
© Oxford University Press 2003; all rights reserved.
ABSTRACT
Motivation: Despite the numerous available whole-genome mapping resources, no comprehensive, integrated map of the human genome yet exists.
Results: GeneLoc, a software adjunct to GeneCards and
UDB, integrates gene lists by comparing genomic coordinates at the exon level and assigns a unique and meaningful
identifier to each gene.
Availability: http://bioinfo.weizmann.ac.il/genecards and
http://genecards.weizmann.ac.il/udb
Supplementary information: http://bioinfo.weizmann.ac.il/cards-bin/AboutGCids.cgi, http://genecards.weizmann.ac.il/GeneLocAlg.html
Contact:
Fig. 1. The relative numbers of genes.
each source to maintain internal consistency. Nevertheless,
the GeneLoc algorithm is currently being extended to
consolidate as many genes resulting from these clusters
as possible (see Supplementary Information). GeneLoc
uniquely features association of genes with others, based
on this overlapping-exon criterion.
Validation of this algorithm by elimination of the ‘match
by symbol’ step showed a success rate of over 98% for gene
matching (see Supplementary Information).
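The exon-group numbering and the overlapping-exon criterion described in this section can be illustrated with a simple sweep over sorted exons. The sketch below is not GeneLoc's implementation. It assumes that all exons lie on the same chromosome and strand, that coordinates are inclusive, and that overlap is treated transitively (a chain of overlapping exons forms one group). The function names and exon identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[str, int, int]  # (exon id, start, end); hypothetical layout


def assign_exon_groups(exons: List[Exon]) -> Dict[str, int]:
    """Give overlapping exons the same group number, counting from 1."""
    groups: Dict[str, int] = {}
    group = 0
    group_end = None  # right-most end seen in the current group
    for exon_id, start, end in sorted(exons, key=lambda e: (e[1], e[2])):
        if group_end is None or start > group_end:
            group += 1  # no overlap with the current group: open a new one
            group_end = end
        else:
            group_end = max(group_end, end)  # overlap: extend the group
        groups[exon_id] = group
    return groups


def genes_share_exon_group(gene_a: List[str], gene_b: List[str],
                           groups: Dict[str, int]) -> bool:
    """True if the two genes have at least one exon group in common."""
    return bool({groups[e] for e in gene_a} & {groups[e] for e in gene_b})


# Hypothetical exons from two sources (ids are illustrative only).
exons = [("LL_e1", 100, 200), ("ENS_e1", 150, 260),
         ("LL_e2", 400, 500), ("ENS_e2", 520, 600)]
groups = assign_exon_groups(exons)
print(groups)
print(genes_share_exon_group(["LL_e1", "LL_e2"], ["ENS_e1", "ENS_e2"], groups))
```

Here the first exons of the two genes overlap and fall in one group, so the genes would be candidates for association under the overlapping-exon criterion. The remaining exons each get a group of their own.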
RESULTS
At this writing, there are 39 155 genes in GeneLoc. Many
were unified from the LocusLink and Ensembl gene lists
(33 845 and 22 980 genes, respectively) by GeneLoc’s
matching algorithm (Fig. 1). There are 15 092 LocusLink
genes not corresponding to any Ensembl genes and 4383
from Ensembl not matching any LocusLink genes. An
additional 1954 genes have overlapping exons with several
other GeneLoc genes. These are part of a total of 6182
genes that GeneLoc identified in 1692 different clusters
(Fig. 2). GeneCards currently has 46 179 entries, including
the GeneLoc genes and 7024 genes from LocusLink with
no associated coordinates.
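The counts quoted above can be cross-checked with simple arithmetic. The snippet below only restates numbers from the text. The two derived values assume that the per-source totals (33 845 and 22 980) count only genes placed in GeneLoc, which the text implies but does not state outright.

```python
# Counts quoted in the Results text.
geneloc_genes = 39_155
locuslink_genes = 33_845
ensembl_genes = 22_980
locuslink_only = 15_092
ensembl_only = 4_383
locuslink_no_coords = 7_024
genecards_entries = 46_179

# GeneCards entries = GeneLoc genes + LocusLink genes lacking coordinates.
assert geneloc_genes + locuslink_no_coords == genecards_entries

# Derived under the assumption stated above: genes in each source that
# correspond to at least one gene in the other source.
locuslink_matched = locuslink_genes - locuslink_only  # 18753
ensembl_matched = ensembl_genes - ensembl_only        # 18597
print(locuslink_matched, ensembl_matched)
```

Under that assumption the two matched counts differ, which would be consistent with matches between the sources not always being one-to-one.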
GeneLoc’s results can be seen in UDB and GeneCards.
Since GeneLoc offers a combined view of genes, markers,
genomic sequences, and their absolute positions, genomic
areas of interest can be viewed from GeneCards via a
gene-based query, or from UDB via a mapping- or position-based query.
DISCUSSION AND CONCLUSION
The combined resources from UDB and GeneCards give
a good picture of genes of interest while providing other
useful details, and complement other resources, such as
NCBI, Ensembl, and UCSC. UDB’s tab-delimited display
of GeneLoc’s results allows viewing of genomic objects in
the context of nearby objects, regardless of strand, object
type, or data source. Moreover, since GeneLoc integrates
genes from several other resources, its gene list is more
thorough than those of other resources (Fig. 3).
UDB/GeneLoc’s ability to display chromosomal regions
between any two genetic markers, and to display even
large genomic areas at a glance, makes it invaluable
for positional cloning. Moreover, by taking advantage
of our GeneNote project (Shmueli et al., 2003), it will
soon be possible to include gene expression information
in GeneLoc. As a result, normal human tissue gene
deciding which gene pairs can be merged. Decisions are
hierarchical—each chromosome (and each unlocalized
contig) is read strand by strand, performing the following
steps: (I) genes are compared to f (...truncated)