GeneLoc: exon-based integration of human genome maps

Jul 2003

Motivation: Despite the numerous available whole-genome mapping resources, no comprehensive, integrated map of the human genome yet exists.
Results: GeneLoc, software adjunct to GeneCards and UDB, integrates gene lists by comparing genomic coordinates at the exon level and assigns unique and meaningful identifiers to each gene.
Availability: http://bioinfo.weizmann.ac.il/genecards and http://genecards.weizmann.ac.il/udb
Supplementary information: http://bioinfo.weizmann.ac.il/cards-bin/AboutGCids.cgi, http://genecards.weizmann.ac.il/GeneLocAlg.html
Contact: marilyn.safran{at}weizmann.ac.il

BIOINFORMATICS Vol. 19 Suppl. 1 2003, pages i222–i224
DOI: 10.1093/bioinformatics/btg1030

Naomi Rosen, Vered Chalifa-Caspi, Orit Shmueli, Avital Adato, Michal Lapidot, Julie Stampnitzky, Marilyn Safran∗ and Doron Lancet
Weizmann Institute of Science, Rehovot, Israel
Received on January 6, 2003; accepted on February 20, 2003
∗ To whom correspondence should be addressed.

INTRODUCTION
Whole genome databases on the World Wide Web include: NCBI, with access to RefSeq, LocusLink, and the Human Genome MapViewer (Wheeler et al., 2002); the Ensembl database project, which annotates the genome, integrating data from other sources with its own predictions (Hubbard et al., 2002); and the Human Genome Browser at UCSC, which provides a graphical viewer of the genome with ‘tracks,’ each of which shows different information about the area in question (Kent et al., 2002). In parallel, we have developed the Unified Database for Human Genome Mapping (UDB) (Chalifa-Caspi et al., 1997; Safran et al., 2003), which sorts genomic objects by chromosomal location to create an integrated genome map on a megabase scale with genes, markers, and ESTs.

Although whole-genome mapping resources currently use NCBI’s assembly and all contain large gene lists, no two lists are identical. While all of the sources use prediction programs (Burge and Karlin, 1997; Kulp et al., 1997; Yeh et al., 2001), different programs and parameters can produce varying results. Moreover, as in the case of Ensembl, many model genes with only weak support are omitted (Hubbard et al., 2002). In contrast, in an effort to provide a comprehensive gene list, NCBI’s LocusLink contains thousands of model genes, categorized by level and type of support. Even known genes appearing in every database may have different names in each database. The biologist must move among databases to figure out which genes are the same, and which could be a novel gene sought.
UCSC’s Genome Browser website maps genes from several sources on the same scale, but the maps are not integrated, making it difficult to relate genes from different sources. As stated (Jongeneel, 2000), ‘there is an urgent need for a human gene index that can be used to identify transcripts unambiguously.’ The author contends that this index should have, among others, the following qualities: comprehensiveness, uniqueness, and stability.

We therefore developed GeneLoc, a software adjunct to GeneCards (Rebhan et al., 1998; Safran et al., 2002) and UDB that unifies gene collections, eliminates redundancies, and assigns a meaningful location-based identifier to each gene in the index. GeneLoc currently works with gene sets from NCBI and Ensembl. It aims to compare genes in these collections and decide which should be unified as one entry and which are discrete. Since the gene annotations use the same assembly and coordinate scheme, GeneLoc effects this gene integration by comparing the genes’ genomic locations. The resulting GeneLoc ‘gene territory’ reflects the range of the unified genes, taking into account all possible exons.

ALGORITHM
GeneLoc first obtains genomic information from each source, including position of each gene and exon, names and ids, and gene validation status. It then builds two separate maps. Exon Map, created by comparing positions of nearby exons, includes all possible exons of each gene from all sources, regardless of their alternative combinations in mRNA splice variants. Each group of overlapping exons, and each single non-overlapping exon, gets one ‘exon group’ number. Gene Map, which includes all genes from all sources with their details, is made by comparing all neighboring and overlapping genes and deciding which gene pairs can be merged. Decisions are hierarchical: each chromosome (and each unlocalized contig) is read strand by strand, performing the following steps: (I) genes are compared to f (...truncated)

[...] each source to maintain internal consistency. Nevertheless, the GeneLoc algorithm is currently being extended to consolidate as many genes resulting from these clusters as possible (see Supplementary Information). GeneLoc uniquely features association of genes with others, based on this overlapping-exon criterion. Validation of this algorithm by elimination of the ‘match by symbol’ step showed over 98% success rate for gene matching (see Supplementary Information).

RESULTS
At this writing, there are 39 155 genes in GeneLoc. Many were unified from the LocusLink and Ensembl gene lists (33 845 and 22 980 genes, respectively) by GeneLoc’s matching algorithm (Fig. 1). There are 15 092 LocusLink genes not corresponding to any Ensembl genes and 4383 from Ensembl not matching any LocusLink genes. An additional 1954 genes have overlapping exons with several other GeneLoc genes. These are part of a total of 6182 genes that GeneLoc identified in 1692 different clusters (Fig. 2). GeneCards currently has 46 179 entries, including the GeneLoc genes and 7024 genes from LocusLink with no associated coordinates.

Fig. 1. The relative numbers of genes.

GeneLoc’s results can be seen in UDB and GeneCards. Since GeneLoc offers a combined view of genes, markers, genomic sequences, and their absolute positions, genomic areas of interest can be viewed from GeneCards via a gene-based query, or UDB via a mapping- or position-based query.

DISCUSSION AND CONCLUSION
The combined resources from UDB and GeneCards give a good picture of genes of interest while providing other useful details, and complement other resources, such as NCBI, Ensembl, and UCSC. UDB’s tab-delimited display of GeneLoc’s results allows viewing of genomic objects in the context of nearby objects, regardless of strand, object type, or data source. Moreover, since GeneLoc integrates genes from several other resources, its gene list is more thorough than those of other resources (Fig. 3). UDB/GeneLoc’s ability to display chromosomal regions between any two genetic markers, and to display even large genomic areas at a glance, makes it invaluable for positional cloning. Moreover, by taking advantage of our GeneNote project (Shmueli et al., 2003), it will soon be possible to include gene expression information in GeneLoc. As a result, normal human tissue gene [...]

© Oxford University Press 2003; all rights reserved.
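The Exon Map step described under ALGORITHM (numbering each group of overlapping exons) and the overlapping-exon criterion for associating genes across sources can be illustrated with a short sketch. This is not GeneLoc's implementation: the hierarchical merge steps are truncated in this text, so the code below only shows plain interval grouping on a single strand, a naive "share at least one exon group" pairing, and a territory computed as the span of all exons of the paired genes. The names `Exon`, `assign_exon_groups`, `genes_sharing_exon_groups`, and `gene_territory`, the inclusive-coordinate convention, and the example identifiers are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Exon(NamedTuple):
    source: str  # hypothetical source label, e.g. "LocusLink" or "Ensembl"
    gene: str    # gene identifier within its source (made up here)
    start: int   # genomic start coordinate, assumed inclusive
    end: int     # genomic end coordinate, assumed inclusive


def assign_exon_groups(exons):
    """Give one group number to each run of overlapping exons.

    Assumes all exons lie on the same strand of the same chromosome.
    An exon that overlaps nothing gets a group of its own.
    """
    groups = {}
    group_id = 0
    group_end = None  # right-most coordinate covered by the current group
    for exon in sorted(exons, key=lambda e: (e.start, e.end)):
        if group_end is None or exon.start > group_end:
            group_id += 1            # no overlap: open a new group
            group_end = exon.end
        else:
            group_end = max(group_end, exon.end)  # overlap: extend the group
        groups[exon] = group_id
    return groups


def genes_sharing_exon_groups(groups):
    """Return pairs of genes from different sources that share an exon group."""
    members = defaultdict(set)
    for exon, gid in groups.items():
        members[gid].add((exon.source, exon.gene))
    pairs = set()
    for genes in members.values():
        for a in genes:
            for b in genes:
                if a < b and a[0] != b[0]:
                    pairs.add((a, b))
    return pairs


def gene_territory(exons, genes):
    """Span from the first to the last exon of the given (source, gene) set."""
    own = [e for e in exons if (e.source, e.gene) in genes]
    return min(e.start for e in own), max(e.end for e in own)


# Toy data: one LocusLink-style gene with two exons, two Ensembl-style genes.
exons = [
    Exon("LocusLink", "LL_A", 100, 200),
    Exon("Ensembl", "ENS_1", 150, 260),
    Exon("LocusLink", "LL_A", 400, 500),
    Exon("Ensembl", "ENS_2", 900, 950),
]
groups = assign_exon_groups(exons)
pairs = genes_sharing_exon_groups(groups)
for a, b in sorted(pairs):
    print(a, b, gene_territory(exons, {a, b}))
```

In this toy input the first two exons overlap and form one exon group, so `LL_A` and `ENS_1` are paired and their combined territory runs from the first exon start to the last exon end, while `ENS_2` stays unmatched. A real integration would also have to handle strand, unlocalized contigs, validation status, and the clusters of genes that overlap several others, as the text describes.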



Naomi Rosen, Vered Chalifa-Caspi, Orit Shmueli, Avital Adato, Michal Lapidot, Julie Stampnitzky, Marilyn Safran, Doron Lancet. GeneLoc: exon-based integration of human genome maps, 2003, pp. i222-i224, 19/suppl 1, DOI: 10.1093/bioinformatics/btg1030