The UCSC Known Genes
Fan Hsu (1), W. James Kent (1), Hiram Clawson (1), Robert M. Kuhn (1), Mark Diekhans (1), David Haussler (0)
(0) Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
(1) Center for Biomolecular Science and Engineering, School of Engineering
The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page provides rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for the human, mouse and rat genomes. The Known Genes dataset serves as a foundation for several key programs offered at the UCSC website: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests that all of them could be further improved.
The Author 2006. Published by Oxford University Press. All rights reserved.
INTRODUCTION
The UCSC Genome Browser (Kent et al., 2002; Karolchik et al.,
2003), which was developed in conjunction with the assembly and
publication of the first Human Draft Genome (International Human
Genome Sequencing Consortium, 2001), has become a popular
website for biomedical communities around the world. The number
of its annotation datasets, or tracks, continues to grow each year.
In the early stages of Genome Browser development, there
were only a few annotation tracks in its Genes and Gene Prediction
section. The section included a few gene prediction tracks and a
RefSeq Gene track, each having its own limitations. Different gene
prediction programs often produce different results. The NCBI
RefSeq (Pruitt et al., 2005) offers a high-quality gene set, but
because it is produced by an extensive manual curation process,
it is limited in both coverage and timeliness.
In addition, direct links between RefSeq genes and Swiss-Prot
proteins were not available. Hence we decided to develop an
automated process to construct the UCSC Known Genes dataset based
on the latest protein data from Swiss-Prot/TrEMBL (Bairoch et al.,
2005), now also known as UniProt, and the associated mRNA data
from GenBank (Benson et al., 2005).
While there are various definitions of what constitutes a
gene, we chose to limit our gene set to protein-coding genes and
require that each gene be substantiated by at least one transcript (either a
GenBank mRNA or an NCBI RefSeq) and a UniProt protein. We
relied upon UniProt's comprehensive cross-references between the
proteins and their associated GenBank mRNAs to build our initial
candidate gene set. Alternative splicing isoforms are included as
different entries, as long as they are represented by a UniProt protein
and a transcript. The initial candidate gene set is further ranked and
processed to select the best representative protein/mRNA for each
gene, and duplicates with identical CDS structure are removed.
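The duplicate-removal step described above can be illustrated with a minimal Python sketch. It is not the UCSC build code: the Candidate record, its score field and the accessions in the example are hypothetical, and the sketch simply assumes that two candidates are duplicates when they share chromosome, strand and the same list of coding exon intervals.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    """Hypothetical protein/mRNA candidate; field names are illustrative."""
    protein_acc: str   # UniProt accession
    mrna_acc: str      # GenBank mRNA or RefSeq accession
    chrom: str
    strand: str
    cds_exons: tuple   # ((start, end), ...) coding part of each exon
    score: float       # assumed ranking score; higher is better

def cds_key(c):
    """Two candidates with the same key have an identical CDS structure."""
    return (c.chrom, c.strand, c.cds_exons)

def remove_cds_duplicates(candidates):
    """Keep only the highest-scoring candidate for each CDS structure."""
    best = {}
    for c in candidates:
        key = cds_key(c)
        if key not in best or c.score > best[key].score:
            best[key] = c
    return list(best.values())

# Made-up accessions, for illustration only.
candidates = [
    Candidate("P00001", "AB000001", "chr1", "+", ((100, 200), (300, 400)), 0.90),
    Candidate("P00002", "AB000002", "chr1", "+", ((100, 200), (300, 400)), 0.95),
    Candidate("P00001", "AB000003", "chr1", "+", ((100, 200), (350, 400)), 0.80),
]
kept = remove_cds_duplicates(candidates)
```

In this example the first two candidates share a CDS structure, so only the higher-scoring one survives, while the third is kept as a separate isoform because its second coding exon differs.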
The result of this effort is the UCSC Known Genes: a
comprehensive gene set based mostly upon experimental data. The set can
be built automatically in a relatively short time. Since its
introduction in early 2003, the UCSC Known Genes dataset has become available
for several assembly releases of three major genomes: human,
mouse and rat. As shown in Figure 1, the UCSC Known Genes
dataset has also become a central foundation for key genomic and
proteomic applications, such as the UCSC Genome Browser,
Proteome Browser (Hsu et al., 2005), Gene Sorter (Kent et al.,
2005), and Table Browser (Karolchik et al., 2004), offered at
the UCSC bioinformatics web site, genome.ucsc.edu. Extensive
cross-reference links to other gene-related data available on the
web are also compiled and presented for each Known Gene.
Since the start of our effort, several other gene sets besides
RefSeq from NCBI have become available: Ensembl Genes from
EMBL-EBI, the H-Invitational Gene Database (HInv-DB) of JBIRC
and CCDS (the Consensus Coding DNA Sequence) from EBI,
NCBI, UCSC and WTSI. We present a comparison between
UCSC Known Genes and the other gene sets in the ANALYSIS section.
METHODS
Raw protein data files are downloaded from UniProt and parsed to create a
set of structured relational database tables. A cross-reference table between
protein IDs and GenBank mRNA IDs is created from this UniProt data. The
existing GenBank mRNA sequences are aligned with their corresponding
proteins using BLAT to select the best representative mRNAs for each
protein. The resulting protein-mRNA pairs, with their mRNA genomic
alignments and CDS structures, are sorted and filtered to remove redundancy and
invalid short sequences. Finally, RefSeq genes having only DNA evidence,
which escaped the above process, are added to form the final
UCSC Known Genes set. More details of this process are described in this
section.
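As a rough illustration of the selection and filtering described above, the following Python sketch picks one mRNA per protein and discards short sequences. It makes several assumptions not stated in the text: each protein-to-mRNA alignment has already been reduced to a single numeric score, the length cutoff MIN_CDS_BASES is a hypothetical value, and the accessions are made up. It does not run BLAT itself.

```python
from collections import defaultdict

# Hypothetical cutoff; the text does not state the actual threshold here.
MIN_CDS_BASES = 50

def best_mrna_per_protein(alignments):
    """Pick the highest-scoring mRNA for each protein.

    alignments: iterable of (protein_acc, mrna_acc, score, cds_bases),
    where score stands in for a protein-vs-mRNA alignment score.
    """
    by_protein = defaultdict(list)
    for protein_acc, mrna_acc, score, cds_bases in alignments:
        if cds_bases < MIN_CDS_BASES:
            continue  # drop invalid short sequences
        by_protein[protein_acc].append((score, mrna_acc))
    # max() compares scores first; ties fall back to the accession string.
    return {p: max(hits)[1] for p, hits in by_protein.items()}

# Made-up accessions and scores, for illustration only.
alignments = [
    ("P00001", "AB000001", 0.97, 900),
    ("P00001", "AB000002", 0.99, 900),
    ("P00002", "AB000003", 0.95, 30),
]
best = best_mrna_per_protein(alignments)
```

Here the protein P00001 is paired with its better-scoring mRNA, and P00002 drops out entirely because its only mRNA falls below the assumed length cutoff.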
A high-level flowchart of the UCSC Known Genes build process is
shown in Figure 2. The process consists of the following four
subprocesses, as depicted in different colors:
build protein databases (green)
get mRNA alignments (red)
select and prune known genes (blue)
add DNA-based RefSeq (magenta)
Protein databases construction
The Swiss-Prot and TrEMBL database flat files are downloaded from
UniProt at ftp://us.expasy.org/databases/uniprot/knowledgebase/. These files
are parsed into 29 relational tables in the swissProt database.
Two cross-reference tables, spXref2 and spXref3, are created from the
Swiss-Prot/TrEMBL data. Each row of the spXref2 table holds the accession
and display ID of a protein, the name of an external reference database and the
accession number of the corresponding entry in that database. Each row of the
spXref3 table holds the accession and display ID of a protein, its description and
division number and, if available, its HUGO gene symbol and gene description.
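To make the shape of such cross-reference rows concrete, here is a minimal Python sketch that extracts (accession, display ID, external database, external accession) tuples from Swiss-Prot-style flat-file text. It assumes the standard two-letter line codes (ID, AC, DR and the // entry terminator) and is not the parser used in the actual build; the function name parse_xrefs and the sample entry are made up.

```python
def parse_xrefs(flat_text):
    """Return spXref2-like rows from Swiss-Prot-style flat-file text.

    Each row is (accession, display_id, external_db, external_acc).
    Assumes the line code is in columns 1-2 and the data starts at column 6.
    """
    rows = []
    display_id = accession = None
    for line in flat_text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            display_id = value.split()[0]
        elif code == "AC" and accession is None:
            # The first accession listed is taken as the primary one.
            accession = value.split(";")[0].strip()
        elif code == "DR":
            fields = [f.strip() for f in value.rstrip(".").split(";")]
            if len(fields) >= 2:
                rows.append((accession, display_id, fields[0], fields[1]))
        elif code == "//":
            display_id = accession = None  # end of entry
    return rows

# Made-up entry, for illustration only.
sample = "\n".join([
    "ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.",
    "AC   P00000; Q00000;",
    "DR   EMBL; X00000; CAA00000.1; -; mRNA.",
    "DR   InterPro; IPR000000; Example_domain.",
    "//",
])
rows = parse_xrefs(sample)
```

Rows whose external database is EMBL are the ones that would link a protein to its GenBank mRNA accessions in a protein-to-mRNA cross-reference table.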
In addition to Swiss-Prot/TrEMBL and HUGO data, cross-reference data
to other protein databases, e.g. InterPro, Superfamily, NCBI Taxonomy,
Pfam and Ensembl, are also compiled and stored in the proteins database.
Both the swissProt and proteins databases are used during Known Genes
data set construction and at run-time to support the UCSC Genome Browser,
Proteome Browser, Gene Sorter, and Table Browser.
mRNA alignments
The UCSC Genome Browser has a collection of MySQL genome databases,
one for each gen (...truncated)