The UCSC Known Genes

Article PDF: https://bioinformatics.oxfordjournals.org/content/22/9/1036.full.pdf

Fan Hsu (1), W. James Kent (1), Hiram Clawson (1), Robert M. Kuhn (1), Mark Diekhans (1), David Haussler (0)

(0) Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
(1) Center for Biomolecular Science and Engineering, School of Engineering

The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page is provided containing rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for the human, mouse and rat genomes. The Known Genes dataset serves as a foundation for several key programs offered at the UCSC website: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests that all of them could be further improved.

© The Author 2006. Published by Oxford University Press. All rights reserved.

INTRODUCTION

The UCSC Genome Browser (Kent et al., 2002; Karolchik et al., 2003), which was developed in conjunction with the assembly and publication of the first Human Draft Genome (International Human Genome Sequencing Consortium, 2001), has become a popular website for biomedical communities around the world. The number of its annotation datasets, or tracks, continues to grow each year.
During the early stages of Genome Browser development, there were only a few annotation tracks in its Genes and Gene Prediction section. The section included a few gene prediction tracks and a RefSeq Gene track, each with its own limitations. Different gene prediction programs often produce different results. The NCBI RefSeq (Pruitt et al., 2005) offers a high-quality gene set, but because it is produced by an extensive manual curation process, its coverage and timeliness are limited. In addition, direct links between RefSeq genes and Swiss-Prot proteins were not available.

Hence we decided to develop an automated process to construct the UCSC Known Genes dataset based on the latest protein data from Swiss-Prot/TrEMBL (Bairoch et al., 2005), now also known as UniProt, and the associated mRNA data from GenBank (Benson et al., 2005). While there are various definitions of what constitutes a gene, we chose to limit our gene set to protein-coding genes and to require that each gene be substantiated by at least one transcript (either a GenBank mRNA or an NCBI RefSeq) and one UniProt protein. We relied upon UniProt's comprehensive cross-references between the proteins and their associated GenBank mRNAs to build our initial candidate gene set. Alternative splicing isoforms are included as different entries, as long as they are represented by a UniProt protein and a transcript. The initial candidate gene set is then ranked and processed to select the best representative protein/mRNA for each gene, and duplicates with identical CDS structure are removed.

The result of this effort is the UCSC Known Genes: a comprehensive gene set based mostly upon experimental data. The set can be built automatically in a relatively short time. Since its introduction in early 2003, the UCSC Known Genes dataset has been built for several assembly releases of three major genomes: human, mouse and rat.
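The candidate-set rule described above (a gene needs at least one transcript and one UniProt protein, and each isoform is kept as its own entry) can be illustrated with a minimal Python sketch. The data layout, the function name `build_candidates` and the sample accessions are hypothetical; they are not taken from the UCSC build scripts.

```python
from typing import Dict, List, Set, Tuple


def build_candidates(
    protein_to_mrnas: Dict[str, List[str]],
    known_transcripts: Set[str],
) -> List[Tuple[str, str]]:
    """Return (protein, transcript) pairs backed by both kinds of evidence.

    protein_to_mrnas: hypothetical cross-reference map from a protein
        accession to the transcript accessions that the protein record cites.
    known_transcripts: transcript accessions that are actually available
        (GenBank mRNAs or RefSeqs).
    """
    candidates = []
    for protein, mrnas in sorted(protein_to_mrnas.items()):
        for mrna in mrnas:
            # Keep a pair only when the cited transcript is available;
            # a protein without any usable transcript contributes nothing.
            if mrna in known_transcripts:
                candidates.append((protein, mrna))
    return candidates


# Made-up accessions, for illustration only.
xref = {
    "P00001": ["M10001", "M10002"],  # one protein citing two transcripts
    "P00001-2": ["M10003"],          # an isoform, kept as a separate entry
    "Q00002": ["M19999"],            # cited transcript is not available
}
available = {"M10001", "M10002", "M10003"}
pairs = build_candidates(xref, available)
```

In this sketch the protein whose only cited transcript is unavailable drops out, while the isoform survives as a separate candidate. Ranking the candidates and removing duplicates are later steps and are not shown here.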
As shown in Figure 1, the UCSC Known Genes dataset has also become a central foundation for key genomic and proteomic applications, such as the UCSC Genome Browser, Proteome Browser (Hsu et al., 2005), Gene Sorter (Kent et al., 2005) and Table Browser (Karolchik et al., 2004), offered at the UCSC bioinformatics web site, genome.ucsc.edu. Extensive cross-reference links to other gene-related data available on the web are also compiled and presented for each Known Gene.

Since the start of our effort, several other gene sets besides RefSeq from NCBI have become available: Ensembl Genes from EMBL-EBI, the H-Invitational Gene Database (HInv-DB) of JBIRC, and CCDS (the Consensus Coding DNA Sequence) from EBI, NCBI, UCSC and WTSI. We present a comparison between UCSC Known Genes and other gene sets in the ANALYSIS section.

METHODS

Raw protein data files are downloaded from UniProt and parsed to create a set of structured relational database tables. A cross-reference table between protein IDs and GenBank mRNA IDs is created from this UniProt data. The existing GenBank mRNA sequences are aligned with their corresponding proteins using BLAT to select the best representative mRNA for each protein. The resulting protein-mRNA pairs, with their mRNA genomic alignments and CDS structures, are sorted and filtered to remove redundancy and invalid short sequences. Finally, RefSeq genes having only DNA evidence, which are missed by the above process, are added to form the final UCSC Known Genes set. More details of this process are described in this section. A high-level flowchart of the UCSC Known Genes build process is shown in Figure 2.
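The selection and filtering steps outlined above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Candidate` record, the use of a single numeric alignment score, and the `MIN_CDS_LENGTH` cutoff are hypothetical, and the final step of adding DNA-based RefSeq genes is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    protein: str    # protein accession
    mrna: str       # transcript accession
    score: float    # protein-to-mRNA alignment score, higher is better
    chrom: str
    strand: str
    cds_exons: Tuple[Tuple[int, int], ...]  # genomic (start, end) per CDS exon


MIN_CDS_LENGTH = 50  # hypothetical cutoff, not a value from the paper


def cds_length(c: Candidate) -> int:
    """Total number of coding bases across all CDS exons."""
    return sum(end - start for start, end in c.cds_exons)


def select_and_prune(cands: List[Candidate]) -> List[Candidate]:
    # Step 1: keep the best-scoring mRNA for each protein.
    best: Dict[str, Candidate] = {}
    for c in cands:
        cur = best.get(c.protein)
        if cur is None or c.score > cur.score:
            best[c.protein] = c

    # Step 2: drop entries whose CDS is too short to be valid.
    kept = [c for c in best.values() if cds_length(c) >= MIN_CDS_LENGTH]

    # Step 3: among entries with an identical CDS structure,
    # keep only the best-scoring one.
    by_cds: Dict[tuple, Candidate] = {}
    for c in kept:
        key = (c.chrom, c.strand, c.cds_exons)
        cur = by_cds.get(key)
        if cur is None or c.score > cur.score:
            by_cds[key] = c

    # Sort by genomic position for a stable, readable result.
    return sorted(by_cds.values(),
                  key=lambda c: (c.chrom, c.cds_exons[0][0], c.protein))


# Made-up records, for illustration only.
shared_cds = ((100, 200), (300, 400))
candidates = [
    Candidate("P1", "M1", 0.90, "chr1", "+", shared_cds),
    Candidate("P1", "M2", 0.99, "chr1", "+", shared_cds),     # better mRNA for P1
    Candidate("P2", "M3", 0.95, "chr1", "+", shared_cds),     # same CDS as P1/M2
    Candidate("P3", "M4", 0.97, "chr2", "-", ((10, 40),)),    # CDS too short
    Candidate("P4", "M5", 0.80, "chr2", "-", ((500, 800),)),
]
genes = select_and_prune(candidates)
```

Here P1 keeps its better-scoring mRNA, P2 is discarded because its CDS structure duplicates a higher-scoring entry, and P3 is discarded as too short. A real build would rank on richer alignment criteria than the single score assumed in this sketch.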
The process consists of the following four subprocesses, depicted in different colors:

(1) build protein databases (green)
(2) get mRNA alignments (red)
(3) select and prune known genes (blue)
(4) add DNA-based RefSeq (magenta)

Protein databases construction

The Swiss-Prot and TrEMBL database flat files are downloaded from UniProt at ftp://us.expasy.org/databases/uniprot/knowledgebase/. These files are parsed into 29 relational tables in the swissProt database. Two cross-reference tables, spXref2 and spXref3, are created from the Swiss-Prot/TrEMBL data. Each row of spXref2 holds a protein's accession and display ID, an external reference database, and the accession number of the corresponding entry in that database. Each row of spXref3 holds a protein's accession and display ID, its description, its division number and, if available, its HUGO gene symbol and gene description. In addition to Swiss-Prot/TrEMBL and HUGO data, cross-reference data to other protein databases, e.g. InterPro, Superfamily, NCBI Taxonomy, Pfam and Ensembl, are also compiled and stored in the proteins database. Both the swissProt and proteins databases are used during Known Genes dataset construction and at run-time to support the UCSC Genome Browser, Proteome Browser, Gene Sorter and Table Browser.

mRNA alignments

The UCSC Genome Browser has a collection of MySQL genome databases, one for each gen (...truncated)