The Vertebrate Genome Annotation (Vega) database

https://nar.oxfordjournals.org/content/33/suppl_1/D459.full.pdf

J. L. Ashurst, C.-K. Chen, J. G. R. Gilbert, K. Jekosch, S. Keenan, P. Meidl, S. M. Searle, J. Stalker, R. Storey, S. Trevanion, L. Wilming, T. Hubbard

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK

The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. In addition, it is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions.

In 1999 the DNA sequence of chromosome 22, the first human chromosome to be fully sequenced, was published (1). It provided a snapshot of the complexity of genes within a chromosomal landscape and set the standard for manual annotation, which the rest of the community was to follow. Yet as sequencing methods improved and researchers wanted to analyse unfinished, as well as finished, sequence data, new automated annotation methods were established and genome browsers such as Ensembl (2) and the UCSC Genome Browser (3) provided automatic genome annotation for the draft human genome assembly finished in 2001 (4). After the announcement of the finishing of the human genome in 2003, attention turned to producing a gold standard manually curated view of the human gene set.
The Vertebrate Genome Annotation (Vega) database is specifically dedicated to the browsing and maintenance of manually annotated data. Initially designed to view the manual annotation produced by the Havana group at the Sanger Institute (http://www.sanger.ac.uk/HGP/havana/), the project has expanded to include the manual annotation from the major centres (including RIKEN, the Joint Genome Institute, Genoscope and Washington University Genome Sequencing Center) involved in the sequencing and annotation of the human genome. Currently, it contains the annotation for 10 human chromosomes (6, 7, 9, 10, 13, 14, 20, 22, X and Y), but as the public consortium aims to complete the publication of its analysis by the end of 2004, it is planned that Vega will contain the complete manual annotation of the human genome by the beginning of 2005. Manual annotation is currently more accurate at identifying splice variants, pseudogenes, polyadenylation [poly(A)] features, non-coding genes and complex gene arrangements and clusters than automated methods. At the time of writing, the Vega human database contains over 15 000 gene loci and approximately 29 500 transcripts. In addition, Vega contains manual annotation of other vertebrate species and it is possible to view small chromosomal regions, e.g. mouse Del36H (5) and non-contiguous finished clone annotation of zebrafish. Figure 1 represents an overview of the processes and software involved in producing the data shown in Vega.

GENE CLASSIFICATION AND STANDARDIZATION OF ANNOTATION

Since different research groups are performing high-quality manual annotation of different chromosomes, it has been essential to standardize a set of definitions to describe the annotation of different gene features. A common factor is that all annotated gene structures must be supported by transcriptional evidence, either from cDNA, expressed sequence tag (EST) or protein sequences.
The following are the gene indices used in human chromosome 20 annotation (6) and adopted by the Vega database as standard:

(i) Known genes: identical to human cDNA or protein sequences identified by LocusLink ID in the LocusLink database (http://www.ncbi.nlm.nih.gov/LocusLink/).
(ii) Novel genes: have an open reading frame (ORF) and are identical or homologous to known cDNAs (vertebrates) and/or proteins (all species).
(iii) Novel transcripts: similar to novel genes but no ORF can be unambiguously assigned.
(iv) Putative genes: homologous to spliced ESTs (vertebrates) but devoid of significant ORF/CDS.
(v) Pseudogenes: sequences homologous to proteins (over >50% of the subject length) with a disrupted CDS and for which an active gene can generally be found at another locus.

These definitions have also been used in the recent annotation of chromosome 14, together with an additional classification, 'predicted genes'. Genoscope used this new classification to describe a gene based on ab initio predictions for which at least one exon is covered by biological or similarity data (unspliced ESTs, mouse or Tetraodon genomes or expression data from Rosetta) (7). These predicted genes, as well as putative genes, provide targets for experimental validation (8). Immunoglobulin segments and pseudogenes found on chromosomes 22 (1) and 14 (7) have also been given unique tags.

These classifications have been extended across all the species in Vega, with the only exception being that the specific model organism databases, e.g. the Mouse Genome Database (MGD) (http://www.informatics.jax.org/) (9) and the Zebrafish Information Network (ZFIN) nomenclature database (http://zfin.org/zf_info/nomen.html) (10), are used as the point of reference for known genes in place of LocusLink (11). Using correct gene nomenclature is an important method for maintaining consistency in an annotation database, especially when comparing haplotypes or syntenic regions.
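The five standard classes (i) to (v) defined above can be read as a rough decision procedure. The Python sketch below is only an illustration of that reading, not the curation software used for Vega: the `Evidence` fields, the `classify` function and the order in which the tests are applied are all assumptions introduced here, and real manual annotation rests on curator judgement that a few boolean flags cannot capture. The chromosome 14 'predicted genes' class and the immunoglobulin tags are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GeneClass(Enum):
    KNOWN = "known gene"
    NOVEL_GENE = "novel gene"
    NOVEL_TRANSCRIPT = "novel transcript"
    PUTATIVE = "putative gene"
    PSEUDOGENE = "pseudogene"


@dataclass
class Evidence:
    """Hypothetical summary of the evidence gathered for one locus."""
    matches_reference_locus: bool = False    # identical to a cDNA/protein with a reference ID (e.g. LocusLink)
    has_orf: bool = False                    # an ORF can be unambiguously assigned
    cdna_or_protein_homology: bool = False   # identical or homologous to known cDNAs and/or proteins
    spliced_est_homology: bool = False       # homologous to spliced ESTs
    protein_homology_fraction: float = 0.0   # fraction of the subject protein length that is matched
    cds_disrupted: bool = False              # the CDS is disrupted


def classify(ev: Evidence) -> Optional[GeneClass]:
    """Map evidence to one of the five classes; None means no supporting evidence.

    The precedence of the tests is an assumption of this sketch.
    """
    # (v) Pseudogene: protein homology over >50% of the subject length plus a
    # disrupted CDS. The "active gene at another locus" condition holds only
    # "generally" in the definition, so it is not tested here.
    if ev.cds_disrupted and ev.protein_homology_fraction > 0.5:
        return GeneClass.PSEUDOGENE
    # (i) Known gene: matches an entry in the reference gene database.
    if ev.matches_reference_locus:
        return GeneClass.KNOWN
    # (ii) Novel gene if an ORF is present, otherwise (iii) novel transcript.
    if ev.cdna_or_protein_homology:
        return GeneClass.NOVEL_GENE if ev.has_orf else GeneClass.NOVEL_TRANSCRIPT
    # (iv) Putative gene: spliced EST support but no significant ORF/CDS.
    if ev.spliced_est_homology and not ev.has_orf:
        return GeneClass.PUTATIVE
    # Every annotated structure must have transcriptional support.
    return None


if __name__ == "__main__":
    example = Evidence(cdna_or_protein_homology=True, has_orf=True)
    print(classify(example))
```

Returning `None` when no evidence flag is set mirrors the requirement that every annotated structure be supported by transcriptional evidence. Testing for pseudogenes first is a design choice of the sketch, made so that a disrupted CDS is not reported as a novel gene.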
The annotation staff involved in the Vega project, therefore, interact closely with the nomenclature committees from the Human Genome Organisation (HUGO, HGNC) (12), ZFIN and MGD. If an approved symbol is not available for a gene locus, an interim internal identifier is used, which is usually in the format clonename.number, e.g. RP11-694B14.5. The locus and its associated transcripts and exons are also attributed stable, versioned database IDs (e.g. OTTHUMG00000017411, OTTHUMT00000046000), generated and tracked within the Otter database (see Figure 2). Whenever a gene locus is edited, the version number increases and the date of the change is saved, allowing the user to find out when the annotation was last updated. Otter is an extended Ensembl database with an associated client/server system that is able to support interactive updating of annotation (13). The annotation stored in the Otter backend for the Vega database is either curated directly using Otterlace (a Perl/Tk curation interface wrapped around Acedb) or loaded via Otter XML uploads, such as those from external groups. Multiple versions of any genome assembly can be s (...truncated)