The Vertebrate Genome Annotation (Vega) database
J. L. Ashurst, C.-K. Chen, J. G. R. Gilbert, K. Jekosch, S. Keenan, P. Meidl, S. M. Searle, J. Stalker, R. Storey, S. Trevanion, L. Wilming and T. Hubbard
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. It is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across all vertebrate genomes, which should aid comparative analysis of orthologues across the different finished regions.
In 1999 the DNA sequence of chromosome 22, the first human
chromosome to be fully sequenced, was published (1). It
provided a snapshot of the complexity of genes within a
chromosomal landscape and set the standard for manual annotation,
which the rest of the community was to follow. Yet as
sequencing methods improved and researchers wanted to analyse
unfinished, as well as finished, sequence data, new automated
annotation methods were established and genome browsers
such as Ensembl (2) and the UCSC Genome Browser (3)
provided automatic genome annotation for the draft human
genome assembly published in 2001 (4). After the
announcement of the finishing of the human genome in 2003, attention
turned to producing a gold standard manually curated view of
the human gene set.
The Vertebrate Genome Annotation (Vega) database is
specifically dedicated to the browsing and maintenance of
manually annotated data. Initially designed to view the manual
annotation produced by the Havana group at the Sanger
Institute (http://www.sanger.ac.uk/HGP/havana/), the project has
expanded to include the manual annotation from the major
centres (including RIKEN, the Joint Genome Institute,
Genoscope and Washington University Genome Sequencing Center)
involved in the sequencing and annotation of the human
genome. Currently, it contains the annotation for 10 human
chromosomes (6, 7, 9, 10, 13, 14, 20, 22, X and Y), but as the public
consortium aims to complete the publication of its analysis by
the end of 2004, it is planned that Vega will contain the
complete manual annotation of the human genome by the beginning
of 2005. Manual annotation is currently more accurate at
identifying splice variants, pseudogenes, polyadenylation [poly(A)]
features, non-coding genes and complex gene arrangements
and clusters than automated methods. At the time of writing,
the Vega human database contains over 15 000 gene loci and
approximately 29 500 transcripts. In addition, Vega contains
manual annotation of other vertebrate species and it is possible
to view small chromosomal regions, e.g. mouse Del36H (5) and
non-contiguous finished clone annotation of zebrafish. Figure 1
represents an overview of the processes and software involved
in producing the data shown in Vega.
GENE CLASSIFICATION AND STANDARDIZATION OF ANNOTATION
Since different research groups are performing high-quality
manual annotation of different chromosomes, it has been
essential to standardize a set of definitions to describe the
annotation of different gene features. A common factor is
that all annotated gene structures must be supported by
transcriptional evidence, either from cDNA, expressed sequence
tag (EST) or protein sequences. The following are the gene
indices used in human chromosome 20 annotation (6) and
adopted by the Vega database as standard:
(i) Known genes: identical to human cDNA or protein
sequences identified by LocusLink ID in the LocusLink
database (http://www.ncbi.nlm.nih.gov/LocusLink/).
(ii) Novel genes: have an open reading frame (ORF) and are
identical or homologous to known cDNAs (vertebrates)
and/or proteins (all species).
(iii) Novel transcripts: similar to novel genes but no ORF can
be unambiguously assigned.
(iv) Putative genes: homologous to spliced ESTs (vertebrates)
but devoid of significant ORF/CDS.
(v) Pseudogenes: sequences homologous to proteins (over
50% of the subject length) with a disrupted CDS and
for which an active gene can generally be found at another
locus.
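The five categories above can be read as a decision procedure. The following Python sketch is only an illustration of that reading, not the annotators' actual workflow: the `Evidence` fields, the `classify` function and the order in which the tests are applied are all assumptions introduced here, and real curation involves manual judgement that a few boolean flags cannot capture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of the evidence supporting one gene structure."""
    matches_locuslink_entry: bool = False        # identical to a cDNA/protein with a LocusLink ID
    has_orf: bool = False                        # an ORF can be unambiguously assigned
    matches_known_cdna_or_protein: bool = False  # identical/homologous to known cDNAs or proteins
    matches_spliced_est: bool = False            # homologous to spliced vertebrate ESTs
    protein_homology_fraction: float = 0.0       # fraction of subject protein length covered
    cds_disrupted: bool = False                  # CDS contains disruptions


def classify(ev: Evidence) -> Optional[str]:
    """Map evidence to one of the five categories (test order is an assumption)."""
    # (v) Pseudogene: protein homology over 50% of subject length, disrupted CDS.
    if ev.cds_disrupted and ev.protein_homology_fraction > 0.5:
        return "pseudogene"
    # (i) Known gene: identified by a LocusLink ID.
    if ev.matches_locuslink_entry:
        return "known"
    # (ii)/(iii) Novel gene if an ORF is assigned, otherwise novel transcript.
    if ev.matches_known_cdna_or_protein:
        return "novel gene" if ev.has_orf else "novel transcript"
    # (iv) Putative gene: spliced EST support but no significant ORF/CDS.
    if ev.matches_spliced_est and not ev.has_orf:
        return "putative"
    # No transcriptional evidence: the structure would not be annotated.
    return None
```

For example, `classify(Evidence(matches_known_cdna_or_protein=True))` falls through to the novel transcript category because no ORF is flagged.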
These definitions have also been used in the recent annotation
of chromosome 14, together with an additional classification:
predicted genes. Genoscope used this new classification to
describe a gene based on ab initio predictions for which at least
one exon is covered by biological or similarity data (unspliced
ESTs, mouse or Tetraodon genomes or expression data from
Rosetta) (7). These predicted genes as well as putative genes
provide targets for experimental validation (8).
Immunoglobulin segments and pseudogenes found on chromosomes 22
(1) and 14 (7) have also been given unique tags. These
classifications have been extended across all the species in Vega
with the only exception being that the specific model organism
databases, e.g. the Mouse Genome Database (MGD)
(http://www.informatics.jax.org/) (9) and the Zebrafish Information
Network (ZFIN) nomenclature database
(http://zfin.org/zf_info/nomen.html) (10), are used as the point of reference
for known genes in place of LocusLink (11).
Using correct gene nomenclature is an important method for
maintaining consistency in an annotation database, especially
when comparing haplotypes or syntenic regions. The
annotation staff involved in the Vega project, therefore, interact
closely with the nomenclature committees from the Human
Genome Organisation (HUGO, HGNC) (12), ZFIN and MGD.
If an approved symbol is not available for a gene locus, an
interim internal identifier is used, which is usually in the
format clonename.number, e.g. RP11-694B14.5.
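As a minimal sketch of this naming rule, the Python below prefers an approved symbol and otherwise falls back to a clonename.number identifier. The function names `locus_label` and `parse_interim_id` and the regular expression are hypothetical, written for illustration only; because the text says the format is only "usually" clonename.number, the parser returns `None` for anything that does not fit.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: everything up to the last dot is the clone name,
# and the trailing digits are the per-clone gene number.
INTERIM_ID = re.compile(r"^(?P<clone>.+)\.(?P<number>\d+)$")


def parse_interim_id(identifier: str) -> Optional[Tuple[str, int]]:
    """Split an interim identifier into (clone name, number), or return None."""
    match = INTERIM_ID.match(identifier)
    if match is None:
        return None
    return match.group("clone"), int(match.group("number"))


def locus_label(approved_symbol: Optional[str], clone_name: str, number: int) -> str:
    """Use the approved symbol when one exists, else the interim identifier."""
    if approved_symbol:
        return approved_symbol
    return f"{clone_name}.{number}"
```

With the example from the text, `parse_interim_id("RP11-694B14.5")` separates the clone name `RP11-694B14` from the number `5`, and `locus_label(None, "RP11-694B14", 5)` rebuilds the same identifier.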
The locus and its associated transcripts and exons
are also attributed stable, versioned database IDs
(e.g. OTTHUMG00000017411, OTTHUMT00000046000),
generated and tracked within the Otter database (see Figure
2). Whenever a gene locus is edited the version number will
increase and the date of the change will be saved, allowing the
user to find out when the annotation was last updated. Otter is
an extended Ensembl database with an associated client/server
system that is able to support interactive updating of
annotation (13). The annotation stored in the Otter backend for the
Vega database is either curated directly using Otterlace (a Perl/Tk
curation interface wrapped around Acedb) or via Otter
XML uploads, such as from external groups. Multiple versions
of any genome assembly can be s (...truncated)