WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction
Ross Overbeek [1], Niels Larsen [1], Gordon D. Pusch [1], Mark D'Souza [0,1], Evgeni Selkov Jr [0,1], Nikos Kyrpides [1], Michael Fonstein [1], Natalia Maltsev [0], Evgeni Selkov [0,1]
[0] Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
[1] Integrated Genomics Inc., 2201 W. Campbell Park Drive, Chicago, IL 60612, USA
The WIT (What Is There) system (http://wit.mcs.anl.gov/WIT2/) has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. The system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for the development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.
Since the sequencing of Haemophilus influenzae (1) in 1995, more than 20
microbial organisms have had their total genomic DNA
sequenced, and sequencing of almost 100 others has been started, as shown in
the GOLD database (2). We are also currently observing impressive
progress in the human genome project (3,4). In response
to this growing amount of sequence data, computational tools
for genome analysis have been developed and merged into
shared analytical environments, such as GeneQuiz (5), KEGG
(6), Pedant (7) and Entrez Genomes (8), moving cross-genome
analysis to a new level. The development of analytical
systems, together with the growth of sequencing data, has
increased gene recognition rates from <50% (9,10) to >70%
(11,12). Today, the remaining 30%, the so-called hypothetical
or orphan genes, separate us from a complete description of
the genomic content and functions of an organism.
Computational approaches based on various types of clustering
of potential genes, whether in phylogenetic space, as in clusters of
orthologous genes (COGs) (13), or by position on the chromosome,
as in operons (14), increase the gene assignment level
even further. An important stage of genome analysis is the
integration of gene assignments into an organism-specific
overview via so-called functional reconstruction (15), which is
the conceptual assembly of metabolic pathways, transport
units and signal transduction pathways. It allows reconciliation
of inconsistencies between different types of analysis, and
often results in changes of initial gene function assignments
based on similarity scoring.
The WIT system, discussed in this paper, represents the
development of a genome analysis strategy in a multi-genome
environment, which combines a variety of tools, dealing with
individual open reading frames (ORFs) or proteins, with the
ability to derive general conclusions. Using the WIT genome
analysis system, a major part of the central metabolism of an
organism can be reconstructed entirely in silico (16).
WIT: A VIEW TO A GENOME
The current version of the WIT system is available at Argonne
National Laboratory (http://wit.mcs.anl.gov/WIT2) or at Integrated
Genomics Inc. (http://wit.IntegratedGenomics.com/IGwit) and
contains 43 complete or nearly complete genomes (Table 1).
These genomes comprise 123 482 predicted ORFs, of which
78 144 could be given functional assignments and 41 742
could be assembled into metabolic pathways drawn from the
EMP/MPW database (15). Pathways involved in the metabolism
of carbohydrates and amino acids are connected into schematic
overviews, allowing the user to see the substrates and final products
that connect metabolic modules.
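
The proportions implied by these counts can be checked with a few lines of Python. This is only a worked-arithmetic sketch: the three counts are taken from the text, while the variable and function names are illustrative assumptions.

```python
# Counts quoted in the text for the current WIT release.
total_orfs = 123_482     # predicted ORFs across all genomes
with_function = 78_144   # ORFs given a functional assignment
in_pathways = 41_742     # ORFs assembled into metabolic pathways


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


assigned_pct = share(with_function, total_orfs)           # share of all ORFs with a function
pathway_pct = share(in_pathways, total_orfs)              # share of all ORFs placed in pathways
pathway_of_assigned_pct = share(in_pathways, with_function)  # share of assigned ORFs in pathways

print(assigned_pct, pathway_pct, pathway_of_assigned_pct)
```

In other words, roughly 63% of the predicted ORFs carry a functional assignment, and about 34% of all ORFs (a little over half of the assigned ones) are placed in a pathway.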
To incorporate a genome into WIT, a gene-finding
program such as CRITICA (17) can be used to identify potential
coding regions. The coding regions recognized in the DNA contigs
are subjected to a FASTA search against the non-redundant database
of assigned genes and loaded into the WIT system, together with the
precomputed tables of best hits.
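
As an illustration of what a precomputed table of best hits might look like, the sketch below reduces a list of similarity hits to one best hit per ORF. It is not the WIT loader: the three-column tab-separated input (ORF identifier, database entry, E-value), the identifiers and the function name `best_hits` are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, NamedTuple


class Hit(NamedTuple):
    orf_id: str
    subject_id: str
    evalue: float


def best_hits(tabular_text: str) -> Dict[str, Hit]:
    """Keep, for every ORF, the hit with the lowest E-value."""
    best: Dict[str, Hit] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        hit = Hit(row[0], row[1], float(row[2]))
        current = best.get(hit.orf_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.orf_id] = hit
    return best


# Hypothetical search output: orf_id, subject_id, E-value.
EXAMPLE = (
    "# orf_id\tsubject_id\tevalue\n"
    "orf001\tprotA\t1e-40\n"
    "orf001\tprotB\t1e-75\n"
    "orf002\tprotC\t3e-12\n"
)

table = best_hits(EXAMPLE)
for orf_id, hit in sorted(table.items()):
    print(orf_id, hit.subject_id, hit.evalue)
```

A real pipeline would probably also store scores and alignment coordinates and apply an E-value cut-off; those details are omitted here.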
WIT provides a set of tools for the characterization of gene
structures and functions, such as Functional Coupling, or
Preserved Operons. WIT also provides integrated WWW
access to such tools as PSI-BLAST, PROSITE, ProDom,
COG, ClustalW and others. Functional content may be
queried, for example, by looking for specific functions missing
in the metabolic pathways, or by separating alternative gene
functions derived from similarities found for a putative gene.
Table 1. Genomes in the WIT system
Eukarya: Saccharomyces cerevisiae, Caenorhabditis elegans
Archaea: Sulfolobus solfataricus, Archaeoglobus fulgidus, Halobacterium sp., M.thermoautotrophicum, M.jannaschii, Pyrococcus furiosus, Pyrococcus horikoshii
Bacteria: A.aeolicus, C.trachomatis, Synechocystis sp., P.gingivalis, M.leprae, M.tuberculosis, B.subtilis, C.acetobutylicum, E.faecalis, M.genitalium, M.pneumoniae, S.pneumoniae, S.pyogenes, Rhizobium sp., R.capsulatus, S.aromaticivorans, N.gonorrhoeae, N.meningitidis, C.jejuni, H.pylori, E.coli, Y.pestis, H.influenzae, P.aeruginosa, B.burgdorferi, T.pallidum, D.radiodurans
Additional Genomes on the public server at Integrated Genomics Inc.

After genes have been assigned initial functions, they are
attached to pathways by choosing the templates from the metabolic
database (MPW) that best incorporate all observed functions. For
any given organism, this usually leads to the identification of
functional sub-systems, which serve as a model for further refinement.
For example, it then becomes possible to identify inconsistencies and
potentially missing enzymes/ORFs, and thereby to refine the model. Once a
basic model has been created, a curator evaluates this
model against biochemical data and phenotypes known from
the literature. The models come in both textual and graphical
representations, fully linked with all underlying data. We call
this whole process metabolic reconstruction, and the main role
of the WIT system is to support this effort.
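
The idea of choosing the template that best incorporates the observed functions, and then listing what is still missing, can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure implemented in WIT: the template names, the coverage score (the fraction of required EC numbers observed) and the function `rank_templates` are hypothetical, and the two templates are toy variants of the first steps of glycolysis.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical pathway templates: name -> EC numbers the template requires.
TEMPLATES: Dict[str, FrozenSet[str]] = {
    # hexokinase, glucose-6-phosphate isomerase,
    # 6-phosphofructokinase, fructose-bisphosphate aldolase
    "glycolysis_upper_variant_a": frozenset(
        {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"}
    ),
    # same steps, with glucokinase in place of hexokinase
    "glycolysis_upper_variant_b": frozenset(
        {"2.7.1.2", "5.3.1.9", "2.7.1.11", "4.1.2.13"}
    ),
}


def rank_templates(
    observed: Set[str], templates: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, float, FrozenSet[str]]]:
    """Return (name, coverage, missing ECs) tuples, best coverage first."""
    ranked = []
    for name, required in templates.items():
        missing = required - observed
        coverage = 1.0 - len(missing) / len(required)
        ranked.append((name, coverage, missing))
    # Sort by descending coverage, then by name for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# EC numbers asserted for the ORFs of a hypothetical organism.
observed_ecs = {"2.7.1.2", "5.3.1.9", "4.1.2.13"}

ranking = rank_templates(observed_ecs, TEMPLATES)
best_name, best_coverage, best_missing = ranking[0]
print(best_name, best_coverage, sorted(best_missing))
```

The EC numbers reported as missing for the best template are the natural starting point for the refinement step described above: either an ORF has been overlooked or misassigned, or the organism uses a different variant of the pathway.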
To examine or curate a functional model of an organism, one
can use functions such as: Compare assignments, Summary of
asserted functions and pathways, Examine trimmed ortholog
clusters, Examine COG/trimmed ortholog cluster relationships,
Search for pathways by regular expression, Search ORF functions
by regular expression, Search ORF sequences by similarity
search, Find NCBI MEDLINE references by EC-number,
Search EMP by EC-number, and Find common proteins for
organisms. Chromosomal clustering of functionally related
genes (14) is another powerful component of the system,
which recently allowed us to propose a number of candidate
ORFs for orphan metabolic functions. Continuous integration of
newly sequenced genomes iteratively increases the depth of
functional description.
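
To make the idea of chromosomal clustering concrete, the sketch below looks for pairs of gene families that lie close together on the chromosome in more than one genome. It is a deliberately simplified stand-in for the method of reference (14), not a reimplementation of it: genomes are reduced to ordered lists of hypothetical family labels, strand and circularity are ignored, and the names `conserved_neighbor_pairs`, `window` and `min_genomes` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def conserved_neighbor_pairs(
    genomes: Dict[str, List[str]], window: int = 2, min_genomes: int = 2
) -> Dict[Tuple[str, ...], Set[str]]:
    """Find pairs of gene families that occur within `window` genes of each
    other in at least `min_genomes` genomes.

    `genomes` maps a genome name to its gene families in chromosomal order.
    """
    support: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for genome, order in genomes.items():
        for i, family in enumerate(order):
            # Look at the next `window` genes downstream of position i.
            for other in order[i + 1 : i + 1 + window]:
                if other != family:
                    pair = tuple(sorted((family, other)))
                    support[pair].add(genome)
    return {
        pair: names for pair, names in support.items() if len(names) >= min_genomes
    }


# Hypothetical gene orders; labels stand for ortholog families.
genomes = {
    "genome_A": ["famX", "fam1", "fam2", "famY", "fam3"],
    "genome_B": ["fam2", "fam1", "famZ", "famW", "famQ", "fam3"],
    "genome_C": ["fam3", "famY", "famV", "famU", "fam1"],
}

pairs = conserved_neighbor_pairs(genomes, window=2, min_genomes=2)
for pair, names in sorted(pairs.items()):
    print(pair, sorted(names))
```

If one member of a conserved pair has a known metabolic function and the other is a hypothetical gene, the pairing suggests a candidate for a related, still unassigned function, which is the kind of inference described in the text.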
GAPPED GENOMES IN WIT
An important feature of the WIT system is its emphasis on
incomplete or gapped genomes. Algorithms used for gene
assignments depend on the size of a dataset used to cluster
(...truncated)