WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction

https://nar.oxfordjournals.org/content/28/1/123.full.pdf

Ross Overbeek [1], Niels Larsen [1], Gordon D. Pusch [1], Mark D'Souza [0,1], Evgeni Selkov Jr [0,1], Nikos Kyrpides [1], Michael Fonstein [1], Natalia Maltsev [0], Evgeni Selkov [0,1]

[0] Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
[1] Integrated Genomics Inc., 2201 W. Campbell Park Drive, Chicago, IL 60612, USA

The WIT (What Is There) system (http://wit.mcs.anl.gov/WIT2/) has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.

Starting with Haemophilus influenzae (1) in 1995, over 20 microbial organisms have had their total genomic DNA sequenced, and almost 100 other genome projects have been started, as shown in the GOLD database (2). The human genome project is also progressing impressively (3,4). In response to this growing amount of sequence data, computational tools for genome analysis have been developed and merged into shared analytical environments, such as GeneQuiz (5), KEGG (6), Pedant (7) and Entrez Genomes (8), moving cross-genome analysis to a new level. The development of analytical systems, together with the growth of sequencing data, has increased gene recognition rates from <50% (9,10) to >70% (11,12). Today, the remaining 30% or so of genes, the so-called hypothetical or orphan genes, separate us from a complete description of the genomic content and functions of an organism.
Computational approaches based on clustering of potential genes, whether in phylogenetic space, as in clusters of orthologous genes (COGs) (13), or by position on the chromosome, as in operons (14), increase the gene assignment level even further. An important stage of genome analysis is the integration of gene assignments into an organism-specific overview via so-called functional reconstruction (15), which is the conceptual assembly of metabolic pathways, transport units and signal transduction pathways. Functional reconstruction allows inconsistencies between different types of analysis to be reconciled, and often results in changes to initial gene function assignments that were based on similarity scoring. The WIT system discussed in this paper implements a genome analysis strategy for a multi-genome environment: it combines a variety of tools that deal with individual open reading frames (ORFs) or proteins with the ability to derive general conclusions. Using the WIT genome analysis system, a major part of the central metabolism of an organism can be reconstructed entirely in silico (16).

WIT: A VIEW TO A GENOME

The current version of the WIT system is available at Argonne National Laboratory (http://wit.mcs.anl.gov/WIT2) or at Integrated Genomics Inc. (http://wit.IntegratedGenomics.com/IGwit) and contains 43 complete or nearly complete genomes (Table 1). These genomes comprise 123 482 predicted ORFs, of which 78 144 could be given functional assignments and 41 742 could be assembled into metabolic pathways drawn from the EMP/MPW database (15). Pathways involved in the metabolism of carbohydrates and amino acids are connected into schematic overviews that show the user the substrates and final products linking metabolic modules. To incorporate a genome into WIT, the gene-finding program CRITICA (17) can be used.
Potential coding regions recognized in the DNA contigs are subjected to a FASTA search against the non-redundant database of assigned genes and loaded into the WIT system, together with the precomputed tables of best hits. WIT provides a set of tools for the characterization of gene structures and functions, such as Functional Coupling or Preserved Operons. WIT also provides integrated WWW access to tools such as PSI-BLAST, PROSITE, ProDom, COG, ClustalW and others. Functional content may be queried, for example, by looking for specific functions missing from the metabolic pathways, or by separating alternative gene functions derived from similarities found for a putative gene.

Table 1.
Saccharomyces cerevisiae, Caenorhabditis elegans
Sulfolobus solfataricus, Archaeoglobus fulgidus, Halobacterium sp., M.thermoautotrophicum, M.jannaschii, Pyrococcus furiosus, Pyrococcus horikoshii
A.aeolicus, C.trachomatis, Synechocystis sp., P.gingivalis, M.leprae, M.tuberculosis, B.subtilis, C.acetobutylicum, E.faecalis, M.genitalium, M.pneumoniae, S.pneumoniae, S.pyogenes, Rhizobium sp., R.capsulatus, S.aromaticivorans, N.gonorrhoeae, N.meningitidis, C.jejuni, H.pylori, E.coli, Y.pestis, H.influenzae, P.aeruginosa, B.burgdorferi, T.pallidum, D.radiodurans
Additional Genomes on the public server at Integrated Genomics Inc.

After genes have been assigned initial functions, they are attached to pathways by choosing templates from the metabolic database (MPW) that best incorporate all observed functions. For any given organism, this usually leads to identification of functional sub-systems, which serve as a model for further refinement. For example, it then becomes possible to identify inconsistencies and potentially missing enzymes/ORFs, thereby refining the model. When a basic model has been created, a curator finally evaluates this model against biochemical data and phenotypes known from the literature. The models come in both textual and graphical representations, fully linked with all underlying data.
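The template-selection step described above can be illustrated with a minimal sketch. It assumes that each pathway variant is represented simply as a set of required EC numbers and that "best incorporates all observed functions" is approximated by the fraction of a template's steps that have an assigned ORF. The names Template, score and reconstruct, the min_score threshold and the toy EC sets are all hypothetical illustrations; they are not WIT's actual implementation or the contents of MPW.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical pathway variant: a name plus the EC numbers it requires."""
    pathway: str
    variant: str
    required_ecs: frozenset


def score(template, observed_ecs):
    """Fraction of the template's steps that have an assigned ORF."""
    covered = template.required_ecs & observed_ecs
    return len(covered) / len(template.required_ecs)


def reconstruct(templates, observed_ecs, min_score=0.5):
    """For each pathway, keep the variant that best covers the observed functions.

    Returns {pathway: (variant, score, sorted list of missing EC numbers)}.
    min_score is an assumed cutoff below which a pathway is not asserted.
    """
    best = {}
    for t in templates:
        s = score(t, observed_ecs)
        if s < min_score:
            continue
        current = best.get(t.pathway)
        if current is None or s > current[1]:
            # Steps with no assigned ORF are candidates for curation.
            missing = sorted(t.required_ecs - observed_ecs)
            best[t.pathway] = (t.variant, s, missing)
    return best


# Toy templates (illustrative only): two variants of the upper part of
# glycolysis that differ in the phosphofructokinase step.
TEMPLATES = [
    Template("upper glycolysis", "ATP-PFK",
             frozenset({"2.7.1.2", "5.3.1.9", "2.7.1.11", "4.1.2.13", "5.3.1.1"})),
    Template("upper glycolysis", "PPi-PFK",
             frozenset({"2.7.1.2", "5.3.1.9", "2.7.1.90", "4.1.2.13", "5.3.1.1"})),
]

# EC numbers assumed to have been assigned to ORFs of one organism.
OBSERVED = {"2.7.1.2", "5.3.1.9", "2.7.1.90", "5.3.1.1"}

model = reconstruct(TEMPLATES, OBSERVED)
for pathway, (variant, s, missing) in model.items():
    print(f"{pathway}: variant={variant} coverage={s:.2f} missing={missing}")
```

With this toy input the PPi-PFK variant is selected, because four of its five steps are covered, and EC 4.1.2.13 is reported as a missing function. In the workflow described in the text, such a gap would be a candidate for a missing enzyme/ORF that a curator then checks against biochemical data.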
We call this whole process metabolic reconstruction, and the main role of the WIT system is to support this effort. To examine or curate a functional model of an organism, one can use functions such as: Compare assignments, Summary of asserted functions and pathways, Examine trimmed ortholog clusters, Examine COG/trimmed ortholog cluster relationships, Search for pathways by regular expression, Search ORF functions by regular expression, Search ORF sequences by similarity search, Find NCBIs MEDLINE-references by EC-number, Search EMP by EC-number, and Find common proteins for organisms. Chromosomal clustering of functionally related genes (14) is another powerful component of the system, which recently allowed us to propose a number of candidate ORFs for orphan metabolic functions. Continuous integration of newly sequenced genomes deepens the functional description through an iterative process.

GAPPED GENOMES IN WIT

An important feature of the WIT system is its emphasis on incomplete or gapped genomes. Algorithms used for gene assignments depend on the size of a dataset used to cluster (...truncated)