CEBS object model for systems biology data, SysBio-OM

Sandhya Xirasagar [2], Scott Gustafson [2], B. Alex Merrick [1], Kenneth B. Tomer [1], Stanley Stasiewicz [1], Denny D. Chan [2], Kenneth J. Yost III [2], John R. Yates III [0], Susan Sumner [3], Nianqing Xiao [2], Michael D. Waters [1]

[0] Department of Cell Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
[1] NIEHS, National Center for Toxicogenomics, P.O. Box 12233, Research Triangle Park, NC 27709, USA
[2] Science Applications International Corporation, 20201 Century Building, 3rd Floor, Germantown, MD 20874, USA
[3] Paradigm Genetics, Inc., 108 TW Alexander Drive, P.O. Box 14528, Research Triangle Park, NC 27709, USA

Motivation: To promote a systems biology approach to understanding the biological effects of environmental stressors, the Chemical Effects in Biological Systems (CEBS) knowledge base is being developed to house data from multiple complex data streams in a systems-friendly manner that will accommodate extensive querying from users. Unified data representation via a single object model will greatly aid in integrating data storage and management, and facilitate reuse of software to analyze and display data resulting from diverse differential expression or differential profile technologies. Data streams include, but are not limited to, gene expression analysis (transcriptomics), protein expression and protein-protein interaction analysis (proteomics) and changes in low molecular weight metabolite levels (metabolomics).

Results: To enable the integration of microarray gene expression, proteomics and metabolomics data in the CEBS system, we designed an object model, Systems Biology Object Model (SysBio-OM). The model is comprehensive and leverages other open source efforts, namely the MicroArray Gene Expression Object Model (MAGE-OM) and the Proteomics Experiment Data Repository (PEDRo) object model.
SysBio-OM was designed by extending MAGE-OM to represent protein expression data elements (including those from PEDRo), protein-protein interaction data and metabolomics data. SysBio-OM promotes the standardization of data representation and data quality by facilitating the capture of the minimum annotation required for an experiment. Such standardization improves the accuracy of data mining and interpretation. The open source SysBio-OM model, which can be implemented on varied computing platforms, is presented here.

Availability: A Unified Modeling Language (UML) depiction of the entire SysBio-OM is available at http://cebs.niehs.nih.gov/SysBioOM/. The Rational Rose object model package is distributed under an open source license that permits unrestricted academic and commercial use and is available at http://cebs.niehs.nih.gov/cebsdownloads. The database and interface are being built to implement the model and will be available for public use at http://cebs.niehs.nih.gov.

Contact:

INTRODUCTION

Current research trends emphasize the need to integrate data from studies monitoring changes in expression of genes, proteins and metabolites as a consequence of perturbing biological systems (Ideker et al., 2001; Waters et al., 2003). Comparisons of gene, protein and metabolite data will be invaluable in promoting a global understanding of how biological systems function and respond to environmental stressors (Amin et al., 2002; Witzmann and Grant, 2003; Lindon et al., 2003). The Chemical Effects in Biological Systems (CEBS) toxicogenomics knowledge base is designed to integrate data resulting from these three disciplines in addition to conventional toxicology data (Waters et al., 2003). A standard representation of data types within each discipline is an important prerequisite for efficient and accurate storage, access, analysis, comparison and data exchange.
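
As a rough illustration of what a unified representation across the three disciplines could look like, the sketch below defines one shared base class with discipline-specific subclasses and a check for missing annotation. Every name in it (`Experiment`, `Measurement`, `REQUIRED_ANNOTATIONS`, the annotation keys and the subclass names) is a hypothetical placeholder chosen for this example; none of them is taken from SysBio-OM, MAGE-OM or PEDRo, and the required-annotation list is an assumption, not the MIAME/Tox checklist.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed, illustrative annotation keys; NOT the actual MIAME/Tox requirements.
REQUIRED_ANNOTATIONS = ("organism", "stressor", "dose", "time_point")


@dataclass
class Measurement:
    """One quantitative observation for a named molecular feature."""
    feature_id: str  # a gene, protein or metabolite identifier
    value: float     # expression level, abundance, ratio, etc.


@dataclass
class Experiment:
    """Hypothetical shared base class: all data streams share one interface."""
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)
    measurements: List[Measurement] = field(default_factory=list)

    # Plain class attribute (not a dataclass field); overridden by subclasses.
    discipline = "generic"

    def missing_annotations(self) -> List[str]:
        """Return the required annotation keys that are absent or empty."""
        return [k for k in REQUIRED_ANNOTATIONS if not self.annotations.get(k)]


class TranscriptomicsExperiment(Experiment):
    discipline = "transcriptomics"


class ProteomicsExperiment(Experiment):
    discipline = "proteomics"


class MetabolomicsExperiment(Experiment):
    discipline = "metabolomics"


def summarize(experiments: List[Experiment]) -> Dict[str, Tuple[str, int, List[str]]]:
    """Generic code that works on any discipline through the shared base class."""
    return {
        e.name: (e.discipline, len(e.measurements), e.missing_annotations())
        for e in experiments
    }


if __name__ == "__main__":
    study = ProteomicsExperiment(
        name="example-proteomics-run",
        annotations={"organism": "rat", "stressor": "example compound"},
    )
    study.measurements.append(Measurement("example-protein-1", 2.5))
    print(summarize([study]))
```

The point of the sketch is only that analysis and display code such as `summarize` can be written once against the base class and reused for every data stream, while incomplete annotation is detected in one place.
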
Furthermore, valid conclusions are possible only if the data is sufficiently well annotated with contextual information regarding its origin. These observations have led to the development of a draft Minimum Information About Microarray Experiment (MIAME)/Tox guideline (http://www.mged.org/MIAME1.1-DenverDraft.DOC) and the formation of a Microarray Gene Expression Database (MGED) Toxicogenomics Working Group that addresses minimum information about a microarray experiment in the realm of toxicogenomics. Encapsulation of toxicogenomics data access and representation within a common object model will greatly facilitate software reuse and rapid application development. The Systems Biology Object Model (SysBio-OM) was developed around these two major concerns. Below, we illustrate the experimental scenarios in transcriptomics, proteomics and metabolomics disciplines that helped formalize the requirements of the integrated model design. The model can also be extended to accommodate additional data streams from these or additional disciplines.

Transcriptomics

High-throughput platforms such as microarrays (DeRisi et al., 1996, 1997; Chu et al., 1998; Hughes et al., 2000) and Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995) have evolved to monitor gene expression patterns in biological samples. The research objective is often focused on quantitative comparisons of gene expression between control states and states induced by specific chemicals, diseases, environments or therapeutic treatments. The general workflow of a typical DNA microarray experiment consists of hybridizing fluorescently labeled mRNA from control and experimental samples to DNA microarray chips on which thousands of gene-specific reporter sequences are arrayed.
Images of the array are then acquired and analyzed to obtain the ratio of control and experimental signal intensities, which enables the inference of relative fold changes in gene expression caused by the experimental treatment or disease state.

The basic concept of SAGE rests on two principles: first, a small sequence of nucleotides from the transcript, called a tag, can effectively identify the original transcript from which it came, and second, these tags can be linked, enabling rapid sequence analysis of multiple transcripts. The number of times that a specific transcript is identified in a given sample is used as an indicator of the level of gene expression corresponding to the transcript.

Proteomics

Proteomics research involves measurement of changing protein expression profiles as affected by chemical toxicity, disease state, environmental insult or therapeutic treatment. The plethora of proteomics platforms reflects the choice of measuring specific attributes of proteins including protein identity (amino acid sequence), mass, charge, post-translational modifications, protein-protein interactions, subcellular location and biological activity. Currently, most proteomics laboratories employ combinations of proteomics platforms since no single technique can measure all protein characteristics in a comprehensive manner (Figeys, 2003). Consequently, several scenarios for proteomics experiments can be envisioned, as illustrated below, reflecting the complex wo (...truncated)