CEBS object model for systems biology data, SysBio-OM
Vol. 20 no. 13 2004, pages 2004–2015
doi:10.1093/bioinformatics/bth189
BIOINFORMATICS
Sandhya Xirasagar1,†,∗, Scott Gustafson1,†, B. Alex Merrick2,
Kenneth B. Tomer2, Stanley Stasiewicz2, Denny D. Chan1,
Kenneth J. Yost III1, John R. Yates III3, Susan Sumner4,
Nianqing Xiao1 and Michael D. Waters2
1 Science Applications International Corporation, 20201 Century Building, 3rd Floor,
Germantown, MD 20874, USA, 2 NIEHS, National Center for Toxicogenomics,
P.O. Box 12233, Research Triangle Park, NC 27709, USA, 3 Department of Cell Biology,
The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037,
USA and 4 Paradigm Genetics, Inc., 108 TW Alexander Drive, P.O. Box 14528,
Research Triangle Park, NC 27709, USA
Received on September 24, 2003; revised on January 14, 2004; accepted on February 25, 2004
Advance Access publication March 25, 2004
ABSTRACT
Motivation: To promote a systems biology approach
to understanding the biological effects of environmental
stressors, the Chemical Effects in Biological Systems (CEBS)
knowledge base is being developed to house data from multiple complex data streams in a systems-friendly manner that
will accommodate extensive querying from users. Unified data
representation via a single object model will greatly aid in integrating data storage and management, and facilitate reuse of
software to analyze and display data resulting from diverse
differential expression or differential profile technologies. Data
streams include, but are not limited to, gene expression analysis (transcriptomics), protein expression and protein–protein
interaction analysis (proteomics) and changes in low molecular
weight metabolite levels (metabolomics).
Results: To enable the integration of microarray gene expression, proteomics and metabolomics data in the CEBS
system, we designed an object model, Systems Biology
Object Model (SysBio-OM). The model is comprehensive
and leverages other open source efforts, namely the MicroArray Gene Expression Object Model (MAGE-OM) and the
Proteomics Experiment Data Repository (PEDRo) object
model. SysBio-OM is designed by extending MAGE-OM to represent protein expression data elements (including those from
PEDRo), protein–protein interaction and metabolomics data.
SysBio-OM promotes the standardization of data representation and data quality by facilitating the capture of the minimum
annotation required for an experiment. Such standardization
refines the accuracy of data mining and interpretation. The
open source SysBio-OM model, which can be implemented
on varied computing platforms, is presented here.
Availability: A Unified Modeling Language (UML) depiction of the
entire SysBio-OM is available at http://cebs.niehs.nih.gov/SysBioOM/.
The Rational Rose object model package is
distributed under an open source license that permits unrestricted academic and commercial use and is available at
http://cebs.niehs.nih.gov/cebsdownloads. The database and
interface are being built to implement the model and will be
available for public use at http://cebs.niehs.nih.gov.
Contact:
∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, these two authors
should be regarded as joint First Authors.
INTRODUCTION
Current research trends emphasize the need to integrate data
from studies monitoring changes in expression of genes,
proteins and metabolites as a consequence of perturbing biological systems (Ideker et al., 2001; Waters et al., 2003).
Comparisons of gene, protein and metabolite data will be
invaluable in promoting a global understanding of how
biological systems function and respond to environmental
stressors (Amin et al., 2002; Witzmann and Grant, 2003;
Lindon et al., 2003). The Chemical Effects in Biological Systems (CEBS) toxicogenomics knowledge base is designed to
integrate data resulting from these three disciplines in addition to conventional toxicology data (Waters et al., 2003).
A standard representation of data types within each discipline is an important prerequisite for efficient and accurate
storage, access, analysis, comparison and data exchange.
Furthermore, valid conclusions are possible only if the data
is sufficiently well annotated with contextual information
regarding its origin. These observations have led to the development of a draft Minimum Information About a Microarray
Experiment (MIAME)/Tox guideline (http://www.mged.org/
Bioinformatics 20(13) © Oxford University Press 2004; all rights reserved.
Transcriptomics
High-throughput platforms such as microarrays (DeRisi et al.,
1996, 1997; Chu et al., 1998; Hughes et al., 2000) and Serial
Analysis of Gene Expression (SAGE) (Velculescu et al., 1995)
have evolved to monitor gene expression patterns in biological
samples. The research objective is often focused on quantitative comparisons of gene expression between control states and
states induced by specific chemicals, diseases, environments
or therapeutic treatments. The general workflow of a typical
DNA microarray experiment consists of hybridizing fluorescently labeled mRNA from control and experimental samples
to DNA microarray chips on which thousands of gene-specific
reporter sequences are arrayed. Images of the array are
then acquired and analyzed to obtain the ratio of experimental to
control signal intensities, from which the relative fold change in
expression of each gene caused by the experimental treatment or disease state is inferred. The basic concept of SAGE
rests on two principles: first, a short sequence of nucleotides
from the transcript, called a ‘tag’, can effectively identify the
transcript from which it came; and second, these tags
can be linked together, enabling rapid sequence analysis of multiple
transcripts. The number of times that a specific transcript is
identified in a given sample is used as an indicator of the level
of gene expression corresponding to the transcript.
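As a minimal sketch of the two quantities described above, the following Python fragment computes a log2 ratio of experimental to control signal intensity for each reporter and counts the occurrences of each SAGE tag in a sample. The gene names, intensities and tag sequences are hypothetical values chosen for illustration, and the function names are assumptions that are not part of SysBio-OM or MAGE-OM. Steps such as background correction and normalization, which a real analysis would require, are omitted.

```python
import math
from collections import Counter


def log2_fold_change(experimental, control):
    """Return log2(experimental / control) for one reporter.

    Positive values indicate higher signal in the experimental sample,
    negative values indicate higher signal in the control sample.
    """
    if experimental <= 0 or control <= 0:
        raise ValueError("signal intensities must be positive")
    return math.log2(experimental / control)


def sage_tag_counts(tags):
    """Count how many times each SAGE tag was observed in a sample."""
    return Counter(tags)


# Hypothetical (experimental, control) intensities per reporter.
signals = {
    "geneA": (1200.0, 300.0),
    "geneB": (250.0, 1000.0),
    "geneC": (500.0, 500.0),
}
fold_changes = {
    gene: log2_fold_change(exp, ctl) for gene, (exp, ctl) in signals.items()
}

# Hypothetical tag sequences read from one sample.
observed_tags = [
    "AAACCCGGGT",
    "TTTGGGCCCA",
    "AAACCCGGGT",
    "AAACCCGGGT",
    "TTTGGGCCCA",
    "GATTACAGAT",
]
counts = sage_tag_counts(observed_tags)

for gene, value in fold_changes.items():
    print(gene, value)
for tag, n in counts.most_common():
    print(tag, n)
```

Working on a log2 scale makes a fourfold increase and a fourfold decrease symmetric about zero, which is one common convention for reporting microarray ratios. The tag count serves only as a relative indicator of transcript abundance within the sample.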
Proteomics
Proteomics research involves measurement of changing protein expression profiles as affected by chemical toxicity, disease state, environmental insult or therapeutic treatment. The
plethora of proteomics platforms reflects the choice of measuring specific attributes of proteins including protein identity
(amino acid sequence), mass, charge, post-translational modifications, protein–protein interactions, subcellular location
and biological activity. Currently, most proteomics laboratories employ combinations of proteomics platforms since no
single technique can measure all protein characteristics in a
comprehensive manner (Figeys, 2003). Consequently, several
scenarios for proteomics experiments can be envisioned, as
illustrated below, reflecting the complex workflow in these
experiments.
Depending on the cellular compartment of interest, procedures for organelle separation (Rappsilber et al., 2002;
Galeva and Altermann, 2002) may be used prior to p (...truncated)