CEBS object model for systems biology data, SysBio-OM
Sandhya Xirasagar
2
Scott Gustafson
2
B. Alex Merrick
1
Kenneth B. Tomer
1
Stanley Stasiewicz
1
Denny D. Chan
2
Kenneth J. Yost III
2
John R. Yates III
0
Susan Sumner
3
Nianqing Xiao
2
Michael D. Waters
1
0
Department of Cell Biology, The Scripps Research Institute
, 10550 North Torrey Pines Road,
La Jolla, CA 92037, USA
1
NIEHS,
National Center for Toxicogenomics
, P.O. Box 12233,
Research Triangle Park
,
NC 27709, USA
2
Science Applications International Corporation
, 20201 Century Building, 3rd Floor, Germantown,
MD 20874, USA
3
Paradigm Genetics
, Inc., 108 TW Alexander Drive, P.O. Box 14528,
Research Triangle Park
,
NC 27709, USA
Motivation: To promote a systems biology approach to understanding the biological effects of environmental stressors, the Chemical Effects in Biological Systems (CEBS) knowledge base is being developed to house data from multiple complex data streams in a systems-friendly manner that will accommodate extensive querying from users. Unified data representation via a single object model will greatly aid in integrating data storage and management, and will facilitate reuse of software to analyze and display data resulting from diverse differential expression or differential profile technologies. Data streams include, but are not limited to, gene expression analysis (transcriptomics), protein expression and protein-protein interaction analysis (proteomics) and changes in low molecular weight metabolite levels (metabolomics).
Results: To enable the integration of microarray gene expression, proteomics and metabolomics data in the CEBS system, we designed an object model, the Systems Biology Object Model (SysBio-OM). The model is comprehensive and leverages other open source efforts, namely the MicroArray Gene Expression Object Model (MAGE-OM) and the Proteomics Experiment Data Repository (PEDRo) object model. SysBio-OM was designed by extending MAGE-OM to represent protein expression data elements (including those from PEDRo), protein-protein interaction data and metabolomics data. SysBio-OM promotes the standardization of data representation and data quality by facilitating the capture of the minimum annotation required for an experiment. Such standardization refines the accuracy of data mining and interpretation. The
open source SysBio-OM model, which can be implemented
on varied computing platforms, is presented here.
Availability: A Unified Modeling Language (UML) depiction of the
entire SysBio-OM is available at http://cebs.niehs.nih.gov/SysBioOM/.
The Rational Rose object model package is
distributed under an open source license that permits
unrestricted academic and commercial use and is available at
http://cebs.niehs.nih.gov/cebsdownloads. The database and
interface are being built to implement the model and will be
available for public use at http://cebs.niehs.nih.gov.
Contact:
INTRODUCTION
Current research trends emphasize the need to integrate data
from studies monitoring changes in expression of genes,
proteins and metabolites as a consequence of perturbing
biological systems (Ideker et al., 2001; Waters et al., 2003).
Comparisons of gene, protein and metabolite data will be
invaluable in promoting a global understanding of how
biological systems function and respond to environmental
stressors (Amin et al., 2002; Witzmann and Grant, 2003;
Lindon et al., 2003). The Chemical Effects in Biological
Systems (CEBS) toxicogenomics knowledge base is designed to
integrate data resulting from these three disciplines in
addition to conventional toxicology data (Waters et al., 2003).
A standard representation of data types within each
discipline is an important prerequisite for efficient and accurate
storage, access, analysis, comparison and data exchange.
Furthermore, valid conclusions are possible only if the data
is sufficiently well annotated with contextual information
regarding its origin. These observations have led to the
development of a draft Minimum Information About a Microarray
Experiment (MIAME)/Tox guideline
(http://www.mged.org/MIAME1.1-DenverDraft.DOC) and the formation of a
Microarray Gene Expression Database (MGED)
Toxicogenomics Working Group that
addresses minimum information about a microarray
experiment in the realm of toxicogenomics. Encapsulation of
toxicogenomics data access and representation within a
common object model will greatly facilitate software reuse and
rapid application development. The Systems Biology Object
Model (SysBio-OM) was developed around these two major
concerns. Below, we illustrate the experimental scenarios
in transcriptomics, proteomics and metabolomics disciplines
that helped formalize the requirements of the integrated model
design. The model can also be extended to accommodate
additional data streams from these or additional disciplines.
Transcriptomics
High-throughput platforms such as microarrays (DeRisi et al.,
1996, 1997; Chu et al., 1998; Hughes et al., 2000) and Serial
Analysis of Gene Expression (SAGE) (Velculescu et al., 1995)
have evolved to monitor gene expression patterns in biological
samples. The research objective is often focused on
quantitative comparisons of gene expression between control states and
states induced by specific chemicals, diseases, environments
or therapeutic treatments. The general workflow of a typical
DNA microarray experiment consists of hybridizing
fluorescently labeled mRNA from control and experimental samples
to DNA microarray chips on which thousands of gene-specific
reporter sequences are arrayed. Images of the array are
then acquired and analyzed to obtain the ratio of control to
experimental signal intensities, which enables the inference of
relative fold changes in gene expression exerted by the
experimental treatment or disease state. The basic concept of SAGE
rests on two principles: first, a small sequence of nucleotides
from the transcript, called a tag, can effectively identify the
original transcript from which it came, and second, these tags
can be linked together, enabling rapid sequence analysis of multiple
transcripts. The number of times that a specific transcript is
identified in a given sample is used as an indicator of the level
of gene expression corresponding to the transcript.
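The two measurements described above reduce to simple computations, sketched here in Python for illustration only. This is a minimal sketch and not part of SysBio-OM or CEBS: the function names, the reporter and transcript identifiers, the example intensities and the tag sequences are all hypothetical, and expressing the ratio as log2(experimental/control) is an assumed convention rather than one stated in the text.

```python
import math
from collections import Counter


def log2_fold_changes(control, experimental):
    """Return log2(experimental / control) for each reporter.

    Both arguments map a reporter id to a background-corrected signal
    intensity. Reporters that are missing from either sample, or that
    have a non-positive signal, are skipped because the ratio is undefined.
    """
    changes = {}
    for reporter, control_signal in control.items():
        experimental_signal = experimental.get(reporter)
        if experimental_signal is None:
            continue
        if control_signal <= 0 or experimental_signal <= 0:
            continue
        changes[reporter] = math.log2(experimental_signal / control_signal)
    return changes


def sage_expression_levels(observed_tags, tag_to_transcript):
    """Count how often each transcript is identified in a SAGE library.

    observed_tags is the list of tags sequenced from one sample, and
    tag_to_transcript maps a tag to the transcript it identifies.
    Tags with no known transcript are ignored.
    """
    counts = Counter()
    for tag in observed_tags:
        transcript = tag_to_transcript.get(tag)
        if transcript is not None:
            counts[transcript] += 1
    return counts


# Hypothetical microarray intensities for two reporters.
control = {"reporter_A": 500.0, "reporter_B": 1200.0}
experimental = {"reporter_A": 2000.0, "reporter_B": 300.0}
fold_changes = log2_fold_changes(control, experimental)

# Hypothetical SAGE tags and tag-to-transcript assignments.
tag_map = {"ACGTACGTAC": "transcript_1", "TTGACCATGA": "transcript_2"}
tags = ["ACGTACGTAC", "TTGACCATGA", "ACGTACGTAC", "GGGGGGGGGG"]
levels = sage_expression_levels(tags, tag_map)
```

In this example reporter_A has a log2 fold change of 2 (a fourfold increase) and reporter_B of -2 (a fourfold decrease), while transcript_1 is counted twice and transcript_2 once. An integrated model has to store both kinds of value, a relative ratio per reporter and an absolute count per transcript, together with the annotation that makes them comparable.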
Proteomics
Proteomics research involves measurement of changing
protein expression profiles as affected by chemical toxicity,
disease state, environmental insult or therapeutic treatment. The
plethora of proteomics platforms reflects the choice of
measuring specific attributes of proteins including protein identity
(amino acid sequence), mass, charge, post-translational
modifications, protein-protein interactions, subcellular location
and biological activity. Currently, most proteomics
laboratories employ combinations of proteomics platforms since no
single technique can measure all protein characteristics in a
comprehensive manner (Figeys, 2003). Consequently, several
scenarios for proteomics experiments can be envisioned, as
illustrated below, reflecting the complex wo (...truncated)