Toward a Public Toxicogenomics Capability for Supporting Predictive Toxicology: Survey of Current Resources and Chemical Indexing of Experiments in GEO and ArrayExpress
ClarLynda R. Williams-Devane (2, 3), Maritja A. Wolf (0, 2), Ann M. Richard (1, 2)
0: Lockheed Martin (Contractor to U.S. EPA), Research Triangle Park, NC 27519
1: U.S. EPA/Office of Research and Development (ORD)/National Center for Computational Toxicology (NCCT), Research Triangle Park, NC 27519
2: Disclaimer: This manuscript was approved by the U.S. EPA's National Center for Computational Toxicology for publication. However, the contents do not necessarily reflect the views and policies of the EPA and mention of trade names or commercial products does not constitute endorsement or recommendation for use. Each of the authors declares no competing interests pertaining to the present work. TW Alexander Dr., U.S. Environmental Protection Agency, Research Triangle Park, NC 27711. Fax: (919) 685-3263
3: U.S. EPA/Office of Research and Development (ORD)/National Health & Environmental Effects Research Laboratory (NHEERL), Research Triangle Park, NC 27519
A publicly available toxicogenomics capability for supporting predictive toxicology and meta-analysis depends on availability of gene expression data for chemical treatment scenarios, the ability to locate and aggregate such information by chemical, and broad data coverage within chemical, genomics, and toxicological information domains. This capability also depends on common genomics standards, protocol description, and functional linkages of diverse public Internet data resources. We present a survey of public genomics resources from these vantage points and conclude that, despite progress in many areas, the current state of the majority of public microarray databases is inadequate for supporting these objectives, particularly with regard to chemical indexing. To begin to address these inadequacies, we focus chemical annotation efforts on experimental content contained in the two primary public genomic resources: ArrayExpress and Gene Expression Omnibus. Automated scripts and extensive manual review were employed to transform free-text experiment descriptions into a standardized, chemically indexed inventory of experiments in both resources. These files, which include top-level summary annotations, allow for identification of current chemical-associated experimental content, as well as chemical-exposure-related (or "Treatment") content of greatest potential value to toxicogenomics investigation. With these chemical-index files, it is possible for the first time to assess the breadth and overlap of chemical study space represented in these databases, and to begin to assess the sufficiency of data with shared protocols for chemical similarity inferences. Chemical indexing of public genomics databases is a first important step toward integrating chemical, toxicological and genomics data into predictive toxicology.
Conventional toxicology investigates cellular and animal
responses to chemical treatment through domain-specific
bioassay studies (e.g., chronic, developmental), typically mapping
a single chemical to a toxicological endpoint. Microarray
technologies, in contrast, detect genome-wide perturbations
resulting from a chemical treatment, and measure response
variables that probe a large number of genes and gene pathways
potentially underlying multiple toxicological endpoints. A
typical toxicogenomics experiment requires that linkages be
established between these technologies, focusing on
treatment-related effects of one or a few chemicals and attempting to relate
gene expression changes to a toxicological endpoint (Gomase
et al., 2008; Hamadeh et al., 2002; Hirabayashi and Inoue, 2002).
In silico toxicogenomic meta-analysis methods combine data
across existing toxicological and gene expression experiments to
generate new hypotheses, and to confirm existing ones, about the effect of
a compound treatment. Such a capability depends upon the
availability of gene expression data derived from chemical
treatment scenarios, as well as anchoring toxicology data to
support predictive inferences.
The chemical nature of the problem requires a standardized,
chemical-centric view of data at all levels. Hence, a publicly
available toxicogenomics capability sufficiently robust for
mechanistic inferences and building predictive models requires
not only common data standards, protocols, and the ability to
query and aggregate common data types across resources, but
also broad data coverage within, and linkages across chemical,
genomics and toxicological information domains. These
requirements have, to varying degrees, informed development
of the major public microarray databases, and have been the
central design principle of specialized toxicogenomic resources
(Waters et al., 2008). In recent years, there have also been
significant advances in promoting toxicology standards and
data models (i.e., controlled vocabulary and hierarchical data
organization), quantitative high-throughput screening, and
chemically indexed bioassay data that, taken as a whole, have
the potential to greatly enhance toxicogenomics capabilities in
the public domain (Dix et al., 2007; Martin et al., 2009;
Richard et al., 2008; Yang et al., 2006a, 2006b).
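The chemical-centric aggregation described above can be illustrated with a minimal sketch. It is not the authors' procedure, which combined automated scripts with extensive manual review. It assumes a small hand-made synonym table (`SYNONYMS`) and a few toy experiment records (`EXPERIMENTS`) whose accession IDs and descriptions are invented placeholders, and the helper names `find_chemicals`, `build_index`, and `shared_chemicals` are hypothetical. The sketch maps free-text descriptions to standardized chemical names, groups experiments by chemical and resource, and lists the chemicals represented in both resources.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: lower-case free-text name -> standardized name.
SYNONYMS = {
    "acetaminophen": "Acetaminophen",
    "paracetamol": "Acetaminophen",
    "apap": "Acetaminophen",
    "bisphenol a": "Bisphenol A",
    "bpa": "Bisphenol A",
}

# Toy records (resource, accession, description); all values are invented
# placeholders, not real GEO or ArrayExpress entries.
EXPERIMENTS = [
    ("GEO", "GEO-EX-1", "Rat liver after paracetamol treatment"),
    ("GEO", "GEO-EX-2", "Time course of BPA exposure in MCF-7 cells"),
    ("ArrayExpress", "AE-EX-1", "Acetaminophen dose response, mouse liver"),
    ("ArrayExpress", "AE-EX-2", "Tissue atlas of untreated mouse"),
]


def find_chemicals(description):
    """Return the set of standardized chemical names mentioned in the text."""
    text = description.lower()
    found = set()
    for synonym, standard in SYNONYMS.items():
        # Whole-word match so that short synonyms do not hit inside other words.
        if re.search(r"\b" + re.escape(synonym) + r"\b", text):
            found.add(standard)
    return found


def build_index(experiments):
    """Build a chemical -> resource -> [accessions] index."""
    index = defaultdict(lambda: defaultdict(list))
    for resource, accession, description in experiments:
        for chemical in find_chemicals(description):
            index[chemical][resource].append(accession)
    return index


def shared_chemicals(index, a="GEO", b="ArrayExpress"):
    """List chemicals with at least one experiment in both resources."""
    return sorted(c for c, by_res in index.items() if a in by_res and b in by_res)


index = build_index(EXPERIMENTS)
for chemical in sorted(index):
    print(chemical, dict(index[chemical]))
print("Shared:", shared_chemicals(index))
```

Dictionary lookup of this kind misses misspellings, abbreviations outside the table, and chemicals mentioned only as vehicles or controls, which is one reason manual review remains necessary in practice.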
In the genomics field, the two largest public resources for
deposition of microarray data, approved by the Microarray
Gene Expression Data (MGED) Society (http://www.mged.org/),
are the European Bioinformatics Institute's (EBI)
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) and the
National Center for Biotechnology Information's (NCBI) Gene
Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo).
Publishing requirements for the deposition of raw or processed
microarray data into these database repositories, coupled with
MIAME (Minimum Information About a Microarray
Experiment) standards for data reporting, are increasing the
comparability, utility and breadth of these resources (Ball et al., 2004).
Enhanced external programmatic access to the major public
microarray data repositories also allows third parties to
automatically extract and reformulate data to enhance
informatics and data mining capabilities (Boyle, 2005; Ivliev et al., 2008;
Zhu et al., 2008). Additional public efforts are aimed at
standardizing the description of experimental protocols (Taylor
et al., 2008), as well as improving toxicity data standards in
relation to toxicogenomics experiments (Burgoon, 2007; Fostel,
2008; Fostel et al., 2005, 2007). Largely neglected in the
genomics field, however, has been the standardization of
chemical information associated with the experimental data
when chemical treatment is a primary objective of the
experiment. Such annotation is essential for systematically
relating chemical property and effects information, irrespective
of whether the study has an explicit toxicological focus, across
the diverse data domains potentially contributing to
toxicogenomics. Furthermore, the ability to query, relate, and aggregate
information by chemical and across chemical space is essential
to the goal of chemical screening and toxicity asse (...truncated)