Toward a Public Toxicogenomics Capability for Supporting Predictive Toxicology: Survey of Current Resources and Chemical Indexing of Experiments in GEO and ArrayExpress
TOXICOLOGICAL SCIENCES 109(2), 358–371 (2009)
doi:10.1093/toxsci/kfp061
Advance Access publication March 30, 2009
ClarLynda R. Williams-Devane,* Maritja A. Wolf,† and Ann M. Richard‡,1
*U.S. EPA/Office of Research and Development (ORD)/National Health & Environmental Effects Research Laboratory (NHEERL), Research Triangle Park, NC
27519; †Lockheed Martin (Contractor to U.S. EPA), Research Triangle Park, NC 27519; and ‡U.S. EPA/Office of Research and Development (ORD)/National
Center for Computational Toxicology (NCCT), Research Triangle Park, NC 27519
Received January 18, 2009; accepted March 23, 2009
A publicly available toxicogenomics capability for supporting
predictive toxicology and meta-analysis depends on availability of
gene expression data for chemical treatment scenarios, the ability
to locate and aggregate such information by chemical, and broad
data coverage within chemical, genomics, and toxicological
information domains. This capability also depends on common
genomics standards, protocol description, and functional linkages
of diverse public Internet data resources. We present a survey of
public genomics resources from these vantage points and conclude
that, despite progress in many areas, the current state of the
majority of public microarray databases is inadequate for supporting these objectives, particularly with regard to chemical indexing.
To begin to address these inadequacies, we focus chemical
annotation efforts on experimental content contained in the two
primary public genomic resources: ArrayExpress and Gene
Expression Omnibus. Automated scripts and extensive manual
review were employed to transform free-text experiment descriptions into a standardized, chemically indexed inventory of experiments in both resources. These files, which include top-level
summary annotations, allow for identification of current chemical-associated experimental content, as well as chemical-exposure–related (or “Treatment”) content of greatest potential value to
toxicogenomics investigation. With these chemical-index files, it is
possible for the first time to assess the breadth and overlap of
chemical study space represented in these databases, and to begin
to assess the sufficiency of data with shared protocols for chemical
similarity inferences. Chemical indexing of public genomics
databases is a first important step toward integrating chemical,
toxicological and genomics data into predictive toxicology.
Key Words: microarray; chemical; toxicogenomics; toxicity;
prediction.
Disclaimer: This manuscript was approved by the U.S. EPA’s National
Center for Computational Toxicology for publication. However, the contents
do not necessarily reflect the views and policies of the EPA and mention of
trade names or commercial products does not constitute endorsement or
recommendation for use. Each of the authors declares no competing interests
pertaining to the present work.
1 To whom correspondence should be addressed at Mail Drop D343-03, 109
TW Alexander Dr., U.S. Environmental Protection Agency, Research Triangle
Park, NC 27711. Fax: (919) 685-3263. E-mail: .
Conventional toxicology investigates cellular and animal
responses to chemical treatment through domain-specific bioassay studies (e.g., chronic, developmental), typically mapping
a single chemical to a toxicological endpoint. Microarray
technologies, in contrast, detect genome-wide perturbations
resulting from a chemical treatment, and measure response
variables that probe a large number of genes and gene pathways
potentially underlying multiple toxicological endpoints. A
typical toxicogenomics experiment requires that linkages be
established between these technologies, focusing on treatment-related effects of one or a few chemicals and attempting to relate
gene expression changes to a toxicological endpoint (Gomase
et al., 2008; Hamadeh et al., 2002; Hirabayashi and Inoue, 2002).
In silico toxicogenomic meta-analysis methods combine data
across existing toxicological and gene expression experiments to
generate new hypotheses, and confirm existing ones, about the effect of
a compound treatment. Such a capability depends upon the
availability of gene expression data derived from chemical
treatment scenarios, as well as anchoring toxicology data to
support predictive inferences.
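As an illustration of what aggregating experiments by chemical involves, the following is a minimal Python sketch that matches free-text experiment descriptions against a small synonym table and groups accessions under a standardized chemical name. It is not the indexing procedure used in this work; the synonym table, the accession identifiers, and the function names are all hypothetical and chosen only for the example.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: standardized name -> free-text variants.
# A real index would draw on a curated chemical registry.
SYNONYMS = {
    "acetaminophen": ["acetaminophen", "paracetamol", "APAP"],
    "benzo[a]pyrene": ["benzo[a]pyrene", "benzo(a)pyrene", "BaP"],
}


def build_patterns(synonyms):
    """Compile one case-insensitive pattern per standardized chemical name."""
    patterns = {}
    for name, variants in synonyms.items():
        # Longest variants first so the alternation prefers the fullest match.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        # Lookarounds stop a synonym from matching inside a longer token.
        patterns[name] = re.compile(
            r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
    return patterns


def index_by_chemical(experiments, synonyms=SYNONYMS):
    """Map each standardized chemical name to the accessions that mention it."""
    patterns = build_patterns(synonyms)
    index = defaultdict(list)
    for accession, description in experiments.items():
        for name, pattern in patterns.items():
            if pattern.search(description):
                index[name].append(accession)
    return dict(index)


# Hypothetical experiment records (accession -> free-text description).
EXPERIMENTS = {
    "EXP-001": "Rat liver after paracetamol overdose, 24 h",
    "EXP-002": "Mouse lung exposed to benzo(a)pyrene",
    "EXP-003": "Hepatocytes treated with APAP or vehicle",
    "EXP-004": "Time course of yeast sporulation",
}

if __name__ == "__main__":
    for chemical, accessions in index_by_chemical(EXPERIMENTS).items():
        print(chemical, accessions)
```

Simple synonym matching of this kind misses misspellings and ambiguous abbreviations, so it can only be a first pass ahead of manual review.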
The chemical nature of the problem requires a standardized,
chemical-centric view of data at all levels. Hence, a publicly
available toxicogenomics capability sufficiently robust for
mechanistic inferences and building predictive models requires
not only common data standards, protocols, and the ability to
query and aggregate common data types across resources, but
also broad data coverage within, and linkages across chemical,
genomics and toxicological information domains. These
requirements have, to varying degrees, informed development
of the major public microarray databases, and have been the
central design principle of specialized toxicogenomic resources
(Waters et al., 2008). In recent years, there have also been
significant advances in promoting toxicology standards and
data models (i.e., controlled vocabulary and hierarchical data
organization), quantitative high-throughput screening, and
chemically indexed bioassay data that, taken as a whole, have
the potential to greatly enhance toxicogenomics capabilities in
the public domain (Dix et al., 2007; Martin et al., 2009;
Richard et al., 2008; Yang et al., 2006a, 2006b).
Published by Oxford University Press 2009.
In the genomics field, the two largest public resources for
deposition of microarray data, approved by the Microarray
Gene Expression Data (MGED) Society (http://www.mged.org/), are the European Bioinformatics Institute’s (EBI)
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) and the
National Center for Biotechnology Information’s (NCBI) Gene
Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo).
Publishing requirements for the deposition of raw or processed
microarray data into these database repositories, coupled with
MIAME (Minimum Information About a Microarray Experiment) standards for data reporting, are increasing the comparability, utility and breadth of these resources (Ball et al., 2004).
Enhanced external programmatic access to the major public
microarray data repositories also allows third parties to
automatically extract and reformulate data to enhance informatics and data mining capabilities (Boyle, 2005; Ivliev et al., 2008;
Zhu et al., 2008). Additional public efforts are aimed at
standardizing the description of experimental protocols (Taylor
et al., 2008), as well as improving toxicity data standards in
relation to toxicogenomics experiments (Burgoon, 2007; Fostel,
2008; Fostel et al., 2005, 2007). Largely neglected in the
genomics field, ho (...truncated)