Toward a Public Toxicogenomics Capability for Supporting Predictive Toxicology: Survey of Current Resources and Chemical Indexing of Experiments in GEO and ArrayExpress

TOXICOLOGICAL SCIENCES 109(2), 358–371 (2009)
doi:10.1093/toxsci/kfp061
Advance Access publication March 30, 2009

ClarLynda R. Williams-Devane,* Maritja A. Wolf,† and Ann M. Richard‡,1

*U.S. EPA/Office of Research and Development (ORD)/National Health & Environmental Effects Research Laboratory (NHEERL), Research Triangle Park, NC 27519
†Lockheed Martin (Contractor to U.S. EPA), Research Triangle Park, NC 27519
‡U.S. EPA/Office of Research and Development (ORD)/National Center for Computational Toxicology (NCCT), Research Triangle Park, NC 27519

Received January 18, 2009; accepted March 23, 2009

A publicly available toxicogenomics capability for supporting predictive toxicology and meta-analysis depends on availability of gene expression data for chemical treatment scenarios, the ability to locate and aggregate such information by chemical, and broad data coverage within chemical, genomics, and toxicological information domains. This capability also depends on common genomics standards, protocol description, and functional linkages of diverse public Internet data resources. We present a survey of public genomics resources from these vantage points and conclude that, despite progress in many areas, the current state of the majority of public microarray databases is inadequate for supporting these objectives, particularly with regard to chemical indexing. To begin to address these inadequacies, we focus chemical annotation efforts on experimental content contained in the two primary public genomic resources: ArrayExpress and Gene Expression Omnibus. Automated scripts and extensive manual review were employed to transform free-text experiment descriptions into a standardized, chemically indexed inventory of experiments in both resources.
These files, which include top-level summary annotations, allow for identification of current chemical-associated experimental content, as well as chemical-exposure-related (or “Treatment”) content of greatest potential value to toxicogenomics investigation. With these chemical-index files, it is possible for the first time to assess the breadth and overlap of chemical study space represented in these databases, and to begin to assess the sufficiency of data with shared protocols for chemical similarity inferences. Chemical indexing of public genomics databases is a first important step toward integrating chemical, toxicological and genomics data into predictive toxicology.

Key Words: microarray; chemical; toxicogenomics; toxicity; prediction.

Disclaimer: This manuscript was approved by the U.S. EPA’s National Center for Computational Toxicology for publication. However, the contents do not necessarily reflect the views and policies of the EPA and mention of trade names or commercial products does not constitute endorsement or recommendation for use. Each of the authors declares no competing interests pertaining to the present work.

1 To whom correspondence should be addressed at Mail Drop D343-03, 109 TW Alexander Dr., U.S. Environmental Protection Agency, Research Triangle Park, NC 27711. Fax: (919) 685-3263. E-mail: .

Conventional toxicology investigates cellular and animal responses to chemical treatment through domain-specific bioassay studies (e.g., chronic, developmental), typically mapping a single chemical to a toxicological endpoint. Microarray technologies, in contrast, detect genome-wide perturbations resulting from a chemical treatment, and measure response variables that probe a large number of genes and gene pathways potentially underlying multiple toxicological endpoints.
A typical toxicogenomics experiment requires that linkages be established between these technologies, focusing on treatment-related effects of one or a few chemicals and attempting to relate gene expression changes to a toxicological endpoint (Gomase et al., 2008; Hamadeh et al., 2002; Hirabayashi and Inoue, 2002). In silico toxicogenomic meta-analysis methods combine data across existing toxicological and gene expression experiments to generate new hypotheses, and to confirm existing ones, about the effect of a compound treatment. Such a capability depends upon the availability of gene expression data derived from chemical treatment scenarios, as well as anchoring toxicology data to support predictive inferences. The chemical nature of the problem requires a standardized, chemical-centric view of data at all levels. Hence, a publicly available toxicogenomics capability sufficiently robust for mechanistic inferences and building predictive models requires not only common data standards, protocols, and the ability to query and aggregate common data types across resources, but also broad data coverage within, and linkages across, chemical, genomics and toxicological information domains. These requirements have, to varying degrees, informed development of the major public microarray databases, and have been the central design principle of specialized toxicogenomic resources (Waters et al., 2008). In recent years, there have also been significant advances in promoting toxicology standards and data models (i.e., controlled vocabulary and hierarchical data organization), quantitative high-throughput screening, and chemically indexed bioassay data that, taken as a whole, have the potential to greatly enhance toxicogenomics capabilities in the public domain (Dix et al., 2007; Martin et al., 2009; Richard et al., 2008; Yang et al., 2006a, 2006b).

Published by Oxford University Press 2009.
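To make the idea of a standardized, chemical-centric view concrete, the following is a minimal Python sketch of how free-text experiment descriptions might be mapped to standardized chemical names and then aggregated by chemical across two resources. It is an illustration only, not the authors' scripts: the synonym table, the accession strings, the descriptions, and the function names (find_chemicals, build_index, shared_chemicals) are all hypothetical, and simple dictionary matching of this kind would still need the manual review the abstract describes.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: lower-case free-text name -> standardized name.
SYNONYMS = {
    "acetaminophen": "acetaminophen",
    "paracetamol": "acetaminophen",
    "apap": "acetaminophen",
    "bisphenol a": "bisphenol A",
    "bpa": "bisphenol A",
}

# Hypothetical records: (resource, accession, free-text description).
# The accessions are placeholders, not real GEO or ArrayExpress identifiers.
RECORDS = [
    ("GEO", "GSE-EXAMPLE-1", "Rat liver after acetaminophen (APAP) treatment, 24 h"),
    ("GEO", "GSE-EXAMPLE-2", "Time course of MCF-7 cells exposed to bisphenol A"),
    ("ArrayExpress", "E-EXAMPLE-1", "Mouse hepatocytes treated with paracetamol"),
    ("ArrayExpress", "E-EXAMPLE-2", "Wild-type versus knockout mouse strain comparison"),
]


def find_chemicals(text):
    """Return the standardized names of all known chemicals mentioned in text."""
    lowered = text.lower()
    found = set()
    for name, standard in SYNONYMS.items():
        # Lookarounds require the synonym to stand alone as a whole token.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(standard)
    return found


def build_index(records):
    """Map standardized chemical -> resource -> set of experiment accessions."""
    index = defaultdict(lambda: defaultdict(set))
    for resource, accession, description in records:
        for chemical in find_chemicals(description):
            index[chemical][resource].add(accession)
    return index


def shared_chemicals(index, resource_a, resource_b):
    """List chemicals with at least one experiment in both resources."""
    return sorted(
        chemical
        for chemical, by_resource in index.items()
        if by_resource.get(resource_a) and by_resource.get(resource_b)
    )


index = build_index(RECORDS)
for chemical in sorted(index):
    for resource in sorted(index[chemical]):
        print(chemical, resource, sorted(index[chemical][resource]))
print("Shared:", shared_chemicals(index, "GEO", "ArrayExpress"))
```

Keying the index on a standardized name rather than on the text as deposited is what lets experiments described as "paracetamol" and "APAP" be aggregated together, and it makes the overlap between resources a simple lookup.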
In the genomics field, the two largest public resources for deposition of microarray data, approved by the Microarray Gene Expression Data (MGED) Society (http://www.mged.org/), are the European Bioinformatics Institute’s (EBI) ArrayExpress (http://www.ebi.ac.uk/arrayexpress) and the National Center for Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo). Publishing requirements for the deposition of raw or processed microarray data into these database repositories, coupled with MIAME (Minimum Information About a Microarray Experiment) standards for data reporting, are increasing the comparability, utility and breadth of these resources (Ball et al., 2004). Enhanced external programmatic access to the major public microarray data repositories also allows third parties to automatically extract and reformulate data to enhance informatics and data mining capabilities (Boyle, 2005; Ivliev et al., 2008; Zhu et al., 2008). Additional public efforts are aimed at standardizing the description of experimental protocols (Taylor et al., 2008), as well as improving toxicity data standards in relation to toxicogenomics experiments (Burgoon, 2007; Fostel, 2008; Fostel et al., 2005, 2007). Largely neglected in the genomics field, ho (...truncated)