SIGNATURE: A workbench for gene expression signature analysis
BMC Bioinformatics
Jeffrey T Chang 0
Michael L Gatza 2
Joseph E Lucas 2
William T Barry 1 2
Peyton Vaughn 2
Joseph R Nevins 2 3
0 Department of Integrative Biology and Pharmacology, University of Texas Health Science Center at Houston, Houston, TX, USA
1 Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
2 Institute for Genome Sciences and Policy, Duke University and Duke University Medical Center, Durham, NC, USA
3 Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC, USA
Background: The biological phenotype of a cell, such as a characteristic visual image or behavior, reflects activities derived from the expression of collections of genes. As such, an ability to measure the expression of these genes provides an opportunity to develop more precise and varied sets of phenotypes. However, using this approach requires computational methods that are difficult to implement and apply, and thus there is a critical need for intelligent software tools that can reduce the technical burden of the analysis. Tools for gene expression analyses are unusually difficult to implement in a user-friendly way because their application requires a combination of biological data curation, statistical and computational methods, and database expertise.
Results: We have developed SIGNATURE, a web-based resource that simplifies gene expression signature analysis by providing the software, data, and protocols needed to perform the analysis successfully. This resource uses Bayesian methods for processing gene expression data coupled with a curated database of gene expression signatures, all carried out within a GenePattern web interface for easy use and access.
Conclusions: SIGNATURE is available for public use at http://genepattern.genome.duke.edu/signature/.
Background
Gene expression signatures are powerful tools that can
reveal a range of biologically and clinically important
characteristics of biological samples. In recent years,
signatures have been developed that can differentiate
distinct subtypes of tumors, identify important cellular
responses to the environment (e.g., hypoxia), predict
clinical outcomes in cancer, and model the activation of
signaling pathways [1]. The power of gene expression
signatures derives from their ability to connect an in
vitro experimental state with an in vivo one in a
quantitative manner. Commonly, the term gene expression
signature has been used in two ways. In one, the signature
consists of a set of genes that share a common
pattern of expression. Sometimes this can be reported as
genes that increase or decrease in expression, but the
basic characteristic of the signature is the identity of the
genes. Because of this, these signatures are often called
gene sets. Gene sets have been curated from the
literature and collected into databases such as MSigDB and
GeneSigDB [2,3]. Tools have been developed that can
analyze gene sets by looking for shared function or
characteristics such as Gene Ontology terms [4] or drug
sensitivity [5]. Another tool, single-sample GSEA, has
previously been applied to predict the co-regulation of
gene sets from MSigDB in gene expression samples [6].
Evidence of co-regulation is then used to infer the
activation of the phenotype embodied by the gene set.
The second type of signature relates the magnitude of
increase or decrease in gene expression, in the form of
weighted values, to a biological phenotype using a
quantitative predictive model [6-16]. These signatures are
often developed from experimental conditions that
precisely control the phenotype of interest - for instance,
the activation of a cell signaling pathway or the response
of cells to a defined stimulus. Because the signature
consists of quantitative measures of the expression
levels of genes that define the phenotype, it allows a
direct measurement of the phenotype, rather than an
indirect inference through co-regulation of genes in a
gene set. A limitation of this approach, however, is the
complexity of the methods used to derive and analyze
the signatures, making it difficult to apply without
significant multidisciplinary expertise [17].
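The distinction between the two types of signature can be illustrated with a minimal sketch. It is not the method used by SIGNATURE; it only contrasts an identity-based gene set score with a weighted predictive score. The gene names, expression values, weights, and function names are all hypothetical, and the probit link (standard normal CDF) is an assumption chosen for illustration, since this passage does not specify the form of the model.

```python
from math import erf, sqrt


def gene_set_score(expression, gene_set):
    """Identity-based signature: average the expression of member genes.

    Only membership in the gene set matters; every gene counts equally.
    """
    values = [expression[g] for g in gene_set if g in expression]
    if not values:
        raise ValueError("no gene in the set was measured in this sample")
    return sum(values) / len(values)


def weighted_signature_probability(expression, weights, intercept=0.0):
    """Weighted signature: combine expression levels with per-gene weights.

    The linear score is mapped to a probability with a probit link
    (an assumption made for this sketch). Unmeasured genes contribute 0.
    """
    linear = intercept + sum(
        w * expression.get(gene, 0.0) for gene, w in weights.items()
    )
    return 0.5 * (1.0 + erf(linear / sqrt(2.0)))


# Hypothetical, standardized expression values for one sample.
sample = {"GENE_A": 1.0, "GENE_B": -1.0, "GENE_C": 0.5}

# Gene set: GENE_A and GENE_B are members, direction is ignored.
set_score = gene_set_score(sample, ["GENE_A", "GENE_B"])

# Weighted signature: GENE_A is up and GENE_B is down in the phenotype.
probability = weighted_signature_probability(
    sample, {"GENE_A": 1.0, "GENE_B": -1.0}
)
```

In this toy sample the two member genes move in opposite directions, so the unweighted gene set score cancels out, while the weighted signature, which encodes the expected direction and magnitude of each gene, yields a high probability that the phenotype is present.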
Three major obstacles hinder the broad use of
signatures. First, gene expression signature analysis requires
the rigorous application of complex statistical
methodologies on gene expression data. Second, it requires
the acquisition and validation of data that properly
capture the biological state of interest. Third, it
requires a computational infrastructure that makes the
statistical software and data available through an
easy-to-use interface. In sum, gene expression signature
analysis requires a confluence of expertise across a range of
disciplines, including statistics, biology, and computer
science.
Although others have previously made use of our
approach [16], it requires a level of expertise and
computational infrastructure not always available in
biological laboratories. The bioinformatic analysis, which
requires the proper selection and application of
statistical algorithms as well as biological curation and
validation of the signatures, can be daunting. The
challenge, therefore, is to develop software tools that enable
such analyses for the general user. While it has long
been recognized that software can target different
types of users, a set of principles for software that is
biologist-friendly was recently described [18]. In short,
the recommendations are that the software 1) requires
no knowledge of programming, 2) allows application of
advanced methods, 3) can be used on different
operating systems, and 4) provides a natural language
description of the results. While such software has
been developed for biological sequence alignment [19],
sequence annotation [20], phylogenetic analysis [21],
and comparison of prokaryotic genomes [22], no such
platform exists for gene expression signature analysis.
Because of this, and also because of the technical
difficulty in performing gene expression analysis, we
believe there is a need for a platform that captures a
carefully refined analysis workflow, coupling algorithms
and data, and enables researchers to predict the activation
of gene expression signatures in their samples.
Implementation
To address the critical need for a platform for gene
expression signature analysis, we have developed a
collection of tools over the course of several years. First,
we have developed BinReg, a statistical algorithm to
predict the activation of a gene expression signature on a
data set [23,24]. Second, we have curated a database of
signatures that predict the activation of oncogenic
pathways [25]. Now, we report on the development of a
computational platform that combines these in a
biologist-friendly interface, using the principles previously
established. Here we describe the three components of a
novel gene expression signature analysis platform, which
we colle (...truncated)