Integrating diverse genomic data using gene sets
Tyekucheva et al. Genome Biology
Svitlana Tyekucheva 0
Luigi Marchionni
Rachel Karchin
Giovanni Parmigiani 0
0 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute , 450 Brookline Avenue, Boston, MA 02115 , USA
We introduce and evaluate data analysis methods for interpreting simultaneous measurements of multiple genomic features made on the same biological samples. Our tools use gene sets to provide an interpretable common scale for diverse genomic information. We show that we can detect genetic effects even when they act through different mechanisms in different samples, and that we can discover and validate important disease-related gene sets that would not be discovered by analyzing each data type individually.
Background
The increasing affordability of high-throughput
genome-wide assays is enabling the simultaneous measurement
of several genomic features in the same biological
samples. Cancer genome projects have been at the forefront
of this trend, and have faced the challenge of integrating
these diverse data types [1,2], including RNA
transcriptional levels, genotype variation, DNA copy number
variation, and epigenetic marks. Annotated collections of
gene sets, capturing established knowledge about
biological processes and pathways, have proven an essential
tool for integration. Examples of these sets include
chromosomal locations, signaling and metabolic pathways,
transcriptional programs, and targets of specific
transcription factors. Because one can make inferences
about the importance of a given gene set using several
different genomic data types, gene set analysis provides
a direct and biologically motivated approach to
analyzing these data types in an integrated way. A widely used
public collection of gene sets is the Molecular
Signatures Database (MSigDb) [3]. A comprehensive list of
conventional tools for gene set analysis for a single data
type is given in Ackermann et al. [4]. Many of these
approaches are implemented in the extensively used
statistical computing environment R/Bioconductor [5].
The gene set perspective makes sense both biologically
and statistically. First, small differences in the functions
of multiple genes in the same set may not be detectable
at the single gene level, but can add up to create larger
differences at the gene set level. This increases the power
for detecting real biological differences. Second, a single
hit on a given pathway may be sufficient to generate a
phenotypic difference. If this hit can occur in any of
several components in the pathway, individuals with the
same phenotype may show variability in the specific
genes that are hit, but show a more consistent pattern
at the pathway or gene set level [1,6]. Importantly, even
when a difference at the single gene level can be
detected, its biological importance may depend on the
states of other interacting genes and gene products.
Cancer genomes contain point mutations, insertions,
deletions, translocations, methylation abnormalities, and
copy number (CN) and expression changes not seen in
normal tissues. In some cancers, such as glioblastoma
multiforme (GBM), different genes in the pathways
involving TP53, phosphoinositide 3-kinase (PI3K), and
RB1 are altered in different patients, and, importantly,
these might be altered via different mechanisms [1],
such as point mutations and CN changes. Therefore,
taking into account multiple data types should improve
our ability to detect gene sets associated with a
phenotype.
In recent large-scale cancer genome studies [1,6,7]
preliminary integration approaches have been
successfully applied; however, these approaches have been
tailored to specific contexts. A general, scalable, and
rigorous statistical framework has not yet been
developed. In this article, our goal is to fill this gap. To this
end, we introduce, compare, and systematically evaluate
two alternative set-based data integration approaches.
The first approach is based on computing model-based
gene-to-phenotype association scores for each gene
using all data types together, followed by gene set
analysis of these scores. We term this the integrative
approach. The second is to perform separate
conventional gene set analyses for each data type, and then
derive a consensus significance score using a
meta-analytical approach.
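As a rough illustration of the meta-analytic route, the sketch below combines the p-values that one gene set receives from separate single-data-type analyses. Fisher's method is used here purely as an assumption for illustration; this section does not state which combination rule the authors apply. The function name fisher_combine is hypothetical. Fisher's method also assumes independent tests, which holds only approximately when all data types are measured on the same samples.

```python
import math


def fisher_combine(p_values):
    """Combine p-values for one gene set with Fisher's method (illustrative assumption)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is -2 * sum(log p); we work with half of it.
    half_stat = -sum(math.log(p) for p in p_values)
    # Under the null the statistic is chi-square with 2k degrees of freedom.
    # For even degrees of freedom the survival function has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)
```

A call such as fisher_combine([0.04, 0.20, 0.03, 0.50]) would then give one consensus p-value for a gene set scored on four data types; the input values here are made up.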
Results
Overview
We present both novel data analyses and controlled
simulations. First, we jointly examine gene expression
and CN variation data on glioblastoma multiforme
tumors from The Cancer Genome Atlas (TCGA) [2],
and detect differences in the Wnt, glycolysis and stress
pathways that appear relevant to differences between
short- and long-term survivors. We also validate these
findings using independent samples from the NCI
Repository for Molecular Brain Neoplasia Data (Rembrandt)
[8]. To provide a rigorous counterpart to these results,
we perform extensive simulations. These show that the
integrative approach does enable the discovery of
disease-related gene sets that would not be discovered
when each data type is analyzed individually using
current approaches. Discoveries also remain reliable when
several features are highly noisy.
The Cancer Genome Atlas glioblastoma multiforme study
We consider TCGA glioblastoma data [2] of four types:
two gene expression measurements (E1, E2) and two
CN measurements (C1, C2), described in Materials and
methods. To discover gene sets important in GBM
survival we use an extreme discordant phenotype design
[9] with a total of 95 subjects. GBM patients with a
survival time shorter than the lower quartile (190 days) are
labeled short-term survivors (STSs), and those with a
survival time longer than the upper quartile (594 days)
long-term survivors (LTSs). Such grouping enhances
signal relevant to survival. We use gene sets from the
MSigDb canonical pathways.
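The extreme discordant labeling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name label_extremes is hypothetical, the quartile rule used when no thresholds are supplied (the "inclusive" method of Python's statistics module) is an assumption, and censored survival times are not handled.

```python
import statistics


def label_extremes(survival_days, lower=None, upper=None):
    """Label subjects as 'STS', 'LTS', or None (middle of the distribution)."""
    if lower is None or upper is None:
        # Estimate the quartiles from the data when thresholds are not given.
        q1, _, q3 = statistics.quantiles(survival_days, n=4, method="inclusive")
        lower = q1 if lower is None else lower
        upper = q3 if upper is None else upper
    labels = []
    for days in survival_days:
        if days < lower:
            labels.append("STS")  # short-term survivor
        elif days > upper:
            labels.append("LTS")  # long-term survivor
        else:
            labels.append(None)   # excluded from the discordant design
    return labels
```

With the thresholds quoted in the text, the call would be label_extremes(days, lower=190, upper=594); only subjects labeled STS or LTS enter the comparison.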
First, we consider genes that are measured in all data
types (genes that are measured only in a subset of
platforms are filtered out), and use a competitive gene set
test (see Materials and methods), comparing genes within
a set to the remainder of the annotated genes. The top 30
sets discovered by the integrative approach are reported
in Table 1. Among these 30 sets, 12 are not discovered
by any of the standard single-data-type analyses.
The majority of these sets are
related to metabolic processes. Six are involved in
sugar-related metabolic processes and energy production, and
two (the curated streptomycin biosynthesis pathway, and
its KEGG (Kyoto Encyclopedia of Genes and Genomes)
counterpart, hsa00521) are identified as a result of genes
shared with the sugar metabolism group (six out of eight
genes in the streptomycin biosynthesis set are paralogs of
genes in the glycolysis pathway).
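The gene filtering and competitive test used in this analysis can be sketched as below. The actual test is specified in Materials and methods, which is not reproduced here, so the choice of a one-sided Wilcoxon rank-sum test with a normal approximation is an assumption made for illustration. The function names common_genes and competitive_test are hypothetical, and the variance omits a tie correction for brevity.

```python
import math


def common_genes(*platforms):
    """Genes measured on every platform; each platform maps gene -> value."""
    shared = set(platforms[0])
    for platform in platforms[1:]:
        shared &= set(platform)
    return shared


def competitive_test(scores, gene_set):
    """One-sided p-value that genes in the set score higher than the rest.

    scores maps gene -> gene-to-phenotype association score.
    """
    genes = sorted(scores, key=scores.get)
    # Assign midranks so that tied scores share the same rank.
    ranks = {}
    i = 0
    while i < len(genes):
        j = i
        while j + 1 < len(genes) and scores[genes[j + 1]] == scores[genes[i]]:
            j += 1
        mid = (i + j) / 2 + 1
        for gene in genes[i:j + 1]:
            ranks[gene] = mid
        i = j + 1
    in_set = [g for g in genes if g in gene_set]
    n1, n2 = len(in_set), len(genes) - len(in_set)
    if n1 == 0 or n2 == 0:
        raise ValueError("both the set and its complement need scored genes")
    # Mann-Whitney U for the set, compared with its null mean and spread.
    u = sum(ranks[g] for g in in_set) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
```

In this sketch the scores would first be restricted to common_genes(...) across all platforms, so that every gene in the comparison has been measured in every data type.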
This metabolic shift toward sugar metabolism is not
surprising since it is known that cancer cells in general
[10,11], and glioblastoma cells (...truncated)