Integrating diverse genomic data using gene sets (pdf)

http://genomebiology.com/content/pdf/gb-2011-12-10-r105.pdf

Svitlana Tyekucheva, Luigi Marchionni, Rachel Karchin, Giovanni Parmigiani
Genome Biology

Affiliation (Tyekucheva, Parmigiani): Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02115, USA

We introduce and evaluate data analysis methods to interpret simultaneous measurement of multiple genomic features made on the same biological samples. Our tools use gene sets to provide an interpretable common scale for diverse genomic information. We show we can detect genetic effects, although they may act through different mechanisms in different samples, and show we can discover and validate important disease-related gene sets that would not be discovered by analyzing each data type individually.

Background

The increasing affordability of high-throughput genome-wide assays is enabling the simultaneous measurement of several genomic features in the same biological samples. Cancer genome projects have been at the forefront of this trend, and have faced the challenge of integrating these diverse data types [1,2], including RNA transcriptional levels, genotype variation, DNA copy number variation, and epigenetic marks.

Annotated collections of gene sets, capturing established knowledge about biological processes and pathways, have proven an essential tool for integration. Examples of these sets include chromosomal locations, signaling and metabolic pathways, transcriptional programs, and targets of specific transcription factors. Because one can make inferences about the importance of a given gene set using several different genomic data types, gene set analysis provides a direct and biologically motivated approach to analyzing these data types in an integrated way. A widely used public collection of gene sets is the Molecular Signatures Database (MSigDb) [3]. A comprehensive list of conventional tools for gene set analysis for a single data type is given in Ackermann et al. [4]. Many of these approaches are implemented in the extensively used statistical computing environment R/Bioconductor [5].

The gene set perspective makes sense both biologically and statistically. First, small differences in the functions of multiple genes in the same set may not be detectable at the single gene level, but can add up to create larger differences at the gene set level. This increases the power for detecting real biological differences. Second, a single hit on a given pathway may be sufficient to generate a phenotypic difference. If this hit can occur in any of several components in the pathway, individuals with the same phenotype may show variability in the specific genes that are hit, but show a more consistent pattern at the pathway or gene set level [1,6]. Importantly, even when a difference at the single gene level can be detected, its biological importance may depend on the states of other interacting genes and gene products.

Cancer genomes contain point mutations, insertions, deletions, translocations, methylation abnormalities, and copy number (CN) and expression changes not seen in normal tissues. In some cancers, such as glioblastoma multiforme (GBM), different genes in pathways involving TP53, phosphoinositide 3-kinase (PI3K), and RB1 are altered in different patients, and, importantly, these might be altered via different mechanisms [1], such as point mutations and CN changes. Therefore, taking into account multiple data types should improve our ability to detect gene sets associated with a phenotype.

In recent large-scale cancer genome studies [1,6,7] preliminary integration approaches have been successfully applied; however, these approaches have been tailored to specific contexts. A general, scalable, and rigorous statistical framework has not yet been developed. In this article, our goal is to fill this gap. To this end, we introduce, compare, and systematically evaluate two alternative set-based data integration approaches.
The first approach is based on computing model-based gene-to-phenotype association scores for each gene using all data types together, followed by gene set analysis of these scores. We term this the integrative approach. The second is to perform separate conventional gene set analyses for each data type, and then derive a consensus significance score using a meta-analytical approach.

Results

Overview

We present both novel data analyses and controlled simulations. First, we jointly examine gene expression and CN variation data on glioblastoma multiforme tumors from The Cancer Genome Atlas (TCGA) [2], and detect differences in the Wnt, glycolysis and stress pathways that appear relevant to differences between short- and long-term survivors. We also validate these findings using independent samples from the NCI Repository for Molecular Brain Neoplasia Data (Rembrandt) [8].

To provide a rigorous counterpart to these results we perform extensive simulations. These show that the integrative approach does enable the discovery of disease-related gene sets that would not be discovered when each data type is analyzed individually using current approaches. Discoveries also remain reliable when several features are highly noisy.

The Cancer Genome Atlas glioblastoma multiforme study

We consider TCGA glioblastoma data [2] of four types: two gene expression measurements (E1, E2) and two CN measurements (C1, C2), described in Materials and methods. To discover gene sets important in GBM survival we use an extreme discordant phenotype design [9] with a total of 95 subjects. GBM patients with a survival time shorter than the lower quartile (190 days) are labeled short-term survivors (STSs), and those with a survival time longer than the upper quartile (594 days) long-term survivors (LTSs). Such grouping enhances signal relevant to survival. We used gene sets from the MSigDb canonical pathways.
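
As an illustration of the extreme discordant phenotype grouping described above, the following Python sketch labels samples by survival quartiles. It is not the authors' code: the function name label_extreme_survivors, the use of NumPy's default quantile interpolation, and the choice to ignore censoring are all assumptions made for this example. The text reports cutoffs of 190 and 594 days on the TCGA data; the sketch simply computes cutoffs from whatever survival times it is given.

```python
import numpy as np


def label_extreme_survivors(survival_days, lower_q=0.25, upper_q=0.75):
    """Label each sample 'STS', 'LTS', or None (hypothetical helper).

    Samples strictly below the lower quantile are short-term survivors
    (STS); samples strictly above the upper quantile are long-term
    survivors (LTS). Samples in between get None and would be excluded
    from the comparison. Censoring is NOT handled here (an assumption).
    """
    days = np.asarray(survival_days, dtype=float)
    # Quantile cutoffs computed from the data (default linear interpolation).
    lo, hi = np.quantile(days, [lower_q, upper_q])
    labels = np.full(days.shape, None, dtype=object)
    labels[days < lo] = "STS"
    labels[days > hi] = "LTS"
    return labels, float(lo), float(hi)
```

Samples labeled None fall between the two cutoffs and would be left out of the short- versus long-term comparison, which is what concentrates the survival-related signal in the two extreme groups.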
First, we consider genes that are measured in all data types (genes that are measured only in a subset of platforms are filtered out), and use a competitive gene set test (see Materials and methods), comparing genes within a set to the remainder of the annotated genes. The 30 top sets discovered by the integrative approach are reported in Table 1. If we consider the top 30 sets, we discover 12 gene sets that are not discovered by any of the standard single-data-type analyses.

The majority of these sets are related to metabolic processes. Six are involved in sugar-related metabolic processes and energy production, and two (the curated streptomycin biosynthesis pathway, and its KEGG (Kyoto Encyclopedia of Genes and Genomes) counterpart, hsa00521) are identified as a result of genes shared with the sugar metabolism group (six out of eight genes in the streptomycin biosynthesis set are paralogs of genes in the glycolysis pathway). This metabolic shift toward sugar metabolism is not surprising since it is known that cancer cells in general [10,11], and glioblastoma cells (...truncated)