Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis
BMC Bioinformatics
Ming Yi, Uma Mudunuri, Anney Che, Robert M Stephens
Address: Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick Inc, NCI-Frederick, Frederick, MD 21702, USA
Background: One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself.
Results: We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets.
Conclusion: This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at http://www.abcc.ncifcrf.gov/wps/wps_index.php
Background
Microarray and other high-throughput (HTP)
technologies have increased exponentially in popularity in recent
years and consequently have generated tremendous
amounts of data. These data provide great opportunities
for systems-level understanding of the underlying
biological themes of complex experiments. As a result, a wide
range of software tools that process and analyze the data
using different approaches and algorithms have been
developed, including clustering methods (e.g., hierarchical
[1], K-means [2], SOM [3]), pattern extraction
methods [4], and methods for identifying differential gene lists
from contrasts between two or more classes (e.g., Significance Analysis of
Microarrays [5], LPE [6], and analysis of variance (ANOVA)
related methods [7,8]). In order to place these patterned
or differential genes into their biological contexts, they
can be mapped into pathways or networks for further
analysis of biological associations and relationships
among them as well as other documented relevant genes
curated from the literature [4,9-11]. Alternatively,
functional group or gene set overrepresentation analysis
(ORA) methods [12-14] or gene set-based enrichment
analysis (e.g., GSEA [15,16]) can be used to identify the
significantly affected pathways or gene sets that are
enriched or over-represented within a list of patterned or
differential genes.
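As a rough illustration of the ORA idea mentioned above, the following minimal sketch scores each pathway by an upper-tail hypergeometric test. It is not the implementation used by WPS or by any of the cited tools; the function names (`ora_pvalue`, `enrich`) and the toy data are hypothetical, and the hypergeometric test is only one common choice of enrichment statistic.

```python
from math import comb


def ora_pvalue(n_universe, n_pathway, n_list, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of genes in the background (assumed annotation universe)
    n_pathway:  number of background genes annotated to the pathway
    n_list:     number of background genes in the submitted gene list
    n_overlap:  number of genes shared by the list and the pathway
    """
    max_k = min(n_pathway, n_list)
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_list - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total


def enrich(gene_list, pathways, universe):
    """Return {pathway_name: p_value} for one gene list.

    Genes outside the universe are ignored, so every count refers to the
    same background set.
    """
    genes = set(gene_list) & set(universe)
    result = {}
    for name, members in pathways.items():
        members = set(members) & set(universe)
        overlap = len(genes & members)
        result[name] = ora_pvalue(len(universe), len(members), len(genes), overlap)
    return result


# Toy, made-up data: 10 background genes and two hypothetical pathways.
universe = {f"g{i}" for i in range(10)}
pathways = {"pathway_X": {"g0", "g1", "g2", "g3", "g4"}, "pathway_Y": {"g8", "g9"}}
pvalues = enrich({"g0", "g1", "g2", "g3", "g4"}, pathways, universe)
```

A small p-value for a pathway suggests that the list contains more of its genes than expected by chance; in practice a multiple-testing correction would normally follow.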
The enormous increase in availability of data from studies
using similar biological systems, independent samplings,
and/or technical platforms (e.g., microarray, proteomics)
allows an integrative and comparative analysis to be
performed. This provides for a deeper understanding of the
underlying biology and consolidation of initial
observations made from individual studies. Furthermore, the
systems-oriented approach allows for additional insights
from combined datasets. For example, diseases such as
prostate cancer have been studied by many different
groups. These data from different platforms and
independent samplings provide the opportunity not only to
assess the consensus of these studies and the variation
levels of the patient population, but also to perform
integrative analysis for signatures at both the gene and pathway levels.
A conventional way to integrate and compare multiple
experiments derived from independent research groups or
even different technologies for common or unique
underlying biological themes is to derive common genes
amongst them before subjecting them to enrichment
analysis to reveal the underlying biology. However, such
an approach often encounters limitations caused by the
diversity of technologies and the complexity of the
biological system itself.
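The limitation of the common-gene approach described above can be seen in a toy sketch: two gene lists may share no genes at all and still touch the same pathway. All gene and pathway names below are made up for illustration, and the helper functions (`common_genes`, `pathway_hits`, `common_pathways`) are hypothetical rather than part of any existing tool.

```python
def common_genes(gene_lists):
    """Genes present in every list (the conventional gene-level comparison)."""
    sets = [set(gl) for gl in gene_lists]
    return set.intersection(*sets) if sets else set()


def pathway_hits(gene_list, pathways):
    """Map each pathway name to the genes from gene_list that fall in it."""
    genes = set(gene_list)
    return {
        name: genes & set(members)
        for name, members in pathways.items()
        if genes & set(members)
    }


def common_pathways(gene_lists, pathways):
    """Pathways hit by at least one gene from every list."""
    hit_names = [set(pathway_hits(gl, pathways)) for gl in gene_lists]
    return set.intersection(*hit_names) if hit_names else set()


# Made-up example: the two studies report disjoint genes.
list_a = {"geneA1", "geneA2", "geneA3"}
list_b = {"geneB1", "geneB2"}
pathways = {
    "pathway_X": {"geneA1", "geneA2", "geneB1", "geneB2"},
    "pathway_Y": {"geneA3"},
}

shared_genes = common_genes([list_a, list_b])                  # empty set
shared_pathways = common_pathways([list_a, list_b], pathways)  # {"pathway_X"}
```

Here the gene-level intersection is empty, so nothing would remain for enrichment analysis, whereas the pathway-level view still identifies a shared theme. A real analysis would replace the simple "at least one hit" rule with a statistical enrichment measure.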
Many of the available software tools primarily retrieve
expression patterns at the individual gene level and
generate a list of genes that are differentially expressed or have
certain expression patterns across samples. Even the
software tools that employ ORA or GSEA methods [14-16]
usually consider only one or a very limited number of gene
lists at a time. Although a great deal of attention was
placed on gene-level expression patterns initially, there is
an urgent need for capturing the pathway-level patterns
that may represent the common or unique biological
themes, which are embedded in multiple gene lists or
multiple datasets from different, but related studies.
It has become increasingly evident that gene-level
signatures or classifiers that can consistently characterize
different tumor types are relatively hard to validate across
different studies due to the complexity of the underlying
biology (e.g., large genetic variations within the
phenotypical population), experimental variation, and even the
choice of data processing (e.g., normalization,
transformation) and analysis methods/algorithms. This
observation likely results from the fact that many complex
diseases including cancers, heart disease and hypertension
have been shown to be caused by mutations in multiple
genes in the same or related pathways [17-20]. While
biologically relevant genes may consistently behave in
correlation with an associated phenotype across a population,
it is more likely that common pathways can be impacted
through distinct gene events that are not reflected at the
individual gene level. (...truncated)