Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis


BMC Bioinformatics (Software)

Ming Yi, Uma Mudunuri, Anney Che, Robert M Stephens

Address: Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick Inc, NCI-Frederick, Frederick, MD 21702, USA

Background: One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself.

Results: We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns.
This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets.

Conclusion: This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at http://www.abcc.ncifcrf.gov/wps/wps_index.php

Background

Microarray and other high throughput (HTP) technologies have exponentially increased in popularity in recent years and consequently have generated tremendous amounts of data. These data provide great opportunities for systems-level understanding of the underlying biological themes of complex experiments. As a result, a wide range of software tools that process and analyze the data using different approaches and algorithms have been developed, including clustering methods (e.g., hierarchical [1], K-means [2], SOM [3]), pattern extraction methods [4], and methods for identifying differential gene lists from contrasts between two or more classes (e.g., Significance Analysis of Microarrays [5], LPE [6], and analysis of variance (ANOVA) related methods [7,8]). In order to place these patterned or differential genes into their biological contexts, they can be mapped into pathways or networks for further analysis of biological associations and relationships among them as well as other documented relevant genes curated from the literature [4,9-11].
Alternatively, functional group or gene set overrepresentation analysis (ORA) methods [12-14] or gene set-based enrichment analysis (e.g., GSEA [15,16]) can be used to identify the significantly affected pathways or gene sets that are enriched or over-represented within a list of patterned or differential genes.

The enormous increase in availability of data from studies using similar biological systems, independent samplings, and/or technical platforms (e.g., microarray, proteomics) allows an integrative and comparative analysis to be performed. This provides a deeper understanding of the underlying biology and consolidates initial observations made from individual studies. Furthermore, the systems-oriented approach allows for additional insights from combined datasets. For example, diseases such as prostate cancer have been studied by many different groups. These data from different platforms and independent samplings provide the opportunity not only to assess the consensus of these studies and the variation levels of the patient population, but also to perform integrative analysis for signatures at both the gene and pathway levels.

A conventional way to integrate and compare multiple experiments derived from independent research groups or even different technologies for common or unique underlying biological themes is to derive the common genes amongst them and then subject those genes to enrichment analysis to reveal the underlying biology. However, such an approach often encounters limitations caused by the diversity of technologies and the complexity of the biological system itself. Many of the available software tools primarily retrieve expression patterns at the individual gene level and generate a list of genes that are differentially expressed or have certain expression patterns across samples. Even the software tools that employ ORA or GSEA methods [14-16] usually consider only one or a very limited number of gene lists at a time.
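
The contrast between the common-gene approach and a pathway-level comparison can be illustrated with a minimal sketch. It is not the PPEP or WPS implementation; it assumes a one-sided hypergeometric test as the ORA statistic, and the gene identifiers, pathway names, background size, and 0.05 cutoff are all hypothetical values chosen for illustration. No multiple-testing correction is applied.

```python
from math import comb


def enrichment_p(hits, list_size, pathway_size, background_size):
    """One-sided hypergeometric p-value: P(X >= hits) when list_size genes
    are drawn from a background in which pathway_size genes are annotated."""
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background_size, list_size)


def enrichment_table(gene_lists, pathways, background_size):
    """Return {pathway: {list_name: p_value}} for every pathway/list pair.
    Assumes every gene in every list belongs to the background."""
    return {
        term: {
            name: enrichment_p(len(genes & members), len(genes),
                               len(members), background_size)
            for name, genes in gene_lists.items()
        }
        for term, members in pathways.items()
    }


# Hypothetical toy data: 1000 background genes named g0..g999.
background_size = 1000
pathways = {
    "apoptosis": {f"g{i}" for i in range(0, 20)},
    "cell_cycle": {f"g{i}" for i in range(20, 50)},
}
gene_lists = {
    # Each list hits six different apoptosis genes plus unrelated genes.
    "list_A": {f"g{i}" for i in range(0, 6)} | {f"g{i}" for i in range(100, 114)},
    "list_B": {f"g{i}" for i in range(10, 16)} | {f"g{i}" for i in range(200, 214)},
}

# Conventional approach: intersect the gene lists first.
common_genes = set.intersection(*gene_lists.values())

# Pathway-level approach: test each list separately, then compare patterns.
table = enrichment_table(gene_lists, pathways, background_size)
cutoff = 0.05  # illustrative threshold, not taken from the paper
shared = sorted(
    term for term, row in table.items() if all(p < cutoff for p in row.values())
)

print("common genes:", len(common_genes))
print("pathways enriched in every list:", shared)
```

In this toy case the two lists share no genes, so the intersection step leaves nothing to test, yet both lists are independently enriched for the same hypothetical pathway.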
Although a great deal of attention was initially placed on gene-level expression patterns, there is an urgent need to capture the pathway-level patterns that may represent the common or unique biological themes embedded in multiple gene lists or multiple datasets from different but related studies. It has become increasingly evident that gene-level signatures or classifiers that can consistently characterize different tumor types are relatively hard to validate across different studies due to the complexity of the underlying biology (e.g., large genetic variations within the phenotypic population), experimental variation, and even the choice of data processing (e.g., normalization, transformation) and analysis methods or algorithms. This observation likely results from the fact that many complex diseases, including cancers, heart disease, and hypertension, have been shown to be caused by mutations in multiple genes in the same or related pathways [17-20]. While biologically relevant genes may consistently behave in correlation with an associated phenotype across a population, it is more likely that common pathways can be impacted through distinct gene events that are not reflected at the individual gene level. (...truncated)