A computational framework for the prioritization of disease-gene candidates (pdf)

Article PDF: http://www.biomedcentral.com/content/pdf/1471-2164-16-S9-S2.pdf

Browne et al. BMC Genomics 2015, 16(Suppl 9):S2
http://www.biomedcentral.com/1471-2164/16/S9/S2

RESEARCH Open Access

A computational framework for the prioritization of disease-gene candidates

Fiona Browne*†, Haiying Wang*, Huiru Zheng*

From IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014), Belfast, UK. 2-5 November 2014

Abstract
Background: Identifying genes and uncovering the roles they play in disease is an important and complex challenge. Genome-wide linkage and association studies have advanced the identification of genetic variants that underpin human disease. An important challenge now is to identify meaningful disease-associated genes from the long lists of candidate genes implicated by these analyses. Gene prioritization can enhance our understanding of disease mechanisms and aid in the discovery of drug targets. The integration of protein-protein interaction networks with disease datasets and contextual information is an important tool in unraveling the molecular basis of diseases.
Results: In this paper we propose a computational pipeline for the prioritization of disease-gene candidates. Diverse heterogeneous data are integrated, including gene expression, protein-protein interaction networks, ontology-based similarity, topological measures and tissue-specific data. The pipeline was applied to prioritize Alzheimer’s Disease (AD) genes, generating a list of 32 prioritized genes. This approach correctly identified the key AD susceptibility genes PSEN1 and TRAF1. Biological process enrichment analysis revealed that the prioritized genes are associated with processes modulated in AD pathogenesis, including regulation of neurogenesis and generation of neurons. Relatively high predictive performance (AUC: 0.70) was observed when classifying AD and normal gene expression profiles from individuals using leave-one-out cross validation.
Conclusions: This work provides a foundation for future investigation of diverse heterogeneous data integration for disease-gene prioritization.

* Corresponding authors. † Contributed equally.
Computer Science Research Institution, School of Computing and Mathematics, University of Ulster, Northern Ireland, UK

Background

The rapid accumulation of high-throughput data, along with advances in network biology, has been fundamental in improving our knowledge of biological systems and complex disease. The emergence of network medicine [1] has enabled disease complexity to be explored through the systematic identification of disease pathways and modules. Through the analysis of network topology and dynamics, key discoveries have been made, including the identification of novel disease genes and pathways, biomarkers and drug targets for disease [2].

Network theory is making important contributions to the topological study of biological networks, such as Protein-Protein Interaction Networks (PPIN) [3]. The study by Xu et al. [4] analyzed topological features of a PPIN and observed that hereditary disease genes from the Online Mendelian Inheritance in Man (OMIM) database [5] have a larger degree and a tendency to interact with other disease genes in literature-curated networks. Both Chuang et al. [6] and Taylor et al. [7] have indicated that alterations in the physical interaction network may be an indicator of breast cancer prognosis. Goh et al. [8] demonstrated that the majority of disease genes are nonessential and located in the periphery of functional networks. Research by [9] discovered that genes connected to diseases with similar phenotypes are more likely to interact directly with each other.

© 2015 Browne et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Identification of candidate genes associated with physiological disorders is a fundamental task in the analysis of complex diseases [10]. Genome-wide association studies and linkage analysis have been pivotal in the identification of candidate genes; however, the large lists of resultant genes are time-consuming and expensive to analyze [11]. The availability of high-throughput molecular interaction network data and the application of network analysis tools such as clustering or graph partitioning have proved valuable in disease gene prioritization [12]. For instance, integrating PPIN data with genome-wide expression profiles obtained from DNA arrays and/or next-generation sequencing enables the modeling of networks and has aided our understanding of how biological networks operate.

A number of computational approaches to prioritize candidate genes have been proposed, including ToppGene [13] and GeneWanderer [14], which rank candidate genes based on known associations with disease genes using diverse data sources and methodologies. The study by Vanunu et al. [15] applied a diffusion-based method named PRINCE to prioritize genes in prostate cancer, AD and type 2 diabetes. Wu et al. proposed the resource AlignPI [16], which applied a network alignment approach to predict disease genes. The algorithm VAVIEN [17] was also developed to prioritize disease genes based on topological features of PPINs.

These diverse studies confirm the importance of, and the need for, improved methods that integrate diverse ‘omic’ sources to uncover candidate disease genes in biological systems. To address this need, we have developed a prioritization pipeline that integrates diverse heterogeneous information.
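
To make the idea of diffusion-based prioritization over a PPIN more concrete, the following is a minimal sketch of a random walk with restart on a toy network. It is not the PRINCE algorithm of [15], nor the pipeline proposed in this paper; it is a generic illustration under stated assumptions. The network, the node labels (seed_1, cand_a, ...), the function names and the restart value of 0.3 are all hypothetical and chosen only for the example.

```python
"""Toy network-propagation ranking (random walk with restart).

All node labels and edges below are made up for illustration; they do not
represent real genes or real protein-protein interactions.
"""


def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Return a dict mapping each node to its steady-state visiting probability."""
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    seeds = set(seeds)
    if not seeds or not seeds <= set(neighbours):
        raise ValueError("seeds must be a non-empty subset of the network nodes")

    # Restart distribution: probability mass spread uniformly over the seeds.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbours}
    p = dict(p0)

    for _ in range(max_iter):
        nxt = {}
        for n in neighbours:
            # Each neighbour m passes on its score split equally among its own neighbours.
            inflow = sum(p[m] / len(neighbours[m]) for m in neighbours[n])
            # With probability `restart` the walker jumps back to a seed node.
            nxt[n] = (1.0 - restart) * inflow + restart * p0[n]
        change = sum(abs(nxt[n] - p[n]) for n in neighbours)
        p = nxt
        if change < tol:
            break
    return p


def rank_candidates(scores, seeds):
    """Order the non-seed nodes from highest to lowest propagation score."""
    candidates = [n for n in scores if n not in set(seeds)]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)


# Hypothetical toy network: two known disease genes (seeds) and five candidates.
TOY_EDGES = [
    ("seed_1", "cand_a"),
    ("seed_1", "cand_b"),
    ("seed_2", "cand_a"),
    ("cand_a", "cand_c"),
    ("cand_c", "cand_d"),
    ("cand_d", "cand_e"),
]
TOY_SEEDS = {"seed_1", "seed_2"}

if __name__ == "__main__":
    toy_scores = random_walk_with_restart(TOY_EDGES, TOY_SEEDS)
    for gene in rank_candidates(toy_scores, TOY_SEEDS):
        print(gene, round(toy_scores[gene], 4))
```

In this toy network the candidate that interacts with both seed nodes receives the highest score, and scores fall off with distance from the seeds. A larger restart value keeps more of the probability mass close to the seeds, so it acts as a locality parameter.
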
We illustrate the implementation of this framework using Alzheimer’s Disease (AD) as a case study. AD is the most common neurodegenerative disease and is both genetically complex and heterogeneous. Pathological characteristics of AD include the presence of amyloid peptide plaques, mature senile plaques and neurofibrillary tangles, and the loss of neurons in conjunction with oxidative stress [18]. AD can be divided into two categories: early-onset AD (EOAD) (patients < 65 years) and late-onset AD (LOAD) (patients > 65 years). Mutations in a set of genes, including APP, PSEN1 and PSEN2, involved in the amyloid beta and tau pathways have been associated with hereditary AD. Using genome-wide association studies, Lambert et al. [19] identified the gene encoding APOE as a risk factor in LOAD, along with 11 new loci. Furthermore, studies have suggested that AD is a multifactorial disease in which many pathways are involved. This highlights the progress that has been made in determining the genetic underpinnings of AD. However, there is further need for an understanding of AD mechanisms to develop more specific di (...truncated)