A computational framework for the prioritization of disease-gene candidates
Browne et al. BMC Genomics 2015, 16(Suppl 9):S2
http://www.biomedcentral.com/1471-2164/16/S9/S2
RESEARCH
Open Access
Fiona Browne*†, Haiying Wang*, Huiru Zheng*
From IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014)
Belfast, UK. 2-5 November 2014
Abstract
Background: The identification of genes and uncovering the role they play in diseases is an important and
complex challenge. Genome-wide linkage and association studies have made advancements in identifying genetic
variants that underpin human disease. An important challenge now is to identify meaningful disease-associated
genes from a long list of candidate genes implicated by these analyses. The application of gene prioritization can
enhance our understanding of disease mechanisms and aid in the discovery of drug targets. The integration of
protein-protein interaction networks along with disease datasets and contextual information is an important tool in
unraveling the molecular basis of diseases.
Results: In this paper we propose a computational pipeline for the prioritization of disease-gene candidates.
Diverse heterogeneous data are integrated, including gene expression, protein-protein interaction networks,
ontology-based similarity, topological measures and tissue-specific data. The pipeline was applied to prioritize
Alzheimer’s Disease (AD) genes, generating a list of 32 prioritized genes. This approach correctly
identified key AD susceptibility genes: PSEN1 and TRAF1. Biological process enrichment analysis revealed that the
prioritized genes are involved in processes modulated in AD pathogenesis, including regulation of neurogenesis and generation of
neurons. Relatively high predictive performance (AUC: 0.70) was observed when classifying AD and normal gene
expression profiles from individuals using leave-one-out cross-validation.
Conclusions: This work provides a foundation for future investigation of diverse heterogeneous data integration
for disease-gene prioritization.
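As a minimal sketch of the evaluation scheme mentioned in the Results, the following Python estimates an AUC from leave-one-out cross-validation. The nearest-centroid scorer, the function names and the toy expression values are all assumptions made for illustration; the abstract does not state which classifier was used.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_scores(X, y):
    """Score each sample with a model fitted on all the other samples.

    Assumed scorer: distance to the 'normal' centroid minus distance to the
    'AD' centroid, so a higher score means 'more AD-like'.
    Each class needs at least two samples so that neither centroid is empty.
    """
    scores = []
    for i, held_out in enumerate(X):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        pos = centroid([x for x, lab in train if lab == 1])
        neg = centroid([x for x, lab in train if lab == 0])
        scores.append(sq_dist(held_out, neg) - sq_dist(held_out, pos))
    return scores


def auc(scores, y):
    """AUC as the fraction of (AD, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, lab in zip(scores, y) if lab == 1]
    neg = [s for s, lab in zip(scores, y) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up expression profiles (two genes per individual); 1 = AD, 0 = normal.
X = [[2.0, 1.0], [1.8, 1.2], [2.2, 0.9], [1.0, 2.0], [1.1, 2.1], [0.9, 1.8]]
y = [1, 1, 1, 0, 0, 0]
toy_auc = auc(loocv_scores(X, y), y)
```

Because every held-out score comes from a model that never saw that individual, the resulting AUC is an out-of-sample estimate, which is the point of the leave-one-out design.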
Background
The rapid accumulation of high-throughput data, along
with advances in network biology, has been fundamental in improving our knowledge of biological systems
and complex disease. The emerging field of network medicine [1] explores disease complexity through the
systematic identification of disease pathways and modules. Through the analysis of network topology and dynamics,
key discoveries have been made, including the identification
of novel disease genes and pathways, biomarkers and
drug targets for disease [2]. Network theory is making
important contributions to the topological study of biological networks, such as Protein-Protein Interaction
Networks (PPIN) [3]. The study by Xu et al. [4] analyzed topological features of a PPIN and observed that
hereditary disease genes from the Online Mendelian
Inheritance in Man (OMIM) database [5] have a larger
degree and a tendency to interact with other disease genes in literature-curated networks. Both Chuang et al.
[6] and Taylor et al. [7] have indicated that alterations in the physical interaction network may be an indicator of breast cancer prognosis. Goh et al. [8]
demonstrated that the majority of disease genes are
nonessential and located in the periphery of functional
networks. The study in [9] found that genes connected to diseases with similar phenotypes are more
likely to interact directly with each other.
* Corresponding authors
† Contributed equally
Computer Science Research Institution, School of Computing and
Mathematics, University of Ulster, Northern Ireland, UK
© 2015 Browne et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
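The topological observations cited in the Background (disease genes having larger degree and tending to interact with one another) can be illustrated with a small sketch. The edge list, the gene identifiers and the function names below are hypothetical placeholders, not data or code from the cited studies.

```python
from collections import defaultdict
from statistics import mean


def node_degrees(edges):
    """Degree of each node in an undirected network given as (a, b) pairs.

    Self-interactions and duplicate edges are ignored.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


def mean_degree_by_group(degrees, disease_genes):
    """Mean degree of disease genes versus all remaining genes."""
    disease = [d for g, d in degrees.items() if g in disease_genes]
    other = [d for g, d in degrees.items() if g not in disease_genes]
    return mean(disease), mean(other)


def disease_disease_edge_fraction(edges, disease_genes):
    """Fraction of edges whose two endpoints are both disease genes."""
    both = sum(1 for a, b in edges if a in disease_genes and b in disease_genes)
    return both / len(edges)


# Hypothetical toy network: D* are assumed disease genes, G* are other genes.
edges = [("D1", "D2"), ("D1", "G1"), ("D1", "G2"), ("D2", "G3"), ("G3", "G4")]
disease_genes = {"D1", "D2"}

degrees = node_degrees(edges)
mean_disease, mean_other = mean_degree_by_group(degrees, disease_genes)
dd_fraction = disease_disease_edge_fraction(edges, disease_genes)
```

On a real PPIN the same comparison would need a significance test against degree-matched random gene sets, since curated networks are biased towards well-studied genes.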
Identification of candidate genes associated with physiological disorders is a fundamental task in the analysis of
complex diseases [10]. Genome-wide association studies
and linkage analysis have been pivotal in the identification
of candidate genes; however, the large lists of resultant
genes are time-consuming and expensive to analyze [11]. The availability of high-throughput molecular
interaction networks and the application of network analysis tools such as clustering or graph partitioning
have proved valuable in disease gene prioritization [12].
For instance, the integration of PPIN data with genome-wide
expression profiles from DNA arrays and/or next-generation sequencing enables the modeling of networks and has
aided our understanding of how biological networks operate. A number of computational approaches to prioritize
candidate genes have been proposed, including ToppGene
[13] and GeneWanderer [14], which rank candidate genes
based on known associations with disease genes using
diverse data sources and methodologies. The study by
Vanunu et al. [15] applied a diffusion-based method
named PRINCE to prioritize genes in prostate cancer, AD
and type 2 diabetes. Wu et al. proposed the resource
AlignPI [16], which applied a network alignment approach
to predict disease genes. The algorithm VAVIEN [17] was
also developed to prioritize disease genes based on topological features of PPINs.
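To make the idea of diffusion-based prioritization concrete, the sketch below spreads scores from known disease genes over a network until they stabilize. It is a generic propagation scheme with degree-normalized edge weights, written under our own assumptions (the parameter alpha, the toy network and all names are illustrative); it is not the PRINCE implementation or the pipeline proposed in this paper.

```python
import math


def propagate(adj, seeds, alpha=0.8, max_iter=200, tol=1e-9):
    """Iterate F <- alpha * W' F + (1 - alpha) * Y until convergence.

    adj   : dict mapping each node to the set of its neighbours (undirected)
    seeds : nodes with prior evidence (Y = 1 for seeds, 0 otherwise)
    W'    : adjacency normalized as 1 / sqrt(deg(u) * deg(v)) per edge
    """
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    prior = {n: 1.0 if n in seeds else 0.0 for n in adj}
    score = dict(prior)
    for _ in range(max_iter):
        updated = {}
        for n, nbrs in adj.items():
            spread = sum(score[m] / math.sqrt(degree[n] * degree[m]) for m in nbrs)
            updated[n] = alpha * spread + (1 - alpha) * prior[n]
        change = max(abs(updated[n] - score[n]) for n in adj)
        score = updated
        if change < tol:
            break
    return score


# Hypothetical path-shaped network: D - S1 - A - B - C, with S1 the only seed.
adj = {
    "D": {"S1"},
    "S1": {"D", "A"},
    "A": {"S1", "B"},
    "B": {"A", "C"},
    "C": {"B"},
}
scores = propagate(adj, seeds={"S1"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy case the scores fall off with distance from the seed, so candidates closer to known disease genes in the network rank higher. Larger alpha lets evidence travel further from the seeds, while smaller alpha keeps the ranking close to the prior.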
These diverse studies confirm the importance of, and need
for, improved methods to integrate diverse ‘omic’ sources
to uncover candidate disease genes in biological systems.
To address this need, we have developed a prioritization
pipeline that integrates diverse heterogeneous information. We illustrate the implementation of this framework
using Alzheimer’s Disease (AD) as a case study. AD is the
most common neurodegenerative disease and is both
genetically complex and heterogeneous. Pathological characteristics of AD include the presence of amyloid peptide plaques, mature senile plaques and neurofibrillary tangles, and the
loss of neurons in conjunction with the presence of oxidative stress [18]. AD can be divided into two categories:
early onset AD (EOAD) (patients < 65 years) and late onset AD
(LOAD) (patients > 65 years). Mutations in a set of genes, including
APP, PSEN1 and PSEN2, involved in the amyloid beta and
tau pathways have been associated with hereditary AD.
Using genome-wide association studies, Lambert et al. [19]
identified the gene encoding APOE as a risk factor in LOAD, along with 11 new loci. Furthermore, studies have suggested that AD is a multifactorial disease in which many
pathways are involved. This highlights the progress that
has been made in determining the genetic underpinnings
of AD. However, there is further need for an understanding of AD mechanisms to develop more specific di (...truncated)