Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data
Li and Li BMC Genomics 2012, 13(Suppl 7):S27
http://www.biomedcentral.com/1471-2164/13/S7/S27
PROCEEDINGS
Open Access
Yongjin Li1*, Jinyan Li2*
From Asia Pacific Bioinformatics Network (APBioNet) Eleventh International Conference on Bioinformatics
(InCoB2012)
Bangkok, Thailand. 3-5 October 2012
Abstract
Background: High-throughput experiments have produced many genomic datasets and hundreds of candidate disease
genes. To discover the real disease genes among a set of candidate genes, computational methods have been
proposed that work on various types of genomic data sources. As a single source of genomic data is prone to
bias, incompleteness and noise, integration of different genomic data sources is needed to achieve
reliable disease gene identification.
Results: In contrast to the commonly adopted data integration approach, which integrates separate lists of
candidate genes derived from each single data source, we merge various genomic networks into a multigraph,
which can hold multiple edges between a pair of nodes. This novel approach provides a data
platform with strong noise tolerance for prioritizing disease genes. A new random walk is then developed
to work on multigraphs, using a modified step to calculate the transition matrix. Our method is further enhanced
to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared
on benchmark datasets, our method is shown to be more accurate than state-of-the-art methods in disease
gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes
Mellitus. Some of the newly identified disease genes are supported by recently published literature.
Conclusions: The proposed RWRM (Random Walk with Restart on Multigraphs) model and CHN (Complex
Heterogeneous Network) model are effective in data integration for candidate gene prioritization.
* Corresponding authors
1 Center for Systems Biology, University of Texas at Dallas, USA
2 Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology, Sydney, Australia
Full list of author information is available at the end of the article
© 2012 Li et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Background
Reliable identification of disease genes is an important
task in biomedical research, useful for uncovering the
mechanism of a disease and revealing therapeutic targets.
Family-based genetic linkage analysis has been widely
conducted to determine regions in the chromosomes of a
genome which have large genetic effects on a disease [1].
Each such region is called a susceptible locus, which may
cover dozens or even hundreds of genes. The genes in a
susceptible locus are candidate disease genes, which can
be further narrowed down to the real disease genes by
computational or experimental methods. In the Online
Mendelian Inheritance in Man (OMIM) database [2], which
stores the latest data obtained by linkage analysis, there
are still thousands of disease loci in which the real
disease-causing genes have not been identified.
Sophisticated computational algorithms have recently been
proposed to prioritize those candidate genes [3-7].
However, most of these algorithms are based on a single
data source. As a single data source is prone to bias,
incompleteness and noise, integration of various genomic
data sources is needed for reliable prioritization of a
set of candidate genes. The top-ranked candidate genes
are then the most likely to be the real disease genes.
A commonly adopted data integration approach is to
integrate separate lists of candidate genes derived from
each single data source. A notable example is ENDEAVOUR [8],
which handles nine data sources, including sequence data,
gene annotation data, etc. It is implemented in a rank
aggregation based integration (RABI) framework consisting
of two stages. In the first stage, a ranked list of candidate
genes is determined for each data source according to their
similarity to known disease genes. Subsequently, these ranked
lists are integrated into one ranked list by using
N-dimensional order statistics (NDOS) [9]. In an earlier
work [10], we improved the performance of ENDEAVOUR by using
a random walk with restart (RWR) as the ranking algorithm
in the first stage, and a discounted rating system (DRS) in
the second stage to combine the ranked lists.
Merging separate lists of candidate disease genes derived
from single data sources with bias and noise can inflate
the uncertainties in the data, which may then propagate into
the final ranking. To address this problem, it is better to
eliminate the bias and noise by merging the single data sources
into an integrated data source, and then to prioritize a set
of candidate genes. This work proposes a novel integration
method that merges various genomic networks into a multigraph,
which can hold multiple edges between a pair of nodes. We then
run a random walk on the multigraph to find disease genes.
Many random walk models have recently been introduced to solve
different kinds of problems in bioinformatics. For example,
Köhler et al. [11] used the RWR algorithm to prioritize
candidate genes. Macropol et al. [12] proposed a repeated
random walks algorithm to predict protein complexes from
the PPI network. Nibbe et al. [13] used random walk models to
identify disease-relevant subnetworks from the PPI
network, and studied the crosstalk between them. However,
none of these models can work on multigraphs, as the
multiple edges between a pair of nodes complicate the
calculation of transition probabilities.
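For orientation, random walk with restart on a single network is commonly written as the iteration p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized adjacency matrix, p(0) spreads probability uniformly over the known disease genes (the seeds), and r is the restart probability. The following is a minimal sketch of that standard formulation, not the code of any of the cited works. The function names, the restart value of 0.7, the convergence tolerance and the toy path graph are all illustrative assumptions.

```python
def column_normalize(adj):
    """Return a column-stochastic copy of a square adjacency matrix.

    Columns that sum to zero (isolated nodes) are left as all zeros.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change < tol.

    adj     : square list-of-lists adjacency matrix (single network)
    seeds   : non-empty set of node indices for the known disease genes
    restart : restart probability (0.7 is an assumed, illustrative value)
    """
    n = len(adj)
    w = column_normalize(adj)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: a path 0 - 1 - 2 - 3 with node 0 as the seed.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(path, seeds={0})
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Candidate genes are ranked by their steady-state scores, so nodes closer to the seeds rank higher. The sketch assumes exactly one weight per node pair, which is the assumption that a multigraph breaks.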
In this work, we first construct separate gene networks
corresponding to different data sources, and then merge
these networks into a single network defined by a multigraph.
When our random walk algorithm runs on the merged network,
we propose to calculate the transition probability as the
expected value of the transition probabilities from the
multiple networks. Our algorithm was compared with four
RABI models [8,10]. On a benchmark data set covering 36
genetic diseases [11], our proposed algorithm achieved an
AUC value of 89.4%, much higher than those of the four
RABI models. Our method is named
RWRM (Random Walk with Restart on Multigraphs).
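One way to read "expected value of the transition probabilities" is sketched below, purely as an illustration. It assumes the walker first picks one of the source networks at random and then takes a step within it, so the merged transition matrix is a weighted average of the per-network column-normalized matrices. The uniform network weights, the choice to average only over networks in which the current node has at least one edge, the restart value of 0.7 and all function names are assumptions of this sketch and are not taken from the text above.

```python
def column_normalize(adj):
    """Return a column-stochastic copy of a square adjacency matrix."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def merged_transition(networks, weights=None):
    """Average per-network transition probabilities into one matrix.

    networks : list of square adjacency matrices over the same node set,
               one per data source (the edge layers of the multigraph)
    weights  : probability of picking each network; uniform if omitted
               (an assumption of this sketch)
    """
    k, n = len(networks), len(networks[0])
    weights = weights or [1.0 / k] * k
    normalized = [column_normalize(a) for a in networks]
    merged = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Assumption: only networks where node j has an edge take part,
        # and their weights are renormalized so the column sums to 1.
        active = [m for m in range(k) if any(networks[m][i][j] for i in range(n))]
        total = sum(weights[m] for m in active)
        if total == 0:
            continue  # node j is isolated in every network
        for m in active:
            for i in range(n):
                merged[i][j] += weights[m] / total * normalized[m][i][j]
    return merged


def rwr_on_multigraph(networks, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart using the merged transition matrix."""
    n = len(networks[0])
    w = merged_transition(networks)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical example: two data sources over three genes.
net_a = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # edge 0-1 only
net_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # edges 0-1 and 1-2
merged = merged_transition([net_a, net_b])
scores = rwr_on_multigraph([net_a, net_b], seeds={0})
```

In this toy case the edge 0-1 is supported by both sources while 1-2 appears in only one, so a walker at gene 1 moves to gene 0 more often than to gene 2. Evidence that agrees across sources therefore reinforces an edge without collapsing the multigraph into a single weighted graph beforehand.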
This work is further extended by additionally considering phenotype data. It is widely understood that phenotype
information can be used to improve the discovery of dis (...truncated)