Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data

BMC Genomics, Dec 2012

High-throughput experiments have produced many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed that work on various types of genomic data sources. As a single source of genomic data is prone to bias, incompleteness and noise, integration of different genomic data sources is in high demand for reliable disease gene identification. In contrast to the commonly adopted data integration approach, which integrates separate lists of candidate genes derived from each single data source, we merge various genomic networks into a multigraph, which can hold multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance for prioritizing disease genes. A random walk is then developed to work on multigraphs, using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature. The proposed RWRM (Random Walk with Restart on Multigraphs) model and CHN (Complex Heterogeneous Network) model are effective in data integration for candidate gene prioritization.

Li and Li BMC Genomics 2012, 13(Suppl 7):S27
http://www.biomedcentral.com/1471-2164/13/S7/S27

PROCEEDINGS Open Access

Yongjin Li1*, Jinyan Li2*

From Asia Pacific Bioinformatics Network (APBioNet) Eleventh International Conference on Bioinformatics (InCoB2012), Bangkok, Thailand. 3-5 October 2012

1 Center for Systems Biology, University of Texas at Dallas, USA
2 Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology, Sydney, Australia
* Corresponding authors

© 2012 Li et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: High-throughput experiments have produced many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed that work on various types of genomic data sources. As a single source of genomic data is prone to bias, incompleteness and noise, integration of different genomic data sources is in high demand for reliable disease gene identification.

Results: In contrast to the commonly adopted data integration approach, which integrates separate lists of candidate genes derived from each single data source, we merge various genomic networks into a multigraph, which can hold multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance for prioritizing disease genes. A random walk is then developed to work on multigraphs, using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature.

Conclusions: The proposed RWRM (Random Walk with Restart on Multigraphs) model and CHN (Complex Heterogeneous Network) model are effective in data integration for candidate gene prioritization.

Background

Reliable identification of disease genes is an important task in biomedical research, useful for uncovering the mechanism of a disease and for revealing therapeutic targets. Family-based genetic linkage analysis has been widely conducted to determine regions in the chromosomes of a genome which have large genetic effects on a disease [1]. Each such region is called a susceptible locus, which may cover dozens or even hundreds of genes. The genes in a susceptible locus are candidate disease genes, which can be further narrowed down to the real disease genes by computational or experimental methods. In the Online Mendelian Inheritance in Man (OMIM) database [2], which stores the latest data obtained by linkage analysis, there are still thousands of disease loci in which the real disease-causing genes have not been identified. Sophisticated computational algorithms have recently been proposed to prioritize those candidate genes [3-7] and so address this problem. However, most of these algorithms are based on a single data source. As a single data source is prone to bias, incompleteness and noise, integration of various genomic data sources is in high demand for reliable prioritization of a set of candidate genes. The top-ranked candidate genes are then the most likely to be the real disease genes.

A commonly adopted data integration approach is to integrate separate lists of candidate genes derived from each single data source. A notable example is ENDEAVOUR [8], which handles nine data sources including sequence data, gene annotation data, etc. It was implemented in a rank aggregation based integration (RABI) framework consisting of two stages. In the first stage, a ranked list of candidate genes is determined for each data source according to their similarity to known disease genes. Subsequently, these ranked lists are integrated into one ranked list by using N-dimensional order statistics (NDOS) [9]. In an earlier work [10], we improved the performance of ENDEAVOUR by using a random walk with restart (RWR) as the ranking algorithm in the first stage, and a discounted rating system (DRS) in the second stage to combine the ranked lists.

Merging separate lists of candidate disease genes derived from single data sources with bias and noise can inflate the uncertainties in the data, which may then propagate into the final ranking. To address this problem, it is better to eliminate the bias and noise by first merging the single data sources into an integrated data source, and only then to prioritize a set of candidate genes. This work proposes a novel integration method that merges various genomic networks into a multigraph, which can hold multiple edges between a pair of nodes. We then run a random walk on the multigraph to find disease genes.
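As a rough illustration of the random walk with restart (RWR) ranking mentioned above, the following is a minimal sketch on a single gene network. It assumes the commonly used update p <- (1 - r) W p + r p0 with a column-normalized adjacency matrix W; the restart value 0.7, the toy adjacency matrix and all function names are illustrative assumptions, not values or code from the paper.

```python
def column_normalize(adj):
    """Scale each column of an adjacency matrix so that it sums to 1."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change drops below tol."""
    n = len(adj)
    w = column_normalize(adj)
    # The walker starts (and restarts) uniformly on the known disease genes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


# Hypothetical toy network: a path 0 - 1 - 2 - 3, with gene 0 as the known disease gene.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seeds={0})
candidates = [1, 2, 3]
# Candidates are ranked by their steady-state visiting probability.
ranking = sorted(candidates, key=lambda g: scores[g], reverse=True)
print(ranking)
```

In this toy case the candidates closer to the seed gene receive higher scores, which is the intuition behind using RWR as a similarity measure to known disease genes.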
Many random walk models have recently been introduced to solve different kinds of problems in bioinformatics. For example, Köhler et al. [11] used the RWR algorithm to prioritize candidate genes. Macropol et al. [12] proposed a repeated random walks algorithm to predict protein complexes from the PPI network. Nibbe et al. [13] used random walk models to identify disease-relevant subnetworks from the PPI network, and studied the crosstalk between them. However, none of these models can work on multigraphs, because multiple edges between a pair of nodes complicate the calculation of transition probabilities. In this work, we first construct separate gene networks corresponding to different data sources, and then merge these networks into a single network defined by a multigraph. When our random walk algorithm runs on the merged network, the transition probability is calculated as the expected value of the transition probabilities from the multiple networks. Our algorithm was compared with four RABI models [8,10]. On a benchmark data set covering 36 genetic diseases [11], our proposed algorithm achieved an AUC value of 89.4%, much higher than the four RABI models. Our method is named RWRM (Random Walk with Restart on Multigraphs).

This work is further deepened by additionally considering phenotype data. It is widely understood that phenotype information can be used to improve the discovery of dis (...truncated)
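The expected-value transition probability described for RWRM can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each network is treated as one layer of the multigraph, all layers are weighted equally, and for a given node the average is taken only over layers in which that node has at least one edge. The layer names, the toy data, the restart value 0.7 and the function names are hypothetical.

```python
def column_normalize(adj):
    """Scale each column of an adjacency matrix so that it sums to 1."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def multigraph_transition(layers):
    """Average the per-layer transition matrices column by column.

    Assumption: equal layer weights, and a layer is skipped for node j
    when j has no edge in that layer.
    """
    n = len(layers[0])
    per_layer = [column_normalize(adj) for adj in layers]
    merged = [[0.0] * n for _ in range(n)]
    for j in range(n):
        active = [
            w for adj, w in zip(layers, per_layer)
            if any(adj[i][j] for i in range(n))
        ]
        if not active:
            continue  # node j is isolated in every layer
        for i in range(n):
            merged[i][j] = sum(w[i][j] for w in active) / len(active)
    return merged


def rwr_on_transition(w, seeds, restart=0.7, n_iter=200):
    """Run p <- (1 - r) * W p + r * p0 for a fixed number of iterations."""
    n = len(w)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(n_iter):
        p = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
    return p


# Two hypothetical layers over genes 0, 1, 2.
ppi = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # edges 0-1 and 1-2
coexpr = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # edge 0-1 only
merged = multigraph_transition([ppi, coexpr])
scores = rwr_on_transition(merged, seeds={0})
print(merged)
print(scores)
```

Here gene 1 moves to gene 0 with probability (0.5 + 1.0) / 2 = 0.75, because the 0-1 link is supported by both layers, while the 1-2 link, present in one layer only, gets 0.25. Averaging in this way lets edges confirmed by several data sources carry more weight, which is one plausible reading of the noise tolerance claimed for the multigraph.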


This is a preview of a remote PDF: https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/1471-2164-13-S7-S27
Article home page: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-S7-S27

Yongjin Li, Jinyan Li. Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data. BMC Genomics, 2012, 13(Suppl 7):S27. DOI: 10.1186/1471-2164-13-S7-S27