Mining breast cancer genes with a network based noise-tolerant approach (pdf)


Nie and Yu BMC Systems Biology 2013, 7:49
http://www.biomedcentral.com/1752-0509/7/49

RESEARCH ARTICLE Open Access

Mining breast cancer genes with a network based noise-tolerant approach

Yaling Nie and Jingkai Yu*

Abstract

Background: Mining novel breast cancer genes is an important task in breast cancer research. Many approaches prioritize candidate genes based on their similarity to known cancer genes, usually by integrating multiple data sources. However, different types of data often contain varying degrees of noise. For effective data integration, it's important to design methods that work robustly with respect to noise.

Results: Gene Ontology (GO) annotations are often used in cancer gene mining. However, the vast majority of GO annotations are computationally derived and thus not completely accurate. A set of genes annotated with breast cancer enriched GO terms was adopted here as a set of source data with realistic noise. A novel noise tolerant approach was proposed to rank candidate breast cancer genes using noisy source data within the framework of a comprehensive human Protein-Protein Interaction (PPI) network. Performance of the proposed method was quantitatively evaluated by comparing it with the more established random walk approach. Results showed that the proposed method exhibited better performance in ranking known breast cancer genes and higher robustness against data noise than the random walk approach. When noise started to increase, the proposed method maintained relatively stable performance, while the random walk approach showed a drastic performance decline; when noise increased to a large extent, the proposed method still achieved better performance than random walk did.

Conclusions: A novel noise tolerant method was proposed to mine breast cancer genes.
Compared to the well-established random walk approach, it showed better performance in correctly ranking cancer genes and worked robustly with respect to noise within source data. To the best of our knowledge, it's the first such effort to quantitatively analyze noise tolerance between different breast cancer gene mining methods. The sorted gene list can be valuable for breast cancer research. The proposed quantitative noise analysis method may also prove useful for other data integration efforts. It is hoped that the current work can lead to more discussion about the influence of data noise on different computational methods for mining disease genes.

Keywords: Network, Breast cancer, Data noise, Noise tolerance

* Correspondence: National Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China

Background

Novel disease genes remain difficult to identify in most genetic diseases, and in particular in highly polygenic disorders. Not all genes have yet been detected, even for diseases whose molecular mechanisms are partially known [1], for instance breast cancer [2]. Breast cancer is a common cancer and a major cause of cancer death among females around the world, making up 23% of total cancer cases and 14% of cancer deaths [3]. Mining breast cancer genes is conducive to understanding its pathogenic mechanism and to searching for effective treatments. With the rapid growth of disease-related genomic and functional data, computational approaches can be utilized to mine for new cancer genes [4].

In the past two decades, a number of computational methods have been developed to mine potential disease-related genes. Most of those methods rank candidate genes based on the idea that proteins similar to each other tend to cause similar or the same diseases [5]. They involve setting up a candidate gene set to be compared with a known disease gene set on their physical or functional attributes [6].
On one hand, physical attribute-based methods include screening direct neighbors of known disease genes in the PPI network [7,8], comparing shortest path length [9] between candidate genes and known disease genes, and clustering or graph partitioning to uncover disease modules in the interaction network [10-12]. Some approaches also used global network features to find genes similar to known disease genes [13,14]. On the other hand, several methods rely on functional similarities between candidate and disease genes [15]; for example, some methods measured similarity between genes by their functional annotations [16] (e.g., Gene Ontology (GO) [17]). Methods using other data sources have also been developed, such as gene expression [18,19], biological pathways and sequence features [20].

© 2013 Nie and Yu; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cancers such as breast cancer are complex and heterogeneous in nature; cancer-related genes often do not function in isolation but interact with one another [5]. Integrating multiple data types was found to be effective for gene mining in alleviating problems caused by incomplete information [21-23]. For instance, ENDEAVOUR [24] is an online tool that uses multiple data sources. It integrates candidate gene rankings from different data sources into a final ranking with the order statistic algorithm. However, different data categories usually contain inherent noise or systematic errors [25]. For instance, data from computational predictions will no doubt contain some amount of uncertainty.
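The two simplest network-proximity ideas mentioned above, screening direct neighbors of known disease genes and comparing shortest path lengths, can be illustrated with a small sketch. This is not the method proposed in this paper or any cited tool; it is a minimal illustration under stated assumptions. The node names (K1, K2 for known genes, C1 to C5 for candidates), the edge list, and the function names are all hypothetical, and the tie-breaking rule (more known direct neighbors ranks higher) is an arbitrary choice made for the example.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) interaction pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def rank_candidates(graph, known, candidates):
    """Rank candidates by closeness to the known gene set.

    Returns (name, min_distance, direct_hits) tuples sorted by
    smallest minimum hop distance first, then most known direct
    neighbors, then name. Unreachable candidates get distance inf.
    """
    scored = []
    for cand in candidates:
        dist = shortest_path_lengths(graph, cand)
        reachable = [dist[k] for k in known if k in dist]
        min_dist = min(reachable) if reachable else float("inf")
        direct_hits = len(graph.get(cand, set()) & set(known))
        scored.append((cand, min_dist, direct_hits))
    return sorted(scored, key=lambda item: (item[1], -item[2], item[0]))


# Hypothetical toy network; C5 has no interactions at all.
toy_edges = [("K1", "C1"), ("K1", "C2"), ("K2", "C1"), ("C1", "C3"), ("C3", "C4")]
toy_graph = build_graph(toy_edges)

for name, min_dist, direct_hits in rank_candidates(
    toy_graph, known=["K1", "K2"], candidates=["C1", "C2", "C3", "C4", "C5"]
):
    print(name, min_dist, direct_hits)
```

A sketch like this uses only local topology, so a single spurious edge or a wrongly labeled known gene can change a candidate's rank substantially, which is one way to see why noise in the source data matters for such methods.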
Experimental data obtained from different labs or experimental platforms can contain an appreciable amount of noise. Noise in source data can push computed results away from their true values and lead to erroneous reporting. A better method must be able to tolerate a certain amount of noise, which makes the integration of different data sources more applicable to real-life scenarios. Although some approaches can work with precision when presented with highly accurate data, few studies have shown that those methods work robustly when faced with increasingly noisy data. A number of papers have discussed the task of balancing noise and precision when using multiple data sources for cancer gene mining; however, hardly anyone has analyzed the noise problem quantitatively [26-29]. It is important to calibrate how robustly a method works with respect to noise, namely, how fast a method deteriorates when the percentage of noise in source data goes up. With that knowledge, users can then be confident about the method's effectiveness when it is applied to real-life data sets.

To tackle the data noise problem, a novel noise tolerant data fusion approach was proposed here for breast cancer gene mining (Figure 1), which integrated information fro (...truncated)