Mining breast cancer genes with a network based noise-tolerant approach
Nie and Yu BMC Systems Biology 2013, 7:49
http://www.biomedcentral.com/1752-0509/7/49
RESEARCH ARTICLE
Open Access
Yaling Nie and Jingkai Yu*
Abstract
Background: Mining novel breast cancer genes is an important task in breast cancer research. Many approaches
prioritize candidate genes based on their similarity to known cancer genes, usually by integrating multiple data
sources. However, different types of data often contain varying degrees of noise. For effective data integration, it’s
important to design methods that work robustly with respect to noise.
Results: Gene Ontology (GO) annotations are often used in cancer gene mining studies. However, the vast majority
of GO annotations are computationally derived and thus not completely accurate. A set of genes annotated with breast
cancer enriched GO terms was adopted here as a set of source data with realistic noise. A novel noise tolerant approach
was proposed to rank candidate breast cancer genes using noisy source data within the framework of a comprehensive
human Protein-Protein Interaction (PPI) network. Performance of the proposed method was quantitatively evaluated by
comparing it with the more established random walk approach. Results showed that the proposed method exhibited
better performance in ranking known breast cancer genes and higher robustness against data noise than the random
walk approach. When noise started to increase, the proposed method was able to maintain relatively stable
performance, while the random walk approach showed drastic performance decline; when noise increased to a large
extent, the proposed method was still able to achieve better performance than random walk did.
Conclusions: A novel noise tolerant method was proposed to mine breast cancer genes. Compared to the well-established
random walk approach, it showed better performance in correctly ranking cancer genes and worked
robustly with respect to noise within source data. To the best of our knowledge, this is the first effort to
quantitatively compare noise tolerance across different breast cancer gene mining methods. The sorted gene list can
be valuable for breast cancer research. The proposed quantitative noise analysis method may also prove useful for
other data integration efforts. It is hoped that the current work can lead to more discussions about influence of data
noise on different computational methods for mining disease genes.
Keywords: Network, Breast cancer, Data noise, Noise tolerance
Background
Novel disease genes remain difficult to identify in most
genetic diseases, and in particular, in highly polygenic
disorders. Currently, not all genes have yet been detected
even for those diseases whose molecular mechanisms
are partially known [1], for instance, breast cancer [2].
Breast cancer is a common cancer and a major cause of
cancer death among females around the world, accounting
for 23% of total cancer cases and 14% of cancer
deaths [3]. Mining breast cancer genes helps to
understand its pathogenic mechanisms and to search for
* Correspondence:
National Key Laboratory of Biochemical Engineering, Institute of Process
Engineering, Chinese Academy of Sciences, Beijing 100190, China
effective treatments. With rapid growth of disease-related
genomic and functional data, computational approaches
can be utilized to mine for new cancer genes [4].
In the past two decades, a number of computational
methods have been developed to mine potential
disease-related genes. Most of those methods rank candidate
genes based on the idea that proteins similar to each other
tend to cause the same or similar diseases [5]. They involve
setting up a candidate gene set to be compared with a
known disease gene set on their physical or functional
attributes [6]. On the one hand, physical attribute-based
methods include screening direct neighbors of known
disease genes in the PPI network [7,8], comparing shortest
path lengths [9] between candidate genes and known
© 2013 Nie and Yu; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
disease genes, and clustering or graph partitioning to uncover
disease modules in the interaction network [10-12]. Some
approaches also used global network features to find genes
similar to known disease genes [13,14]. On the other
hand, several methods rely on functional similarities
between candidate and disease genes [15]; for example,
some methods measure similarity between genes by
their functional annotations [16] (e.g., Gene Ontology
(GO) [17]). Methods using other data sources have also
been developed, such as gene expression [18,19], biological
pathways and sequence features [20].
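The two simplest physical attribute-based strategies mentioned above, direct-neighbor screening and shortest path length to known disease genes, can be illustrated with a minimal Python sketch. The network, the gene names and the helper functions below are hypothetical and serve only as an illustration; they are not the implementations used in the cited studies.

```python
from collections import deque

# Hypothetical toy PPI network as an adjacency map (illustrative names only).
PPI = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}


def direct_neighbors(ppi, known):
    """Candidates that interact directly with at least one known disease gene."""
    found = set()
    for gene in known:
        found |= ppi.get(gene, set())
    return found - set(known)


def distance_to_known(ppi, known):
    """Multi-source BFS: shortest path length from each gene to its nearest known gene."""
    dist = {g: 0 for g in known if g in ppi}
    queue = deque(dist)
    while queue:
        gene = queue.popleft()
        for nb in ppi[gene]:
            if nb not in dist:
                dist[nb] = dist[gene] + 1
                queue.append(nb)
    return dist


def rank_by_distance(ppi, known):
    """Rank candidate genes by increasing distance; ties are broken by gene name."""
    dist = distance_to_known(ppi, known)
    candidates = [g for g in ppi if g not in known]
    return sorted(candidates, key=lambda g: (dist.get(g, float("inf")), g))


known_genes = {"A"}  # assumed known disease gene for this toy example
neighbors = direct_neighbors(PPI, known_genes)
ranking = rank_by_distance(PPI, known_genes)
print(sorted(neighbors), ranking)
```

Both strategies use only local topology, which is one reason global network measures such as random walks are often preferred on large interaction networks.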
Cancers such as breast cancer are complex and heterogeneous in nature; cancer-related genes often do not
function in isolation but interact with one another [5].
Integrating multiple data types was found to be effective for
gene mining in alleviating problems caused by incomplete
information [21-23]. For instance, ENDEAVOUR [24] is
an online tool that uses multiple data sources. It
integrates candidate gene rankings from different data
sources into a final ranking using order statistics. However, different data categories usually contain
inherent noise or systematic errors [25]. For instance, data
from computational predictions inevitably contain
some amount of uncertainty. Experimental data obtained
from different labs or experimental platforms can contain
an appreciable amount of noise. Noise in source data can
push computed results away from their true values, leading
to erroneous reporting.
A better method must be able to tolerate a certain
amount of noise, which makes the integration of different
data sources more applicable to real-life scenarios. Although
some approaches can work with precision
when presented with highly accurate data, few studies
have shown that those methods work robustly when
faced with increasingly noisy data. A number of papers
have discussed the task of balancing noise and precision
when using multiple data sources for cancer gene mining;
however, hardly any have analyzed the noise problem
quantitatively [26-29]. It is important to calibrate how
robustly a method works with respect to noise, namely, how
fast a method deteriorates as the percentage of noise in
source data goes up. With that knowledge, users can then
be confident about the method’s effectiveness when it is
applied to real-life data sets.
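The idea of calibrating robustness against noise can be made concrete with a small sketch: replace a growing fraction of the source genes with randomly chosen genes, re-run a ranking method, and track how the rank of held-out known genes changes. The Python code below does this for a plain random walk with restart on a hypothetical toy network. The noise model (random replacement of seed genes), the restart probability of 0.3, the gene names and all function names are assumptions made for illustration; they are not taken from the method proposed in this paper.

```python
import random

# Hypothetical toy PPI edges; gene names are illustrative only.
EDGES = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5"),
         ("g5", "g6"), ("g6", "g7"), ("g7", "g8"), ("g6", "g8"), ("g8", "g9"),
         ("g9", "g10")]


def build_network(edges):
    """Undirected adjacency map; every gene that appears has at least one neighbor."""
    ppi = {}
    for a, b in edges:
        ppi.setdefault(a, set()).add(b)
        ppi.setdefault(b, set()).add(a)
    return ppi


def random_walk_with_restart(ppi, seeds, restart=0.3, tol=1e-12, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change is below tol."""
    genes = sorted(ppi)
    seeds = {g for g in seeds if g in ppi}
    if not seeds:
        raise ValueError("no seed gene is present in the network")
    p0 = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {g: restart * p0[g] for g in genes}
        for g in genes:
            # Each gene spreads its walking mass evenly over its neighbors.
            share = (1.0 - restart) * p[g] / len(ppi[g])
            for nb in ppi[g]:
                nxt[nb] += share
        change = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if change < tol:
            break
    return p


def add_noise(seeds, ppi, noise_fraction, rng, exclude=()):
    """Assumed noise model: swap a fraction of seed genes for random non-seed genes."""
    seeds = sorted(seeds)
    n_swap = round(noise_fraction * len(seeds))
    kept = rng.sample(seeds, len(seeds) - n_swap)
    pool = sorted(set(ppi) - set(seeds) - set(exclude))
    return set(kept) | set(rng.sample(pool, n_swap))


def mean_rank(scores, seeds, held_out):
    """Mean 1-based rank of held-out genes among non-seed genes (lower is better)."""
    candidates = sorted((g for g in scores if g not in seeds),
                        key=lambda g: (-scores[g], g))
    return sum(candidates.index(g) + 1 for g in held_out) / len(held_out)


PPI = build_network(EDGES)
SEEDS = {"g1", "g2", "g4"}   # assumed source data
HELD_OUT = {"g3"}            # assumed known gene withheld for evaluation


def evaluate(noise_fraction, seed=0):
    rng = random.Random(seed)
    noisy = add_noise(SEEDS, PPI, noise_fraction, rng, exclude=HELD_OUT)
    return mean_rank(random_walk_with_restart(PPI, noisy), noisy, HELD_OUT)


for level in (0.0, 1 / 3, 2 / 3):
    print(f"noise={level:.2f}  mean rank of held-out genes={evaluate(level):.1f}")
```

Plotting the mean rank against the noise fraction, averaged over many random replacements, gives a simple degradation curve of the kind described above; a noise-tolerant method would show a flatter curve.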
To tackle the data noise problem, a novel noise tolerant
data fusion approach was proposed here for breast cancer
gene mining (Figure 1), which integrated information
fro (...truncated)