Predicting gene function using similarity learning
Phuong and Nhung BMC Genomics 2013, 14(Suppl 4):S4
http://www.biomedcentral.com/1471-2164/14/S4/S4
RESEARCH
Open Access
Tu Minh Phuong1*, Ngo Phuong Nhung2
From IEEE International Conference on Bioinformatics and Biomedicine 2012
Philadelphia, PA, USA. 4-7 October 2012
Abstract
Background: Computational methods that make use of heterogeneous biological datasets to predict gene
function provide a cost-effective and rapid way for annotating genomes. A common framework shared by many
such methods is to construct a combined functional association network from multiple networks representing
different sources of data, and use this combined network as input to network-based or kernel-based learning
algorithms. In these methods, a key factor contributing to the prediction accuracy is the network quality, which is
the ability of the network to reflect the functional relatedness of gene pairs. To improve the network quality, a
large effort has been spent on developing methods for network integration. These methods, however, produce
networks that then remain unchanged, and almost no effort has been made to optimize the networks after their
construction.
Results: Here, we propose an alternative method to improve the network quality. The proposed method takes as
input a combined network produced by an existing network integration algorithm, and reconstructs this network
to better represent the co-functionality relationships between gene pairs. At the core of the method is a learning
algorithm that can learn a measure of functional similarity between genes, which we then use to reconstruct the
input network. In experiments with yeast and human, the proposed method produced improved networks and
achieved more accurate results than two other leading gene function prediction approaches.
Conclusions: The results show that it is possible to improve the accuracy of network-based gene function
prediction methods by optimizing combined networks with appropriate similarity measures learned from data. The
proposed learning procedure can handle noisy training data and scales well to large genomes.
Background
The increasing number of sequenced genomes makes it
important to develop methods that can assign functions
to newly discovered genes in a timely and cost-effective
manner. Traditional laboratory methods, while accurate
and reliable, would require enormous effort and time to
identify functions for every gene. Computational
approaches that utilize diverse biological datasets to
generate automated predictions are useful in this situation as they can guide laboratory experiments and facilitate more rapid annotation of genomes.
* Correspondence:
1 Department of Computer Science, Posts & Telecommunications Institute of Technology, Hanoi, Viet Nam
Full list of author information is available at the end of the article
Existing computational approaches to gene function
prediction have relied on a variety of genomic and
proteomic data. Exploiting the similarities between DNA
or protein sequences to infer gene function was the first
approach tested and has been the most widely used
approach to date. Later, other types of genomic and
proteomic data were also shown to be useful for this
problem. Researchers have used microarray expression
data [1], protein 3D structures [2], protein domain configuration [3], protein-protein interaction networks [4],
and phylogenetic profiles [5] to predict functions of
genes. Recently, inferring gene function simultaneously
from different types of biological data has been shown
to deliver more accurate predictions and has attracted
considerable research interest [6-16].
Many methods for inferring functions of genes from
heterogeneous datasets share a common framework in
which a functional association between genes is first constructed and then used as input for learning algorithms.
© 2013 Phuong and Nhung; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
A functional association can be represented as a network
with nodes corresponding to genes and edges representing the co-functionalities of gene pairs. In such a network, each edge is usually assigned a weight representing
the strength of the co-functionality relationship between
the gene pair. A network of this kind is typically constructed in two steps. First, each dataset is used to create
an individual network that captures the co-functionality
of gene pairs, as implied by this dataset. For vectorial
data, one can calculate edge weights as the similarity
scores between genes using appropriate similarity metrics,
for example the Pearson correlation coefficient, and then
form the networks by connecting neighboring nodes. Data already given in the form of networks, for
example protein-protein interactions, are used directly.
The second step constructs a single combined association
network by integrating the individual ones. A strategy
commonly used in this step is to form the combined network as a weighted sum of individual ones. Here, each
network is weighted according to its usefulness in predicting annotations for a group of genes that share a
known specific function. Previous studies have used various regression or other learning-based algorithms to estimate network weights.
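The two construction steps described above can be illustrated with a minimal sketch. It is not the implementation used in any of the cited studies. The function names (`pearson`, `knn_network`, `combine_networks`), the choice of k nearest neighbors, the clipping of negative correlations to zero, and the hand-picked network weights are all assumptions made for illustration. In practice the weights would be estimated by a regression or other learning algorithm.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def knn_network(profiles, k):
    """Step 1: build a symmetric weighted network from vectorial data.

    Each gene is linked to its k most correlated genes. Negative
    correlations are clipped to zero (an assumption of this sketch).
    """
    n = len(profiles)
    sim = [[max(pearson(profiles[i], profiles[j]), 0.0) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    net = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted(range(n), key=lambda j: sim[i][j], reverse=True)[:k]
        for j in neighbors:
            net[i][j] = net[j][i] = sim[i][j]
    return net

def combine_networks(networks, weights):
    """Step 2: form the combined network as a weighted sum of networks."""
    n = len(networks[0])
    return [[sum(w * net[i][j] for w, net in zip(weights, networks))
             for j in range(n)] for i in range(n)]

# Toy data: four genes with hypothetical expression profiles, plus a
# hypothetical interaction network that is already in network form.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
expr_net = knn_network(profiles, k=1)
ppi_net = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

# Hand-picked weights, for illustration only.
combined = combine_networks([expr_net, ppi_net], [0.7, 0.3])
```

In this toy example, genes 0 and 1 have strongly correlated profiles and also interact, so their edge in `combined` receives a contribution from both sources, while genes 0 and 2 remain unconnected.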
Given a functional association network, the next step
is to use this network to propagate functional labels
from a group of annotated genes to other genes. There
are two main types of approaches for this step.
Approaches of the first type create a kernel function
from the co-functionality relationships encoded in the
network and use this kernel with kernel-based classification algorithms [8,9,17]. In such approaches, genes with
known annotations serve as labeled examples for training. Approaches of the second type use graph-based
algorithms, which propagate labels from annotated
genes to other genes based on graph proximity. Methods in this group range from simple nearest-neighbor
counting algorithms [16] to more sophisticated statistical methods such as graph-based semi-supervised learning algorithms [9] and Markov random fields [18] (see
[19] for a more complete list of methods). On a number
of benchmark datasets, graph-based and kernel-based
approaches have shown comparable prediction accuracy,
but graph-based approaches are generally faster [11,20].
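As an illustration of the graph-based family, the sketch below propagates labels over a weighted network with the standard iterative update f <- alpha * S f + (1 - alpha) * y, where S is the symmetrically normalized weight matrix. This is a generic label propagation scheme chosen for illustration, not necessarily the algorithm used in the cited works. The function name `propagate_labels`, the values of `alpha` and `iters`, and the toy network are assumptions.

```python
import math

def propagate_labels(W, y, alpha=0.8, iters=50):
    """Score genes by propagating labels over a weighted network.

    W: symmetric weight matrix (list of lists).
    y: initial labels, 1.0 for genes annotated with the function of
       interest and 0.0 otherwise.
    Iterates f <- alpha * S f + (1 - alpha) * y, where
    S = D^{-1/2} W D^{-1/2} and D is the diagonal degree matrix.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j])
          if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Toy network: genes 0-1-2 form a chain, genes 3-4 form a separate pair.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
y = [1.0, 0.0, 0.0, 0.0, 0.0]  # only gene 0 is annotated
scores = propagate_labels(W, y)
```

Genes closer to the annotated gene in the network receive higher scores, and genes in a component with no annotated member receive a score of zero, which is the sense in which such methods rank candidates by graph proximity.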
The prediction accuracy of both graph-based and kernel-based approaches largely depends on the ability of
the network to capture the functional associations
between genes. To improve the network quality, previous studies have focused on improving the integration
step, or more (...truncated)