Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach
Hisashi Kashima [0, 2]
Yoshihiro Yamanishi [1, 6]
Tsuyoshi Kato [5]
Masashi Sugiyama [4]
Koji Tsuda [3]
Associate Editor: Jonathan Wren
[0] Present address: Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-8656 Tokyo, Japan
[1] Mines ParisTech, Centre for Computational Biology, 35 rue Saint-Honore, F-77305 Fontainebleau Cedex, France
[2] IBM Research, Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato, Kanagawa, 242-8502, Japan
[3] National Institute of Advanced Industrial Science and Technology, Computational Biology Research Center (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
[4] Tokyo Institute of Technology, Department of Computer Science, 2-12-1, O-okayama, Meguro-ku, Tokyo 152-8552
[5] Ochanomizu University, Center for Informational Biology, 2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-8610
[6] INSERM, U900, F-75248, Paris, France
Motivation: The existing supervised methods for biological network inference work on each network individually, based only on intra-species information such as gene expression data. We believe that it will be more effective to use genomic data and cross-species evolutionary information from different species simultaneously, rather than to use the genomic data alone.
Results: We created a new semi-supervised learning method called Link Propagation for inferring biological networks of multiple species based on genome-wide data and evolutionary information. The new method was applied to the simultaneous reconstruction of three metabolic networks, those of Caenorhabditis elegans, Helicobacter pylori and Saccharomyces cerevisiae, based on gene expression similarities and amino acid sequence similarities. The experimental results showed that the new simultaneous network inference method consistently improves the predictive performance over the individual network inferences, and that it also outperforms other established methods, such as the pairwise support vector machine, in both accuracy and speed.
Availability: The software and data are available at http://cbio.ensmp.fr/~yyamanishi/LinkPropagation/.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
INTRODUCTION
Most biological functions involve the coordinated actions of many
proteins in the cell, and the complexity of biological systems arises
as a result of such interactions. It is therefore important to understand
biological systems through the analysis of the relationships amongst
many proteins. The functional behaviors of proteins in the biological
system can be represented by a graph with the proteins as nodes
and with their functional interactions as edges. Examples of
such biological networks include metabolic networks, protein-protein
interaction networks, gene regulatory networks and signaling
networks. A grand challenge in recent computational biology is
to infer the structures of such biological networks from various
genomic data and molecular information. Recent developments in
biotechnology have contributed to an increasing amount of
high-throughput experimental data on transcriptome and proteome. Such
datasets are useful sources to computationally infer large biological
networks.
To infer the biological network of a species of interest, there
are two possible information sources, intra-species information
such as genomic data and cross-species information such as
evolutionary data. The first type of information includes genomic or
experimental data about genes or proteins of the target species, for
example, gene order information for the chromosomes of bacterial
genomes (Overbeek et al., 1999), phylogenetic profiles (Pellegrini
et al., 1999), and gene expression patterns (Kharchenko et al.,
2004). Recently, a variety of supervised statistical methods
that infer biological networks by integrating
these types of data have been developed within dimension
reduction and binary classification frameworks, and they have
been tailored to estimate how likely each link is to exist
within the protein set. Examples of metric learning include kernel
canonical correlation analysis (Yamanishi et al., 2004), dimension
reduction (Yamanishi et al., 2005) and the em-algorithm (Kato
et al., 2005). A typical binary classification framework such as
a support vector machine with pairwise kernels (P-SVM;
Ben-Hur and Noble, 2005) takes pairs of proteins as inputs for the
classifiers. Owing to their applicability to many biological networks
and their good predictive performance, the supervised network
inference methods are becoming popular tools in bioinformatics
and computational biology. However, they require considerable
computational resources and they suffer from serious scalability
problems. For example, the time complexity of the quadratic
programming problem for the P-SVM is O(m^6), where m is the
number of proteins in the largest network, and even worse, the space
complexity is O(m^4), which is just for storing the kernel matrix.
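
To make the scaling argument concrete, the following minimal sketch builds a kernel matrix over protein pairs and counts its entries. It assumes the symmetric tensor-product form of the pairwise kernel, K((a, b), (c, d)) = K(a, c)K(b, d) + K(a, d)K(b, c); the text above does not state which pairwise kernel is meant, so this form, the function names and the toy values are illustrative assumptions only.

```python
from itertools import combinations


def pairwise_kernel(K, a, b, c, d):
    """Kernel between protein pairs (a, b) and (c, d).

    Assumed form (not specified in the text): symmetric tensor product
    of a protein-level kernel K.
    """
    return K[a][c] * K[b][d] + K[a][d] * K[b][c]


def pairwise_kernel_matrix(K):
    """Build the kernel matrix over all unordered protein pairs."""
    m = len(K)
    pairs = list(combinations(range(m), 2))
    matrix = [
        [pairwise_kernel(K, a, b, c, d) for (c, d) in pairs]
        for (a, b) in pairs
    ]
    return pairs, matrix


def pair_kernel_entries(m):
    """Number of entries in the pair-level kernel matrix for m proteins."""
    n_pairs = m * (m - 1) // 2  # grows as m^2
    return n_pairs * n_pairs    # grows as m^4


# Toy protein-level kernel for three hypothetical proteins.
K = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
pairs, P = pairwise_kernel_matrix(K)
print(pairs)                      # the three unordered pairs
print(pair_kernel_entries(1000))  # entries needed for m = 1000
```

With m proteins there are on the order of m^2 candidate pairs, so the pair-level kernel matrix has on the order of m^4 entries; for m = 1000 this is already about 2.5 × 10^11 values, which illustrates why storing the matrix alone becomes the bottleneck.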
The other type of information that can be used for inferring
biological networks is evolutionary information about the
conservation of protein interactions, called interlog (Matthews
et al., 2001; Walhout et al., 2000). This concept is based on
the assumption that, if protein A interacts with protein B in one
species, then their orthologous proteins A' and B' in other species
are likely to interact with each other. This idea is used not only
for inferring physical protein-protein interactions but also for
automatic metabolic pathway reconstructions for the fully sequenced
genomes (Moriya et al., 2007). However, the interlog approach
cannot work if significant sequence homology is not detected
across different species, which means the number of detectable
interactions is limited and it is impossible to predict species-specific
interactions in any biological network. To date, the
genome-data-based approach and the evolutionary-information-based approach
have been studied separately for inferring biological networks (Kato
et al., 2005; Matthews et al., 2001; Walhout et al., 2000; Yamanishi
et al., 2004, 2005), but the two kinds of information should
complement each other and improve prediction reliability. Note that all
of the existing supervised network inference methods are designed
to infer an individual biological network from genomic data for each
species. Therefore, we believe that it will be more effective to use
genomic data and evolutionary information simultaneously, rather
than to use genomic data alone, in the framework of supervised
network inference. The effectiveness of combining the two has been
confirmed for a gene regulatory network in an unsupervised context
(Tamada et al., 2005).
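
The interlog assumption discussed above, and its main limitation, can be illustrated with a short sketch. It transfers each known interaction in a source species to the target species through an ortholog table, and predicts nothing when an ortholog is missing. The function name, the data layout and the protein labels are hypothetical and serve only to illustrate the idea; this is not the method proposed in this article.

```python
def transfer_interologs(source_edges, orthologs):
    """Predict target-species interactions from source-species interactions.

    source_edges: iterable of (a, b) protein pairs known to interact
                  in the source species.
    orthologs:    dict mapping a source protein to the set of its
                  orthologs in the target species (hypothetical layout).
    """
    predicted = set()
    for a, b in source_edges:
        # If either protein has no detectable ortholog, nothing is transferred.
        for a2 in orthologs.get(a, ()):
            for b2 in orthologs.get(b, ()):
                if a2 != b2:
                    predicted.add(frozenset((a2, b2)))
    return predicted


# Hypothetical example: A-B and B-C interact in the source species,
# but C has no detectable ortholog in the target species.
source_edges = [("A", "B"), ("B", "C")]
orthologs = {"A": {"A'"}, "B": {"B'"}}
predicted = transfer_interologs(source_edges, orthologs)
print(predicted)
```

Only the A'-B' interaction is predicted here. The B-C interaction cannot be transferred because C has no ortholog, and an interaction that exists only in the target species can never be recovered this way, which is the limitation that motivates combining evolutionary information with genomic data.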
In this article, we describe a new semi-supervised learning method
that we call Link Propagation for inferring the biological networks
of multiple species from genome-wide data and evolutionary
information. While the existing methods infer each of the networks
individual (...truncated)