Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach

https://bioinformatics.oxfordjournals.org/content/25/22/2962.full.pdf

Hisashi Kashima [0, 2], Yoshihiro Yamanishi [1, 6], Tsuyoshi Kato [5], Masashi Sugiyama [4], Koji Tsuda [3]

Associate Editor: Jonathan Wren

[0] Present address: Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-8656 Tokyo, Japan
[1] Mines ParisTech, Centre for Computational Biology, 35 rue Saint-Honore, F-77305 Fontainebleau Cedex, France
[2] IBM Research, Tokyo Research Laboratory, 1623-14 Shimo-tsuruma, Yamato, Kanagawa, 242-8502 Japan
[3] National Institute of Advanced Industrial Science and Technology, Computational Biology Research Center (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
[4] Tokyo Institute of Technology, Department of Computer Science, 2-12-1, O-okayama, Meguro-ku, Tokyo 152-8552
[5] Ochanomizu University, Center for Informational Biology, 2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-8610
[6] INSERM, U900, F-75248, Paris, France

Motivation: The existing supervised methods for biological network inference work on each of the networks individually based only on intra-species information such as gene expression data. We believe that it will be more effective to use genomic data and cross-species evolutionary information from different species simultaneously, rather than to use the genomic data alone.

Results: We created a new semi-supervised learning method called Link Propagation for inferring biological networks of multiple species based on genome-wide data and evolutionary information. The new method was applied to simultaneous reconstruction of three metabolic networks of Caenorhabditis elegans, Helicobacter pylori and Saccharomyces cerevisiae, based on gene expression similarities and amino acid sequence similarities.
The experimental results showed that the new simultaneous network inference method consistently improves the predictive performance over the individual network inferences, and that it also outperforms other established methods, such as the pairwise support vector machine, in both accuracy and speed.

Availability: The software and data are available at http://cbio.ensmp.fr/~yyamanishi/LinkPropagation/.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

INTRODUCTION

Most biological functions involve the coordinated actions of many proteins in the cell, and the complexity of biological systems arises as a result of such interactions. It is therefore important to understand biological systems through the analysis of the relationships among many proteins. The functional behaviors of proteins in the biological system can be represented by a graph with the proteins as nodes and with their functional interactions as edges. Examples of such biological networks include metabolic networks, protein-protein interaction networks, gene regulatory networks and signaling networks.

A grand challenge in recent computational biology is to infer the structures of such biological networks from various genomic data and molecular information. Recent developments in biotechnology have contributed to an increasing amount of high-throughput experimental data on the transcriptome and proteome. Such datasets are useful sources for computationally inferring large biological networks. To infer the biological network of a species of interest, there are two possible information sources: intra-species information such as genomic data, and cross-species information such as evolutionary data.
The first type of information includes genomic or experimental data about genes or proteins of the target species, for example, gene order information for the chromosomes of bacterial genomes (Overbeek et al., 1999), phylogenetic profiles (Pellegrini et al., 1999), and gene expression patterns (Kharchenko et al., 2004). Recently, a variety of supervised statistical methods for inferring biological networks by integrating these types of data have been developed within the metric learning and binary classification frameworks, and they have been tailored to estimate how likely each link is to exist within the protein set. Examples of metric learning include kernel canonical correlation analysis (Yamanishi et al., 2004), dimension reduction (Yamanishi et al., 2005) and the EM algorithm (Kato et al., 2005). A typical binary classification framework such as a support vector machine with pairwise kernels (P-SVM; Ben-Hur and Noble, 2005) takes pairs of proteins as inputs for the classifiers. Owing to their applicability to many biological networks and their good predictive performance, the supervised network inference methods are becoming popular tools in bioinformatics and computational biology. However, they require considerable computational resources and suffer from serious scalability problems. For example, the time complexity of the quadratic programming problem for the P-SVM is O(m^6), where m is the number of proteins in the largest network, and even worse, the space complexity is O(m^4) just for storing the kernel matrix, because the classifier operates on the O(m^2) protein pairs rather than on the m proteins.

The other type of information that can be used for inferring biological networks is evolutionary information about the conservation of protein interactions, called interologs (Matthews et al., 2001; Walhout et al., 2000). This concept is based on the assumption that, if two proteins interact in one species, then their orthologous proteins in other species are likely to interact with each other.
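The assumption just stated can be illustrated with a minimal sketch. This is not the method proposed in this article; the function name `transfer_interologs`, the protein identifiers and the ortholog table are all hypothetical and chosen only for illustration.

```python
def transfer_interologs(source_edges, orthologs):
    """Map known interactions in a source species onto a target species.

    source_edges: iterable of (protein, protein) pairs known to interact.
    orthologs: dict mapping a source protein to a set of target-species orthologs.
    Returns a set of frozenset pairs predicted to interact in the target species.
    """
    predicted = set()
    for a, b in source_edges:
        # A pair is transferred only if BOTH partners have a detectable ortholog.
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                if a_target != b_target:
                    predicted.add(frozenset((a_target, b_target)))
    return predicted


# Hypothetical toy data: s3 has no detectable ortholog in the target species.
source_edges = [("s1", "s2"), ("s2", "s3")]
orthologs = {"s1": {"t1"}, "s2": {"t2"}}
predicted = transfer_interologs(source_edges, orthologs)
```

In this toy example only the pair (t1, t2) is predicted. The interaction between s2 and s3 cannot be transferred because s3 has no ortholog, and a target protein with no source counterpart can never appear in a prediction.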
This idea is used not only for inferring physical protein-protein interactions but also for automatic metabolic pathway reconstruction for fully sequenced genomes (Moriya et al., 2007). However, the interolog approach cannot work if significant sequence homology is not detected across different species, which means that the number of detectable interactions is limited and that it is impossible to predict species-specific interactions in any biological network.

To date, the genome-data-based approach and the evolutionary-information-based approach have been studied separately for inferring biological networks (Kato et al., 2005; Matthews et al., 2001; Walhout et al., 2000; Yamanishi et al., 2004, 2005), but the two kinds of information should complement each other and improve prediction reliability. Recall that all of the existing supervised network inference methods are designed to infer an individual biological network from genomic data for each species. Therefore, we believe that it will be more effective to use genomic data and evolutionary information simultaneously, rather than to use genomic data alone, in the framework of supervised network inference. The effectiveness of this combination was confirmed for a gene regulatory network in an unsupervised context (Tamada et al., 2005).

In this article, we describe a new semi-supervised learning method that we call Link Propagation for inferring the biological networks of multiple species from genome-wide data and evolutionary information. While the existing methods infer each of the networks individual (...truncated)