http://www.biomedcentral.com/content/pdf/1755-8794-2-61.pdf

Discovering cancer genes by integrating network and functional properties

Li Li (1), Kangyu Zhang (1), James Lee (0), Shaun Cordes (0), David P Davis (0), Zhijun Tang (1)

(0) Department of Molecular Biology, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
(1) Department of Bioinformatics, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA

Background: Identification of novel cancer-causing genes is one of the main goals in cancer research. The rapid accumulation of genome-wide protein-protein interaction (PPI) data in humans has provided a new basis for studying the topological features of cancer genes in cellular networks. It is important to integrate multiple genomic data sources, including PPI networks, protein domains and Gene Ontology (GO) annotations, to facilitate the identification of cancer genes.

Methods: Topological features of the PPI network, as well as protein domain compositions, enrichment of Gene Ontology categories, and sequence and evolutionary conservation features, were extracted and compared between cancer genes and other genes. The predictive power of various classifiers for identification of cancer genes was evaluated by cross validation. Experimental validation of a subset of the prediction results was conducted using siRNA knockdown and viability assays in the human colon cancer cell line DLD-1.

Results: Cross validation showed that classifiers based on support vector machines (SVMs) performed advantageously when topological features from the PPI network, protein domain compositions and GO annotations were included. We then applied the trained SVM classifier to human genes to prioritize putative cancer genes. siRNA knockdown of several SVM-predicted cancer genes greatly reduced cell viability in the human colon cancer cell line DLD-1.

Conclusion: Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes.
The SVM classifier integrates multiple features and as such is useful for prioritizing candidate cancer genes for experimental validation.

Background

Cancer is a complex disease whose multi-step progression involves alteration of many genes, including tumor suppressor genes and oncogenes. Although multiple targeted cancer therapeutic agents have been developed based on several known cancer genes, it is expected that many cancer genes remain to be identified [1]. Identification of novel genes likely to be involved in cancer is important for understanding the disease mechanism and for the development of cancer therapeutics.

Recently, efforts in global genomic re-sequencing have been made to identify novel cancer genes by detecting somatic mutations in tumor tissues [2-4]. However, it is challenging to distinguish true cancer-associated mutations from the large number of "passenger" variants detected in these studies, which are likely to be irrelevant to cancer progression.

Most gene products interact in complex cellular networks. It was proposed that direct and indirect interactions often occur between protein pairs whose mutations are attributable to similar disease phenotypes. This concept was utilized to predict phenotypic effects of gene mutations using protein complexes [5] and to identify previously unknown complexes likely to be associated with disease [6,7]. A similar notion may be applied to cancer, where identifying the protein interaction network of known cancer genes may provide an efficient way to discover novel cancer genes. The rapid accumulation of genome-wide human PPI data has provided a new basis for studying the topological features of cancer genes. It was shown that network properties in human PPI data, such as network connectivity, differ between cancer-causing genes [1] and other genes in the genome [8].
An interactome-transcriptome analysis also reported increased interaction connectivity of differentially expressed genes in lung squamous cancer tissues [9]. These studies indicated a central role of cancer proteins within the interactome. Recent studies also applied network approaches to studying cancer signaling [10] and identifying biomarkers of cancer progression in specific cancer types [11,12]. However, the utility of the PPI network for identifying novel genes whose genetic alterations are likely to be causally implicated in oncogenesis remains to be demonstrated. In addition, efforts have been made to use functional and sequence characteristics, such as GO annotation and sequence conservation, to predict cancer genes and cancer mutations [13,14]. However, a systematic side-by-side analysis of all these features is needed to evaluate their merits, both individually and in combination, in cancer gene prediction.

In this study, we took a machine learning approach to investigate various network and functional properties of known cancer genes in order to predict the likelihood that a gene is involved in cancer. Although the Cancer Gene Census provides a catalogue of currently known cancer-causing mutations, many other cancer genes may remain to be discovered in the rest of the genome. To reduce false positives in classifying genes not involved in cancer, we extended the comparison of various features to four non-overlapping gene groups: "cancer genes" from the Cancer Gene Census (bona fide cancer genes whose mutations are causally implicated in cancers) [1]; "COSMIC genes" profiled for somatic mutations in cancer and deposited into the Catalogue Of Somatic Mutations In Cancer (COSMIC) database [15] (excluding those in the cancer gene set); "OMIM genes" from the Online Mendelian Inheritance in Man (OMIM) database [16] (excluding those in the cancer or COSMIC gene set); and other genes in the genome (noted as "non-cancer genes").
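
The exclusion rules for the four gene groups amount to successive set differences. The following is a minimal sketch of that logic, not the authors' actual pipeline: the function name `build_gene_groups`, its parameters, and the gene identifiers `G1` to `G6` are hypothetical placeholders introduced only for illustration.

```python
def build_gene_groups(census, cosmic, omim, genome):
    """Split genes into four non-overlapping groups.

    Priority order follows the text: Cancer Gene Census first, then COSMIC
    (minus census genes), then OMIM (minus census and COSMIC genes), and
    finally all remaining genes in the genome.
    All argument names are illustrative assumptions.
    """
    cancer = set(census)
    cosmic_only = set(cosmic) - cancer
    omim_only = set(omim) - cancer - cosmic_only
    non_cancer = set(genome) - cancer - cosmic_only - omim_only
    return {
        "cancer": cancer,
        "COSMIC": cosmic_only,
        "OMIM": omim_only,
        "non-cancer": non_cancer,
    }


# Toy input with placeholder identifiers (not real gene data).
groups = build_gene_groups(
    census={"G1", "G2"},
    cosmic={"G2", "G3"},
    omim={"G1", "G3", "G4"},
    genome={"G1", "G2", "G3", "G4", "G5", "G6"},
)

for name, members in groups.items():
    print(name, sorted(members))
```

Applying the differences in a fixed priority order is what guarantees that a gene listed in several databases is assigned to exactly one group.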
Somatic mutations were observed for a subset of "COSMIC genes" in cancers, and these genes are potentially related to oncogenesis, whereas "OMIM genes" comprise known disease genes other than the known cancer genes. We trained various classifiers using "cancer genes" and "non-cancer genes", and evaluated the contribution of various features and different classification methods using cross validation. We then applied the trained classifier with the best cross validation performance to human genes to prioritize those likely to be involved in cancer. To evaluate the roles of predicted cancer genes in cancer cell growth and proliferation, siRNA knockdown experiments and cell viability assays were conducted in a human colorectal cancer cell line.

Methods

Datasets

The PPI network was constructed as the union of all relationships obtained from representative published datasets [8,17,18]. Sequence features were obtained from the NCBI Entrez database [19]. The number of alternative transcripts for each Entrez gene was obtained from the RefSeq database. The non-synonymous mutation rate (Ka) and synonymous mutation rate (Ks) of human-mouse and human-rat orthologs were retrieved from the NCBI HomoloGene database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/). We constructed four non-overlapping gene (...truncated)