Discovering cancer genes by integrating network and functional properties
Li Li (1), Kangyu Zhang (1), James Lee (0), Shaun Cordes (0), David P Davis (0), Zhijun Tang (1)
(0) Department of Molecular Biology, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
(1) Department of Bioinformatics, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
Background: Identification of novel cancer-causing genes is one of the main goals in cancer research. The rapid accumulation of genome-wide protein-protein interaction (PPI) data in humans has provided a new basis for studying the topological features of cancer genes in cellular networks. It is important to integrate multiple genomic data sources, including PPI networks, protein domains and Gene Ontology (GO) annotations, to facilitate the identification of cancer genes.
Methods: Topological features of the PPI network, protein domain compositions, enrichment of GO categories, and sequence and evolutionary conservation features were extracted and compared between cancer genes and other genes. The predictive power of various classifiers for identifying cancer genes was evaluated by cross validation. A subset of the predictions was validated experimentally using siRNA knockdown and viability assays in the human colon cancer cell line DLD-1.
Results: In cross validation, classifiers based on support vector machines (SVMs) performed best when the topological features from the PPI network, protein domain compositions and GO annotations were included. We then applied the trained SVM classifier to human genes to prioritize putative cancer genes. siRNA knockdown of several SVM-predicted cancer genes greatly reduced cell viability in the human colon cancer cell line DLD-1.
Conclusion: Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes. The SVM classifier integrates multiple features and is therefore useful for prioritizing candidate cancer genes for experimental validation.
Background
Cancer is a complex disease whose multi-step progression
involves alteration of many genes, including tumor
suppressor genes and oncogenes. Although multiple targeted
cancer therapeutic agents have been developed based on
several known cancer genes, it is expected that many
cancer genes remain to be identified [1]. Identification of
novel genes likely to be involved in cancer is important for
understanding the disease mechanism and development
of cancer therapeutics. Recently, efforts in global genomic
re-sequencing have been made to identify novel cancer
genes by detecting somatic mutations in tumor tissues
[2-4]. However, it is challenging to distinguish true
cancer-associated mutations from the large number of "passenger"
variants detected in these studies, which are likely to be
irrelevant to cancer progression.
Most gene products interact in complex cellular networks.
It has been proposed that direct and indirect interactions often
occur between protein pairs whose mutations cause
similar disease phenotypes. This concept was
utilized to predict phenotypic effects of gene mutations
using protein complexes [5] and to identify previously
unknown complexes likely to be associated with disease
[6,7]. A similar notion may be applied to cancer, where
identifying the protein interaction network of known cancer
genes may provide an efficient way to discover novel
cancer genes. The rapid accumulation of genome-wide
human PPI data has provided a new basis for studying the
topological features of cancer genes. It was shown that the
network properties in human protein-protein interaction
(PPI) data, such as network connectivity, differ between
cancer-causing genes [1] and other genes in the genome
[8]. An interactome-transcriptome analysis also reported
increased interaction connectivity of differentially
expressed genes in lung squamous cancer tissues [9].
These studies indicated a central role of cancer proteins
within the interactome. Recent studies also applied
network approaches to studying cancer signaling [10] and
identifying biomarkers of cancer progression in specific
cancer types [11,12]. However, the utility of PPI networks
for identifying novel genes whose genetic alterations
are likely to be causally implicated in oncogenesis
remains to be demonstrated. In addition, efforts have
been made to use functional and sequence characteristics,
such as GO annotation and sequence conservation, to
predict cancer genes and cancer mutations [13,14].
However, a systematic analysis of all these features side-by-side
is needed to evaluate their merits, both individually and
in combination, in cancer gene prediction.
In this study, we took a machine learning approach to
investigate various network and functional properties of
known cancer genes to predict the likelihood that a gene is
involved in cancer. Although the Cancer Gene Census
provides a catalogue of currently known cancer-causing
mutations, many other cancer genes may remain to be
discovered in the rest of the genome. To reduce the
false positives in classifying genes not involved in cancer,
we extended the comparison of various features to four
non-overlapping gene groups, i.e. "cancer genes" from the
Cancer Gene Census (bona fide cancer genes whose
mutations are causally implicated in cancers) [1], "COSMIC
genes" profiled for somatic mutations in cancer and
deposited into the Catalogue Of Somatic Mutations In
Cancer (COSMIC) database [15] (excluding those in the
cancer gene set), "OMIM genes" from the Online
Mendelian Inheritance in Man (OMIM) database [16] (excluding
those in the cancer or COSMIC gene set), and other genes
in the genome (noted as "non-cancer genes"). Somatic
mutations were observed in cancers for a subset of "COSMIC genes",
which are therefore potentially related to oncogenesis,
while "OMIM genes" comprise known disease genes
other than known cancer genes. We trained
various classifiers using "cancer genes" and "non-cancer
genes", and evaluated the contribution of various features
and different classification methods using cross
validation. We then applied the trained classifier with the best
cross validation performance to human genes to prioritize
those likely to be involved in cancer. To evaluate
the roles of predicted cancer genes in cancer cell growth
and proliferation, siRNA knock-down experiments and
cell viability assays were conducted in a human colorectal
cancer cell line.
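
The grouping and cross validation setup described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the gene identifiers (G1 to G12), the number of folds and the round-robin fold assignment are all assumptions made for illustration. The only ideas taken from the text are the priority order used to keep the four groups non-overlapping and the use of "cancer genes" and "non-cancer genes" as the two training classes.

```python
import random


def partition_gene_groups(census, cosmic, omim, genome):
    """Split a genome into four non-overlapping groups.

    Priority follows the text: Cancer Gene Census first, then COSMIC
    (excluding census genes), then OMIM (excluding both), then the rest.
    Restricting every list to `genome` is an assumption of this sketch.
    """
    genome = set(genome)
    cancer = set(census) & genome
    cosmic_only = (set(cosmic) & genome) - cancer
    omim_only = (set(omim) & genome) - cancer - cosmic_only
    non_cancer = genome - cancer - cosmic_only - omim_only
    return {
        "cancer": cancer,
        "cosmic": cosmic_only,
        "omim": omim_only,
        "non_cancer": non_cancer,
    }


def stratified_folds(positives, negatives, k, seed=0):
    """Assign labelled genes to k folds, spreading each class evenly.

    Returns a list of k folds; each fold is a list of (gene, label)
    pairs with label 1 for positives and 0 for negatives.
    """
    rng = random.Random(seed)
    pos, neg = sorted(positives), sorted(negatives)  # sort for reproducibility
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for i, gene in enumerate(pos):
        folds[i % k].append((gene, 1))
    for i, gene in enumerate(neg):
        folds[i % k].append((gene, 0))
    return folds


# Hypothetical toy identifiers, not real gene annotations.
genome = {f"G{i}" for i in range(1, 13)}
census = {"G1", "G2", "G3"}
cosmic = {"G2", "G4", "G5"}
omim = {"G3", "G5", "G6", "G7"}

groups = partition_gene_groups(census, cosmic, omim, genome)
folds = stratified_folds(groups["cancer"], groups["non_cancer"], k=3)

for name, genes in groups.items():
    print(name, sorted(genes))
for i, fold in enumerate(folds):
    print("fold", i, fold)
```

In each round of cross validation, one fold would be held out for testing while a classifier is trained on the remaining folds. The COSMIC and OMIM groups are kept out of the training classes in this sketch, matching the training setup stated in the text.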
Methods
Datasets
The PPI network was constructed as the union of all
relationships obtained from representative published datasets
[8,17,18]. Sequence features were obtained from the NCBI
Entrez database [19]. The number of alternative
transcripts for each Entrez gene was obtained from the RefSeq
database. The non-synonymous mutation rate Ka and the
synonymous mutation rate Ks of human-mouse and human-rat
orthologs were retrieved from the NCBI HomoloGene
database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/).
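
As a rough illustration of building a network as the union of several interaction datasets, the sketch below merges undirected edge lists and computes node degree, one simple measure of network connectivity. The function names, the protein identifiers (P1 to P4), the three toy datasets and the choice to drop self-interactions are assumptions of this example, not details taken from the published datasets.

```python
from collections import defaultdict


def union_network(*datasets):
    """Merge interaction lists into one set of undirected edges.

    Each dataset is an iterable of (protein_a, protein_b) pairs.
    An interaction reported in several datasets, or in either
    orientation, is stored once. Self-interactions are dropped,
    which is an assumption of this sketch.
    """
    edges = set()
    for dataset in datasets:
        for a, b in dataset:
            if a != b:
                edges.add(frozenset((a, b)))
    return edges


def node_degrees(edges):
    """Count the number of distinct interaction partners per protein."""
    degree = defaultdict(int)
    for edge in edges:
        for protein in edge:
            degree[protein] += 1
    return dict(degree)


# Hypothetical toy datasets, not real interaction data.
dataset_a = [("P1", "P2"), ("P2", "P3")]
dataset_b = [("P2", "P1"), ("P3", "P4")]
dataset_c = [("P2", "P4"), ("P4", "P4")]

network = union_network(dataset_a, dataset_b, dataset_c)
degree = node_degrees(network)
print(len(network), "edges")
print(sorted(degree.items()))
```

Storing each edge as a `frozenset` makes the pair order irrelevant, so the same interaction reported by two sources collapses into a single edge in the union.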
We constructed four non-overlapping gene (...truncated)