Knowledge-guided gene ranking by coordinative component analysis
Chen Wang
Jianhua Xuan
Huai Li
Yue Wang
Ming Zhan
Eric P Hoffman
Robert Clarke
Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
Background: In cancer, gene networks and pathways often exhibit dynamic behavior, particularly during carcinogenesis. It is therefore important to prioritize the genes that are strongly associated with the functionality of a network. Traditional statistical methods are often ill-suited to identifying biologically relevant member genes, which has motivated researchers to incorporate biological knowledge into gene ranking methods. However, current integration strategies are often heuristic and fail to capture fully the true interplay between biological knowledge and gene expression data.
Results: To improve knowledge-guided gene ranking, we propose a novel method called coordinative component analysis (COCA). COCA explicitly captures those genes within a specific biological context that are likely to be expressed in a coordinative manner. Formulated as an optimization problem that maximizes the coordinative effort, COCA first extracts the coordinative components under partial guidance from knowledge genes and then ranks the genes according to their participation strengths. An embedded bootstrapping procedure improves the statistical robustness of the solutions. COCA was first tested on simulation data and then on published gene expression microarray data to demonstrate its improved performance compared with traditional statistical methods. Finally, COCA was applied to stem cell data to identify biologically relevant genes in signaling pathways. The approach uncovers novel pathway members that may shed light on pathway deregulation in cancers.
Conclusion: We have developed a new integrative strategy that combines biological knowledge and microarray data for gene ranking. The method uses knowledge genes as guidance to first extract coordinative components and then rank the genes according to their contribution to a network or pathway. The experimental results show that such a knowledge-guided strategy can provide context-specific gene ranking with improved performance in pathway member identification.
Background
It is of great interest to identify genes strongly associated
with the functionality of gene networks or signal
transduction pathways particularly from gene expression
microarray data. Two of the earliest approaches to
identifying such genes are fold-change and multiple t-testing;
each aims to rank the genes in order of their
differential expression under various experimental conditions.
Many improvements to the original t-test method have
been proposed for microarray data analysis. For example,
significance analysis of microarrays (SAM) [1] uses a
modified t-statistic with an added estimator for gene ranking
in which the false discovery rate (FDR) is estimated by a
permutation procedure. A bootstrapped p-value
approach was introduced in [2] to address the inherent
variability in small sample studies. Prior studies have
shown that fold-change is more robust than the t-test with
respect to the reproducibility of gene rankings [3], while
other researchers argue that better reproducibility does
not guarantee the accuracy of gene ranking [4].
Nonetheless, both methods are severely limited because they
neglect the interactions among genes and prioritize
genes based only on their individual expression
values.
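To make the contrast between the two per-gene rankings concrete, the following is a minimal sketch, not taken from the cited methods' software. It assumes log2 expression matrices for two conditions; the function name `rank_genes`, the toy data, and the parameter `s0` (a constant added to the denominator, in the spirit of the SAM modification) are illustrative assumptions.

```python
import numpy as np

def rank_genes(x_a, x_b, s0=0.0):
    """Rank genes by |log fold-change| and by a per-gene t-statistic.

    x_a, x_b: arrays of shape (genes, samples) with log2 expression
    for two conditions. s0 is a hypothetical constant added to the
    denominator (SAM-like); s0=0 gives Welch's t-statistic.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    diff = x_a.mean(axis=1) - x_b.mean(axis=1)  # log2 fold-change
    # Standard error of the difference in means (unequal variances).
    se = np.sqrt(x_a.var(axis=1, ddof=1) / x_a.shape[1]
                 + x_b.var(axis=1, ddof=1) / x_b.shape[1])
    t = diff / (se + s0)
    # Sorting on the negated absolute score lists the strongest gene first.
    fc_order = np.argsort(-np.abs(diff), kind="stable")
    t_order = np.argsort(-np.abs(t), kind="stable")
    return diff, t, fc_order, t_order

# Toy data (made up for illustration): three genes, three samples each.
cond_a = [[5.0, 5.1, 4.9], [8.0, 2.0, 5.0], [1.0, 1.2, 0.8]]
cond_b = [[4.0, 4.1, 3.9], [3.0, 1.0, 2.0], [1.0, 1.1, 0.9]]
diff, t, fc_order, t_order = rank_genes(cond_a, cond_b)
print("fold-change ranking:", fc_order)
print("t-statistic ranking:", t_order)
```

In this toy input, gene 1 has the largest mean difference but also the largest variance, so it tops the fold-change ranking while gene 0 tops the t-statistic ranking. Both scores are computed one gene at a time, which is the limitation noted above.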
To address the above-mentioned problem, several gene
ranking methods have been proposed either to consider
the joint effect of genes or to explore the expression
pattern in time-course data. For instance, Opgen-Rhein &
Strimmer [5] introduced the "shrinkage t" statistic that is
based on a novel and model-free shrinkage estimate of
the variance vector across genes. Storey et al. [6]
proposed a method (EDGE) to first fit the time-course
expression pattern by splines, and then rank genes by
hypothesis testing on the spline parameters. Furlanello et
al. [7] proposed a classification-based feature elimination
scheme to rank genes by iteratively discarding chunks of
genes showing the least contribution to the classifier.
In contrast, other investigators have proposed
incorporating biological knowledge for gene ranking. GeneRank
[8] ranks genes by integrating gene expression and
network structure derived from gene annotations. Ma et al.
[9] proposed a strategy to combine gene expression and
protein-protein interaction (PPI) knowledge, ranking
genes by their association with phenotype calibrated by
the PPI information. However, such data integration,
while widely adopted, is usually done in a heuristic way
and lacks an objective estimate of the true interplay
between biological knowledge and gene expression data.
In this paper, we propose a knowledge-guided gene
ranking scheme, namely a coordinative component
analysis (COCA) algorithm, to model explicitly those genes
that are most likely to be expressed in a coordinative
manner within a specific biological context. We consider
the genes that belong to a pathway or a network as a
whole, rather than treating genes as independent or
individual measures. To enhance the biological relevance of
gene ranking, we require that the
intrinsic coordination among the genes be defined by biological
knowledge. Specifically, biological knowledge, which
could be the gene sets within a biological pathway or
subnetwork derived from relevant biological databases, is
used to guide the algorithm. Thus, we can address the
conditional specificity of biological context, for example,
where the deregulation of a network only occurs under
specific conditions. We rank each individual gene by
evaluating its participation or involvement in the pathway of
interest, when projected onto the coordinative direction
learned by the COCA algorithm. In COCA, a
bootstrapping procedure is also implemented to improve the
statistical robustness of the ranking results. Using
simulation data and published gene expression microarray
data, including yeast cell cycle data and stem cell
time-course data, we demonstrate that the COCA approach
provides improved performance compared with traditional
statistical methods, indicating its effectiveness for
incorporating biological knowledge into gene ranking.
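The idea of ranking genes by their projection onto a knowledge-guided direction, stabilized by bootstrapping, can be illustrated with a rough sketch. This is not the COCA optimization itself: as a loudly stated assumption, the direction here is simply the normalized mean profile of the resampled knowledge genes, used as a stand-in for the learned coordinative component. The function name `bootstrap_participation`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def bootstrap_participation(x, knowledge_idx, n_boot=200, seed=0):
    """Average projection-based gene scores over bootstrap resamples.

    x: (genes, samples) expression matrix.
    knowledge_idx: row indices of the knowledge genes.
    ASSUMPTION: the direction is the normalized mean profile of the
    resampled knowledge genes, a stand-in for COCA's optimized
    coordinative component.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)  # center each gene's profile
    knowledge_idx = np.asarray(knowledge_idx)
    scores = np.zeros(x.shape[0])
    used = 0
    for _ in range(n_boot):
        # Resample the knowledge genes with replacement.
        pick = rng.choice(knowledge_idx, size=knowledge_idx.size, replace=True)
        w = xc[pick].mean(axis=0)
        norm = np.linalg.norm(w)
        if norm == 0:
            continue  # degenerate resample: no usable direction
        # Participation strength: |projection| of every gene onto w.
        scores += np.abs(xc @ (w / norm))
        used += 1
    if used:
        scores /= used
    order = np.argsort(-scores, kind="stable")  # strongest gene first
    return scores, order

# Toy data (made up): genes 0 and 1 are knowledge genes sharing a trend,
# gene 2 follows the same trend more strongly, gene 3 oscillates.
trend = np.arange(6.0)
data = np.vstack([trend, 0.5 * trend, 2.0 * trend,
                  [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]])
scores, order = bootstrap_participation(data, knowledge_idx=[0, 1])
print(order)
```

Here the non-knowledge gene 2 ranks first because it follows the knowledge genes' trend most strongly, which mirrors the goal of surfacing candidate pathway members. Scoring by raw projection favors high-variance genes; whether and how to normalize is a design choice this sketch leaves open.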
Methods
Figure 1 A flowchart of the proposed approach, namely knowledge-guided coordinative component analysis (COCA), for gene ranking. A bootstrapping procedure is designed to increase the confidence in estimating the coordinative component (W) and participation vector (A).
A flowchart of the proposed approach is shown in Figure
1. Given a gene expression microarray data set, multiple
data sets are first generated through bootstrap
resampling of the genes in the array. The bootstrapping
procedure is used to overcome the over-fitting (...truncated)