Knowledge-guided gene ranking by coordinative component analysis

http://www.biomedcentral.com/content/pdf/1471-2105-11-162.pdf

Chen Wang (0), Jianhua Xuan (0), Huai Li, Yue Wang (0), Ming Zhan, Eric P Hoffman, Robert Clarke

(0) Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA

Background: In cancer, gene networks and pathways often exhibit dynamic behavior, particularly during the process of carcinogenesis. Thus, it is important to prioritize those genes that are strongly associated with the functionality of a network. Traditional statistical methods are often poorly suited to identifying biologically relevant member genes, motivating researchers to incorporate biological knowledge into gene ranking methods. However, current integration strategies are often heuristic and fail to fully incorporate the true interplay between biological knowledge and gene expression data.

Results: To improve knowledge-guided gene ranking, we propose a novel method called coordinative component analysis (COCA). COCA explicitly captures those genes within a specific biological context that are likely to be expressed in a coordinative manner. Formulated as an optimization problem to maximize the coordinative effort, COCA is designed to first extract the coordinative components based on partial guidance from knowledge genes and then rank the genes according to their participation strengths. An embedded bootstrapping procedure is implemented to improve the statistical robustness of the solutions. COCA was initially tested on simulation data and then on published gene expression microarray data to demonstrate its improved performance as compared to traditional statistical methods. Finally, the COCA approach has been applied to stem cell data to identify biologically relevant genes in signaling pathways. As a result, the COCA approach uncovers novel pathway members that may shed light on pathway deregulation in cancers.

Conclusion: We have developed a new integrative strategy to combine biological knowledge and microarray data for gene ranking.
The method utilizes knowledge genes as guidance to first extract coordinative components and then rank the genes according to their contribution to a network or pathway. The experimental results show that such a knowledge-guided strategy can provide context-specific gene ranking with improved performance in pathway member identification.

Background

It is of great interest to identify genes strongly associated with the functionality of gene networks or signal transduction pathways, particularly from gene expression microarray data. Two of the earliest approaches to identifying such genes are fold-change and multiple t-testing; each aims to rank the genes by their differential expression across experimental conditions. Many improvements to the original t-test method have been proposed for microarray data analysis. For example, significance analysis of microarrays (SAM) [1] uses a modified t-statistic with an added estimator for gene ranking, in which the false discovery rate (FDR) is estimated by a permutation procedure. A bootstrapped p-value approach was introduced in [2] to address the inherent variability in small sample studies. Prior studies have shown that fold-change is more robust than the t-test with respect to the reproducibility of gene rankings [3], while other researchers argue that better reproducibility does not guarantee the accuracy of gene ranking [4]. Nonetheless, both methods are severely limited because they neglect the interactions among genes, prioritizing genes based only on individual gene expression values.

To address this problem, several gene ranking methods have been proposed to either consider the joint effect of genes or to explore the expression pattern in time-course data. For instance, Opgen-Rhein & Strimmer [5] introduced the "shrinkage t" statistic that is based on a novel and model-free shrinkage estimate of the variance vector across genes. Storey et al.
[6] proposed a method (EDGE) to first fit the time-course expression pattern by splines, and then rank genes by hypothesis testing on the spline parameters. Furlanello et al. [7] proposed a classification-based feature elimination scheme to rank genes by iteratively discarding chunks of genes showing the least contribution to the classifier. In contrast, other investigators have proposed incorporating biological knowledge for gene ranking. GeneRank [8] ranks genes by integrating gene expression and network structure derived from gene annotations. Ma et al. [9] proposed a strategy to combine gene expression and protein-protein interaction (PPI) knowledge, ranking genes by their association with phenotype calibrated by the PPI information. However, such data integration, while widely adopted, is usually done in a heuristic way and lacks an objective estimate of the true interplay between biological knowledge and gene expression data.

In this paper, we propose a knowledge-guided gene ranking scheme, namely a coordinative component analysis (COCA) algorithm, to model explicitly those genes that are most likely to be expressed in a coordinative manner within a specific biological context. We consider the genes that belong to a pathway or a network as a whole, rather than treating genes as independent or individual measures. To enhance the biological relevance of gene ranking, this organization of genes requires that the intrinsic coordination among them be defined by biological knowledge. Specifically, biological knowledge, which could be the gene sets within a biological pathway or subnetwork derived from relevant biological databases, is used to guide the algorithm. Thus, we can address the conditional specificity of biological context, for example, cases where the deregulation of a network occurs only under specific conditions.
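
For reference, the two per-gene baselines discussed at the start of this Background (fold-change and the t-test) can be sketched as follows. This is a minimal illustration of those baselines only, not the COCA algorithm and not the authors' code. The function names, the use of a Welch-type t-statistic, and the simulated log2 expression matrices are all assumptions made for the example.

```python
import numpy as np


def fold_change_scores(expr_a, expr_b):
    """Absolute log fold-change per gene.

    Inputs are (genes x samples) matrices of log2 expression, so the
    difference of row means is the log2 fold-change between conditions.
    """
    return np.abs(expr_a.mean(axis=1) - expr_b.mean(axis=1))


def welch_t_scores(expr_a, expr_b):
    """Absolute Welch t-statistic per gene (unequal variances allowed)."""
    mean_diff = expr_a.mean(axis=1) - expr_b.mean(axis=1)
    var_a = expr_a.var(axis=1, ddof=1) / expr_a.shape[1]
    var_b = expr_b.var(axis=1, ddof=1) / expr_b.shape[1]
    return np.abs(mean_diff) / np.sqrt(var_a + var_b)


def rank_genes(scores):
    """Gene indices ordered from the highest score to the lowest."""
    return np.argsort(-scores, kind="stable")


# Hypothetical toy data: 50 genes, 8 samples per condition (assumption).
rng = np.random.default_rng(0)
n_genes, n_samples = 50, 8
cond_a = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
cond_b = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
cond_b[:3] += 8.0  # genes 0-2 are made strongly differentially expressed

fc_rank = rank_genes(fold_change_scores(cond_a, cond_b))
t_rank = rank_genes(welch_t_scores(cond_a, cond_b))
print("top genes by fold-change:", fc_rank[:5])
print("top genes by t-statistic:", t_rank[:5])
```

Both scores are computed one gene at a time, which is exactly the limitation noted above: neither uses any information about how genes in a pathway behave together.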
We rank each individual gene by evaluating its participation or involvement in the pathway of interest, when projected onto the coordinative direction learned by the COCA algorithm. In COCA, a bootstrapping procedure is also implemented to improve the statistical robustness of the ranking results. Using simulation data and published gene expression microarray data, including yeast cell cycle data and stem cell time-course data, we demonstrate that the COCA approach provides improved performance compared to traditional statistical methods, indicating its effectiveness for incorporating biological knowledge into gene ranking.

Methods

A flowchart of the proposed approach is shown in Figure 1.

Figure 1. A flowchart of the proposed approach, namely knowledge-guided coordinative component analysis (COCA), for gene ranking. A bootstrapping procedure is designed to increase the confidence in estimating the coordinative component (W) and participation vector (A).

Given a gene expression microarray data set, multiple data sets are first generated through bootstrap resampling of the genes in the array. The bootstrapping procedure is used to overcome the over-fitting (...truncated)