Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks

Biostatistics, Apr 2006

Large-scale microarray gene expression data provide the possibility of constructing genetic networks or biological pathways. Gaussian graphical models have been suggested to provide an effective method for constructing such genetic networks. However, most of the available methods for constructing Gaussian graphs do not account for the sparsity of the networks and are computationally more demanding or infeasible, especially in the settings of high dimension and low sample size. We introduce a threshold gradient descent (TGD) regularization procedure for estimating the sparse precision matrix in the setting of Gaussian graphical models and demonstrate its application to identifying genetic networks. Such a procedure is computationally feasible and can easily incorporate prior biological knowledge about the network structure. Simulation results indicate that the proposed method yields a better estimate of the precision matrix than the procedures that fail to account for the sparsity of the graphs. We also present the results on inference of a gene network for isoprenoid biosynthesis in Arabidopsis thaliana. These results demonstrate that the proposed procedure can indeed identify biologically meaningful genetic networks based on microarray gene expression data.

Advance Access publication on December

HONGZHE LI, JIANG GUI

Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, 920 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, USA

© The Author 2005. Published by Oxford University Press. All rights reserved.

Keywords: Empirical Bayes thresholding; Graphical models; Microarray; Threshold gradient descent

1. INTRODUCTION

2. GAUSSIAN GRAPHICAL MODELS

We assume that the observed gene expression data are randomly sampled observational or experimental data from a multivariate normal probability model. Specifically, let $X$ be a random normal $p$-dimensional vector with elements $X_1, \ldots, X_p$, where $p$ is the number of genes. Let $V = \{1, \ldots, p\}$ be the set of nodes (genes), and let $X^{(k)}$ be the vector of gene expression levels for the $k$th sample. We assume that

$$X \sim N_p(0, \Sigma),$$

where $\Sigma$ is the covariance matrix and $\Omega = \Sigma^{-1} = (\omega_{ij})$ is the precision matrix. This model is also called a covariance selection model (Dempster, 1972) or a Gaussian concentration graph model.

Let $[-i]$ denote the set $\{1, 2, \ldots, i-1, i+1, \ldots, p\}$. In the Gaussian graphical model, it is well known that the partial regression coefficient of $X_i$ on $X_j$ in the normal linear regression $p(X_i \mid X_{[-i]})$ is $-\omega_{ij}/\omega_{ii}$, $j \in [-i]$, and that the partial correlation between the $i$th and the $j$th gene is

$$\rho_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}.$$

For a given gene $g$, we define the neighbors of this gene as

$$\mathrm{ne}_g = \{j : \omega_{gj} \neq 0,\ j \in [-g]\},$$

so that $X_g$ is conditionally independent of all remaining genes given its neighbors:

$$X_g \perp X_{V \setminus (\mathrm{ne}_g \cup \{g\})} \mid X_{\mathrm{ne}_g}.$$

3. EBT AND THRESHOLD GRADIENT DESCENT REGULARIZATION

Estimation based on EBT when n > p

Given the estimated precision matrix elements $\hat\omega_{ij}$, the estimated partial correlations are

$$\hat\rho_{ij} = -\frac{\hat\omega_{ij}}{\sqrt{\hat\omega_{ii}\,\hat\omega_{jj}}}.$$

We then perform Fisher's Z-transformation on all the partial correlations and denote the Z-transformed partial correlation as $z_{ij}$, i.e.

$$z_{ij} = \frac{1}{2}\log\frac{1+\hat\rho_{ij}}{1-\hat\rho_{ij}}.$$
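The partial correlation and Fisher Z-transformation formulas can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `partial_correlations` and `fisher_z` are hypothetical, and the toy data use the inverse of the sample covariance matrix as the precision estimate, which assumes n > p.

```python
import numpy as np

def partial_correlations(omega):
    """Return rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for a precision matrix."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)  # the diagonal is set to 1 by convention
    return rho

def fisher_z(rho):
    """Fisher's Z-transformation: 0.5 * log((1 + rho) / (1 - rho)) = arctanh(rho)."""
    return np.arctanh(rho)

# Small hand-checkable case: omega_12 = -1, omega_11 = omega_22 = 2 gives rho_12 = 0.5.
omega_small = np.array([[2.0, -1.0], [-1.0, 2.0]])
rho_small = partial_correlations(omega_small)

# Toy data with n = 200 samples and p = 5 variables (assumption: n > p, so the
# sample covariance matrix is invertible and its inverse estimates the precision).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
S = np.cov(X, rowvar=False)
omega_hat = np.linalg.inv(S)

rho_hat = partial_correlations(omega_hat)
iu = np.triu_indices(5, k=1)   # each pair (i, j) with i < j, counted once
z = fisher_z(rho_hat[iu])      # Z-transformed partial correlations z_ij
```

Only the upper triangle is transformed because the diagonal entries equal 1, where the Z-transformation is undefined.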
Following Johnstone and Silverman (2004), we assume the following model for $z_{ij}$:

$$z_{ij} = \xi_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2),$$

where $\xi_{ij}$ is the Z-transformation of the true partial correlation $\rho_{ij}$, $\sigma^2$ is the error variance, and the elements $\xi_{ij}$ follow a mixture of a point mass at zero and a Laplace distribution, where $w$ is the mixture probability and $\delta_0(\xi)$ is the density with mass one at zero. From this model, one can derive the posterior distribution of $\xi_{ij}$. Johnstone and Silverman (2004) suggested thresholding the values of $z_{ij}$ by the posterior median of $\xi_{ij}$, and they showed that the resulting estimate of $\xi_{ij}$ is uniformly bounded over all signals for some constant $C_0$ (the bound is taken over the $p(p-1)$ off-diagonal pairs $(i, j)$). After the EBT, we would expect that many of the elements of the precision matrix with very small values of the partial correlations are thresholded to zero, corresponding to no edges of the Gaussian graph. This MLE–EBT approach is similar in spirit to that of Schafer and Strimmer (2005) in the setting where p < n.

Regularized estimation by TGD on the off-diagonal elements

Let $\omega_d$ and $\omega_o$ denote the diagonal and off-diagonal elements of the precision matrix. Based on equation (3.1), the loss function is

$$l(\omega_d, \omega_o) = -w(\omega_d, \omega_o),$$

and the procedure follows its gradient with respect to $\omega_o$. The thresholded search direction is

$$h(\nu) = \{f_j(\nu) \cdot g_j(\nu),\ j = 1, \ldots, q\}.$$

The TGD procedure includes the following steps:

1. Set $\omega_o(0) = 0$, $\omega_d(0) = 1$, $\nu = 0$.
2. Calculate $g(\nu) = -\partial l/\partial \omega_o$ for the current $\omega_o$ and $\omega_d$.
...
6. Repeat steps 2–5.

Model selection by cross-validation and bootstrap

4. SIMULATIONS

Estimation when p < n

We consider Gaussian precision graphs with 40 nodes and four precision matrices ($\Omega$) with different degrees of sparsity. The quadratic loss is

$$l_2(\Omega, \hat\Omega) = \mathrm{tr}(\Omega^{-1}\hat\Omega - I)^2.$$

Estimation when p > n

... the right panel represents the quadratic loss.

Fig. 4. Results based on simulation for Gaussian graphs with p = 200 and sample size of n = 100. For each plot, the x-axis is the TGD step, and the y-axis is sensitivity (a), specificity (b), false discovery rate (c), and false-negative rate (d). The dashed lines are ±1 SE based on 50 replications.
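The quadratic loss and the edge-recovery rates named in the simulation results can be computed as in the following sketch. Several points are assumptions rather than details taken from the paper: the squared term in the loss is read as the trace of the squared matrix, the four rates use their conventional definitions (in particular, the false-negative rate is taken as FN / (FN + TP)), an entry counts as an edge when its absolute value exceeds a hypothetical tolerance `tol`, and the function names are illustrative.

```python
import numpy as np

def quadratic_loss(omega, omega_hat):
    """tr[(Omega^{-1} Omega_hat - I)^2], read as the trace of the squared matrix."""
    p = omega.shape[0]
    m = np.linalg.solve(omega, omega_hat) - np.eye(p)  # Omega^{-1} Omega_hat - I
    return float(np.trace(m @ m))

def edge_metrics(omega, omega_hat, tol=1e-8):
    """Compare estimated and true edge sets over the pairs i < j."""
    iu = np.triu_indices(omega.shape[0], k=1)
    truth = np.abs(omega[iu]) > tol      # true edges: nonzero off-diagonal entries
    est = np.abs(omega_hat[iu]) > tol    # estimated edges
    tp = int(np.sum(truth & est))
    fp = int(np.sum(~truth & est))
    fn = int(np.sum(truth & ~est))
    tn = int(np.sum(~truth & ~est))

    def ratio(a, b):
        return a / b if b > 0 else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fdr": ratio(fp, tp + fp),
        "fnr": ratio(fn, tp + fn),  # assumed convention: 1 - sensitivity
    }

# True graph: a chain 0-1-2-3 (three edges, three non-edges).
omega_true = np.array([[ 2.0, -1.0,  0.0,  0.0],
                       [-1.0,  2.0, -1.0,  0.0],
                       [ 0.0, -1.0,  2.0, -1.0],
                       [ 0.0,  0.0, -1.0,  2.0]])
# Estimate: misses edge 2-3 and adds a false edge 0-3.
omega_est = np.array([[ 2.0, -0.8,  0.0,  0.3],
                      [-0.8,  2.0, -0.9,  0.0],
                      [ 0.0, -0.9,  2.0,  0.0],
                      [ 0.3,  0.0,  0.0,  2.0]])

metrics = edge_metrics(omega_true, omega_est)
loss_exact = quadratic_loss(omega_true, omega_true)  # zero when the estimate is exact
loss_est = quadratic_loss(omega_true, omega_est)
```

In this toy case the estimate recovers two of the three true edges and reports one false edge, so each of the four rates can be checked by hand from the counts TP = 2, FP = 1, FN = 1, TN = 2.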
5. APPLICATIONS TO ISOPRENOID PATHWAYS IN A. THALIANA

5.1 Results from the TGD procedure

In order to demonstrate whether the proposed TGD method can identify the known isoprenoid pathways of these 40 genes based on the 118 gene expression measurements, we first estimated the precision matrix.

Fig. 6. Pathways identified by the tri-graph method of Wille et al. (2004) (left plot) and the SINful approach with a cutoff p-value of 0.50 (right plot) for the 40 genes in the isoprenoid pathways, where the solid arrows are the true pathways and the curved undirected lines are the estimated edges. For each plot, the left panel includes a subgraph of the gene module in the MEP pathway and the right panel includes a subgraph of the gene module in the MVA pathway.

5.2 Comparison with other methods

As a comparison, we applied the SINful procedure using the inverse of the sample covariance matrix ... identified by either of the two methods. Even if the p-value cutoff is set to 0.50, many false edges between the MEP and MVA pathways are identified, and even the tightly connected DXR–HDS [1-hydroxy-2-methyl-2-(E)-butenyl-4-diphosphate synthase] module cannot be identified (see right plot of Figure 6). Similarly, the MLE–EBT procedure identified only a few edges and failed to identify the DXR–HDS module (not shown).

6. DISCUSSION

ACKNOWLEDGMENTS

REFERENCES

BARABASI, A. L. AND OLTVAI, Z. N. (2004). Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5, 101-113.

DEMPSTER, A. P. (1972). Covariance selection. Biometrics 28, 157-175.

DOBRA, A., JONES, B., HANS, C., NEVINS, J. AND WEST, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90, 196-212.

DRTON, M. AND PERLMAN, M. D. (2003). A SINful approach to model selection for Gaussian precision graphs. Technical Report. University of Washington.

EDWARDS, D. (2000). Introduction to Graphical Modelling, 2nd edition. New York: Springer.

FRIEDMAN, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303, 799-805.

FRIEDMAN, J. H. AND POPESCU, B. E. (2004). Gradient directed regularization. Technical Report. Stanford University.

GARDNER, T. S., DI BERNARDO, D., LORENZ, D. AND COLLINS, J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102-105.

GUI, J. AND LI, H. (2005). Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. Pacific Symposium on Biocomputing 10, 272-283.

IDEKER, T., THORSSON, V., RANISH, J. A., CHRISTMAS, R., BUHLER, J., ENG, J. K., BUMGARNER, R., GOODLETT, D. R., AEBERSOLD, R. AND HOOD, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929-934.

JEONG, H., MASON, S. P., BARABASI, A. L. AND OLTVAI, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41-42.

JOHNSTONE, I. M. AND SILVERMAN, B. W. (2004). Needles and hay in haystacks: empirical Bayes estimates of possibly sparse sequences. Annals of Statistics 32, 1594-1649.

LIN, S. P. AND PERLMAN, M. D. (1985). A Monte Carlo comparison of four estimators of a covariance matrix. In Krishnaiah, P. R. (ed.), Multivariate Analysis, Volume 6. Amsterdam: North-Holland, pp. 411-429.

MEINSHAUSEN, N. AND BUHLMANN, P. (2006). Consistent neighbourhood selection for high-dimensional graphs with the lasso. Annals of Statistics (in press).

SCHAFER, J. AND STRIMMER, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754-764.

SEGAL, E., SHAPIRA, M., REGEV, A., PE'ER, D., BOTSTEIN, D., KOLLER, D. AND FRIEDMAN, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34, 166-176.

TAVAZOIE, S., HUGHES, J. D., CAMPBELL, M. J., CHO, R. J. AND CHURCH, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281-285.

TEGNER, J., YEUNG, M. K., HASTY, J. AND COLLINS, J. J. (2003). Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proceedings of the National Academy of Sciences of the United States of America 100, 5944-5949.

TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58, 267-288.

WILLE, A., ZIMMERMANN, P., VRANOVA, E., FURHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIC, A., VON ROHR, P., THIELE, L. et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5, 1-13.

ZOU, H., HASTIE, T. AND TIBSHIRANI, R. (2004). On the "degrees of freedom" of the lasso. Technical Report. Department of Statistics, Stanford University.

Hongzhe Li, Jiang Gui. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks, Biostatistics, 2006, 302-317, DOI: 10.1093/biostatistics/kxj008