Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks
Advance Access publication on December
HONGZHE LI 0
JIANG GUI 0
0 Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine , 920 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021 , USA
© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: .
Empirical Bayes thresholding; Graphical models; Microarray; Threshold gradient descent
1. INTRODUCTION
2. GAUSSIAN GRAPHICAL MODELS
We assume that the gene expression data observed are randomly sampled observational or experimental
data from a multivariate normal probability model. Specifically, let X be a random normal p-dimensional
vector and $X_1, \ldots, X_p$ denote its p elements, where p is the number of genes. Let $V = \{1, \ldots, p\}$ be
the set of nodes (genes), and $X^{(k)}$ be the vector of gene expression levels for the kth sample. We assume
that
$$X \sim N_p(0, \Sigma),$$
where $\Sigma$ is the covariance matrix and $\Omega = \Sigma^{-1} = (\omega_{ij})$ is the corresponding precision
(concentration) matrix. This model is also called a covariance selection model (Dempster, 1972) or a
Gaussian concentration graph model.
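Let $[-i]$ denote the set $\{1, 2, \ldots, i-1, i+1, \ldots, p\}$. In the Gaussian graphical model, it is
well known that the partial regression coefficient of $X_i$ on $X_j$ in the normal linear regression $p(X_i \mid X_{[-i]})$
is $-\omega_{ij}/\omega_{ii}$, $j \in [-i]$, and the partial correlation between the ith and the jth gene is
$\rho_{ij} = -\omega_{ij}/\sqrt{\omega_{ii}\omega_{jj}}$. For a given gene g, we define the neighbors of this gene as
$$\mathrm{ne}_g = \{j : \omega_{gj} \neq 0,\ j \in [-g]\},$$
so that, given its neighbors, gene g is conditionally independent of all remaining genes:
$$X_g \perp X_{G \setminus (\mathrm{ne}_g \cup g)} \mid X_{\mathrm{ne}_g}.$$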
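The relations above between the precision matrix, partial regression coefficients, partial correlations, and neighbor sets can be illustrated with a minimal Python sketch. The function names and the numerical tolerance `tol` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def partial_correlations(omega):
    """Partial correlations rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)  # convention: unit diagonal
    return rho

def regression_coefficients(omega, i):
    """Coefficients -omega_ij / omega_ii of X_j (j != i) when X_i is regressed on the rest."""
    beta = -omega[i] / omega[i, i]
    return np.delete(beta, i)

def neighbors(omega, g, tol=1e-12):
    """Indices j != g with a nonzero precision entric omega_gj (up to tol, an assumed tolerance)."""
    return [j for j in range(omega.shape[0]) if j != g and abs(omega[g, j]) > tol]

# Small chain graph 0 - 1 - 2 encoded in a tridiagonal precision matrix.
omega = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
rho = partial_correlations(omega)
```

In this toy example genes 0 and 2 are not neighbors, so their partial correlation is zero even though their marginal correlation is not.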
3. EBT AND THRESHOLD GRADIENT DESCENT REGULARIZATION
Estimation based on EBT when n > p
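$$\hat{\rho}_{ij} = -\frac{\hat{\omega}_{ij}}{\sqrt{\hat{\omega}_{ii}\hat{\omega}_{jj}}}.$$
We then perform Fisher's Z-transformation on all the partial correlations and denote the Z-transformed
partial correlation as $z_{ij}$, i.e.
$$z_{ij} = \frac{1}{2}\log\frac{1+\hat{\rho}_{ij}}{1-\hat{\rho}_{ij}}.$$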
Following Johnstone and Silverman (2004), we assume the following model for zi j :
$$z_{ij} = \xi_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2),$$
where $\xi_{ij}$ is the Z-transformation of the true partial correlation $\rho_{ij}$, $\sigma^2$ is the error variance, and the
elements $\xi_{ij}$ have a mixture of 0 and Laplace distribution,
where w is the mixture probability and $\delta_0(\xi)$ is the point mass at zero. From this model, one
can derive the posterior distribution of $\xi_{ij}$. Johnstone and Silverman (2004) suggested thresholding the
values of $z_{ij}$ by the posterior median of $\xi_{ij}$, and they showed that the resulting estimate of $\xi_{ij}$ is uniformly
bounded over all signals,
p( p − 1) i j
for some constant C0.
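For reference, the mixture of a point mass at zero and a Laplace distribution described above is commonly written as follows in the empirical Bayes thresholding framework of Johnstone and Silverman (2004). The symbol $\gamma_a$ and the scale parameter $a$ are notation assumed here for the Laplace density, and reading w as the prior probability of a nonzero signal is likewise an assumption.

```latex
f(\xi) = (1 - w)\,\delta_0(\xi) + w\,\gamma_a(\xi),
\qquad
\gamma_a(\xi) = \frac{a}{2}\exp(-a|\xi|), \quad a > 0.
```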
After the EBT, we would expect that many of the elements of the precision matrix with very small
values of the partial correlations are thresholded to zero, corresponding to no edges of the Gaussian graph.
This MLE–EBT approach is similar in spirit to that in Schafer and Strimmer (2005) in the settings when
p < n.
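The first stage of the MLE–EBT approach (inverting the sample covariance matrix, forming partial correlations, and applying Fisher's Z-transformation) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name is hypothetical, the columns are centered before forming the covariance, and the posterior-median thresholding step itself is not implemented.

```python
import numpy as np

def fisher_z_partial_correlations(X):
    """Return estimated partial correlations and their Fisher Z-transforms.

    X is an (n, p) data matrix with n > p so that the sample covariance
    is invertible. Centering the columns is an assumption of this sketch.
    """
    n, p = X.shape
    if n <= p:
        raise ValueError("the MLE of the precision matrix requires n > p")
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                      # MLE of the covariance matrix
    omega_hat = np.linalg.inv(S)           # MLE of the precision matrix
    d = np.sqrt(np.diag(omega_hat))
    rho_hat = -omega_hat / np.outer(d, d)  # estimated partial correlations
    np.fill_diagonal(rho_hat, 1.0)
    iu = np.triu_indices(p, k=1)           # one entry per gene pair i < j
    z = np.arctanh(rho_hat[iu])            # equals 0.5 * log((1 + r) / (1 - r))
    return rho_hat, z

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((50, 4))
rho_demo, z_demo = fisher_z_partial_correlations(X_demo)
```

The vector `z_demo` holds one value per gene pair and would be the input to the thresholding step.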
Regularized estimation by TGD on the off-diagonal elements
Based on equation (3.1), the gradient of the loss function with respect to
l(ωd , ωo) = −w(ωd , ωo).
$$h(\nu) = \{f_j(\nu) \cdot g_j(\nu),\ j = 1, \ldots, q\},$$
1. Set $\omega_o(0) = 0$, $\omega_d(0) = 1$, $\nu = 0$.
2. Calculate $g(\nu) = -\partial l/\partial \omega_o$ for the current $\omega_o$ and $\omega_d$.
6. Repeat steps 2–5.
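A threshold gradient descent iteration of this kind might be sketched in Python as follows. Several parts are assumptions, since only steps 1, 2, and 6 are stated here: the log-likelihood is taken as log|Ω| − tr(SΩ) up to constants, the indicator f_j keeps components whose gradient magnitude is at least `tau` times the largest one (following the general scheme of Friedman and Popescu, 2004), the diagonal is updated by a plain gradient step, and the step is halved whenever the update would lose positive definiteness. The names `tgd_precision`, `tau`, `step`, and `n_steps` are hypothetical.

```python
import numpy as np

def log_likelihood(omega, S):
    """Gaussian log-likelihood up to constants: log|Omega| - tr(S Omega)."""
    _, logdet = np.linalg.slogdet(omega)
    return logdet - np.trace(S @ omega)

def tgd_precision(S, tau=0.9, step=0.01, n_steps=200):
    p = S.shape[0]
    omega = np.eye(p)                      # step 1: off-diagonals 0, diagonals 1
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(n_steps):
        sigma = np.linalg.inv(omega)
        grad = (sigma + sigma.T) / 2 - S   # gradient of the log-likelihood in Omega
        g = np.where(off_diag, grad, 0.0)  # step 2: negative loss gradient, off-diagonals
        g_max = np.abs(g).max()
        if g_max == 0.0:
            break
        f = np.abs(g) >= tau * g_max       # assumed threshold indicator f_j
        h = np.where(f, g, 0.0)            # h = f * g, the thresholded direction
        h = h + np.diag(np.diag(grad))     # assumed plain gradient step on the diagonal
        t = step
        candidate = omega + t * h
        while np.linalg.eigvalsh(candidate).min() <= 1e-8:
            t /= 2                         # safeguard: stay positive definite
            candidate = omega + t * h
        omega = candidate
    return omega

# Toy example: the covariance implied by a sparse tridiagonal precision matrix.
omega_true = np.eye(4) + 0.3 * (np.eye(4, k=1) + np.eye(4, k=-1))
S_demo = np.linalg.inv(omega_true)
omega_est = tgd_precision(S_demo, tau=0.9, step=0.01, n_steps=300)
```

With `tau` close to 1 only the steepest components move at each step, which tends to produce sparser paths; `tau = 0` reduces to ordinary gradient ascent on all off-diagonal elements.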
Model selection by cross-validation and bootstrap
$$\sum_{k=1}^{K} \Bigl( -n_k \log|\hat{\Omega}_{-k}| + \sum_{i \in V_k} \cdots \Bigr)$$
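A cross-validation score of this general shape can be sketched in Python under stated assumptions: each held-out fold is scored by the Gaussian negative log-likelihood (up to constants) evaluated at the precision matrix fitted without that fold, and the total is divided by n. The function `cv_score`, the callable `fit`, and the normalization are hypothetical choices for illustration, not the paper's exact criterion.

```python
import numpy as np

def cv_score(X, fit, n_folds=5, seed=0):
    """K-fold cross-validated negative Gaussian log-likelihood (up to constants).

    `fit` is any callable mapping a training data matrix to an estimated
    precision matrix, for example one TGD fit at a fixed number of steps.
    """
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    total = 0.0
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        omega = fit(X[train])                         # estimate without fold k
        _, logdet = np.linalg.slogdet(omega)
        Xk = X[held_out]
        quad = np.einsum("ij,jk,ik->", Xk, omega, Xk)  # sum of x_i' Omega x_i over the fold
        total += -len(held_out) * logdet + quad
    return total / n

rng = np.random.default_rng(1)
X_cv = rng.standard_normal((30, 3))
score_identity = cv_score(X_cv, lambda A: np.eye(A.shape[1]))
```

Smaller scores indicate better held-out fit, so a tuning parameter such as the number of TGD steps would be chosen to minimize the score.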
4. SIMULATIONS
Estimation when p < n
We consider Gaussian precision graphs with 40 nodes and the following four precision matrices ($\Omega$) with
different degrees of sparsity:
$$l_2(\Omega, \hat{\Omega}) = \operatorname{tr}(\Omega^{-1}\hat{\Omega} - I)^2,$$
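The quadratic loss can be computed with a short Python sketch. It assumes the usual reading of the notation, namely the trace of the squared matrix rather than the square of the trace; the function name is hypothetical.

```python
import numpy as np

def quadratic_loss(omega, omega_hat):
    """Quadratic loss tr[(Omega^{-1} Omega_hat - I)^2] (assumed reading of the notation)."""
    p = omega.shape[0]
    d = np.linalg.solve(omega, omega_hat) - np.eye(p)  # Omega^{-1} Omega_hat - I
    return np.trace(d @ d)

loss_exact = quadratic_loss(np.eye(3), np.eye(3))
loss_scaled = quadratic_loss(np.eye(3), 2.0 * np.eye(3))
```

The loss is zero when the estimate equals the true precision matrix and grows as the estimate departs from it.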
Estimation when p > n
the right panel represents the quadratic loss.
Fig. 4. Results based on simulation for Gaussian graphs with p = 200 and sample size n = 100. For each plot, the
x-axis is the TGD step, and the y-axis is sensitivity (a), specificity (b), false discovery rate (c), and false-negative rate (d).
The dashed lines are ±1 SE based on 50 replications.
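The four edge-recovery measures plotted in Fig. 4 can be computed from a true and an estimated precision matrix as in the following Python sketch. The definitions used are assumptions: an edge is a nonzero off-diagonal entry (up to a tolerance), the false discovery rate is FP/(TP + FP), and the false-negative rate is taken as FN/(TP + FN). The function name is hypothetical.

```python
import numpy as np

def edge_recovery_rates(omega_true, omega_hat, tol=1e-8):
    """Sensitivity, specificity, FDR, and FNR for recovered edges (pairs i < j)."""
    iu = np.triu_indices(omega_true.shape[0], k=1)
    truth = np.abs(omega_true[iu]) > tol
    est = np.abs(omega_hat[iu]) > tol
    tp = np.sum(truth & est)
    fp = np.sum(~truth & est)
    fn = np.sum(truth & ~est)
    tn = np.sum(~truth & ~est)

    def ratio(a, b):
        # Undefined rates (empty denominators) are reported as NaN.
        return a / b if b > 0 else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fdr": ratio(fp, tp + fp),
        "fnr": ratio(fn, tp + fn),
    }
```

Evaluating these rates at every TGD step and averaging over replications would give curves of the kind described in the caption.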
5. APPLICATIONS TO ISOPRENOID PATHWAYS IN A. thaliana
antibiotics.
involved.
5.1 Results from the TGD procedure
In order to demonstrate whether the proposed TGD method can identify the known isoprenoid pathways of these 40 genes based on the 118 gene expression measurements, we first estimated the precision
Fig. 6. Pathways identified by the tri-graph method by Wille et al. (2004) (left plot) and the SINful approach with
cutoff p-value of 0.50 (right plot) for the 40 genes in the isoprenoid pathways, where the solid arrows are the true
pathways and the curved undirected lines are the estimated edges. For each plot, the left panel includes a subgraph
of the gene module in the MEP pathway and the right panel includes a subgraph of the gene module in the MVA
pathway.
5.2 Comparison with other methods
As a comparison, we applied the SINful procedure using the inverse of the sample covariance matrix and
identified by either of the two methods. Even if the p-value is set to 0.50, many false edges between MEP
and MVA pathways are identified, and even the tightly connected DXR–HDS
[1-hydroxy-2-methyl-2-(E)-butenyl-4-diphosphate synthase] module cannot be identified (see right plot of Figure 6). Similarly, the
MLE–EBT procedure also only identified a few edges and failed to identify the DXR–HDS module (not
shown).
6. DISCUSSION
ACKNOWLEDGMENTS
REFERENCES
BARABASI, A. L. AND OLTVAI, Z. N. (2004). Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5, 101–113.
DEMPSTER, A. P. (1972). Covariance selection. Biometrics 28, 157–175.
DOBRA, A., JONES, B., HANS, C., NEVINS, J. AND WEST, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90, 196–212.
DRTON, M. AND PERLMAN, M. D. (2003). A SINful approach to model selection for Gaussian precision graphs. Technical Report. University of Washington.
EDWARDS, D. (2000). Introduction to Graphical Modelling, 2nd edition. New York: Springer.
FRIEDMAN, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303, 799–805.
FRIEDMAN, J. H. AND POPESCU, B. E. (2004). Gradient directed regularization. Technical Report. Stanford University.
GARDNER, T. S., DI BERNARDO, D., LORENZ, D. AND COLLINS, J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105.
GUI, J. AND LI, H. (2005). Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. Pacific Symposium on Biocomputing 10, 272–283.
IDEKER, T., THORSSON, V., RANISH, J. A., CHRISTMAS, R., BUHLER, J., ENG, J. K., BUMGARNER, R., GOODLETT, D. R., AEBERSOLD, R. AND HOOD, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929–934.
JEONG, H., MASON, S. P., BARABASI, A. L. AND OLTVAI, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41–42.
JOHNSTONE, I. M. AND SILVERMAN, B. W. (2004). Needles and hay in haystacks: empirical Bayes estimates of possibly sparse sequences. Annals of Statistics 32, 1594–1649.
LIN, S. P. AND PERLMAN, M. D. (1985). A Monte Carlo comparison of four estimators of a covariance matrix. In Krishnaiah, P. R. (ed), Multivariate Analysis, Volume 6. Amsterdam: North-Holland, pp. 411–429.
MEINSHAUSEN, N. AND BUHLMANN, P. (2006). Consistent neighbourhood selection for high-dimensional graphs with the lasso. Annals of Statistics (in press).
SCHAFER, J. AND STRIMMER, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764.
SEGAL, E., SHAPIRA, M., REGEV, A., PE'ER, D., BOTSTEIN, D., KOLLER, D. AND FRIEDMAN, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34, 166–176.
TAVAZOIE, S., HUGHES, J. D., CAMPBELL, M. J., CHO, R. J. AND CHURCH, G. M. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281–285.
TEGNER, J., YEUNG, M. K., HASTY, J. AND COLLINS, J. J. (2003). Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proceedings of the National Academy of Sciences of the United States of America 100, 5944–5949.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58, 267–288.
WILLE, A., ZIMMERMANN, P., VRANOVA, E., FURHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIC, A., VON ROHR, P., THIELE, L. et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5, 1–13.
ZOU, H., HASTIE, T. AND TIBSHIRANI, R. (2004). On the “degrees of freedom” of the lasso. Technical Report. Department of Statistics, Stanford University.