State Space Model with hidden variables for reconstruction of gene regulatory networks
Wu et al. BMC Systems Biology 2011, 5(Suppl 3):S3
http://www.biomedcentral.com/1752-0509/5/S3/S3
RESEARCH
Open Access
Xi Wu1, Peng Li2, Nan Wang1, Ping Gong3, Edward J Perkins4, Youping Deng5, Chaoyang Zhang1*
From BIOCOMP 2010 - The 2010 International Conference on Bioinformatics and Computational Biology
Las Vegas, NV, USA. 12-15 July 2010
Abstract
Background: State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks. It
requires less computational time than Dynamic Bayesian Networks (DBN). There are two types of variables in the
linear SSM, observed variables and hidden variables. SSM uses an iterative method, namely Expectation-Maximization, to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly
observed from experiments. The choice of the number of hidden variables has a significant impact on the
accuracy of network inference. In this study, we used SSM to infer gene regulatory networks (GRNs) from synthetic
time series datasets, investigated Bayesian Information Criterion (BIC) and Principal Component Analysis (PCA)
approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in
comparison with DBN.
Method: True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and
linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the
true networks.
Results: Our results show that inference precision varied with the number of hidden variables. For some regulatory
networks, the inference precision of DBN was higher, but SSM performed better in other cases. Although the
overall performance of the two approaches is comparable, SSM is much faster and capable of inferring much larger
networks than DBN.
Conclusion: This study provides useful information on handling the hidden variables and improving the inference
precision of SSM.
Introduction
Microarrays can simultaneously measure the expression
of thousands of genes. In the past decade or so, many
time series experiments have employed microarrays to
profile the temporal change of gene expression. For
instance, one can retrieve many time-course gene
expression datasets from the Gene Expression Omnibus
database (http://www.ncbi.nlm.nih.gov/geo/). These
datasets usually have much smaller numbers of time
points, compared to the large number of genes. Here we
focus on how to infer gene regulatory networks (GRNs)
from time series microarray datasets.
* Correspondence:
1 School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA
Full list of author information is available at the end of the article
Any effective GRN inference method has to cope well
with the large number of genes and the small number
of time points that characterize microarray datasets.
During the past few decades, many methods have been
developed, such as Dynamic Bayesian Network (DBN)
[1,2] and Probabilistic Boolean Network (PBN) [3]. However, DBN and PBN cannot be used to infer large networks with hundreds of genes due to computational
overhead. Thus, there is a need to study different
approaches to improving inference accuracy and reducing computational cost.
A State Space Model (SSM) [4-8] has been developed
for GRN inference in recent years. It has attracted much
attention because it has a much higher computational
efficiency than methods such as DBN and can handle noise well.
The variables in SSM can be divided into two groups, hidden variables
and observed variables. Observed variables are expression levels of genes
measured by microarray experiments. Hidden variables represent aspects
of the evolution process that cannot be observed directly.
© 2011 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://
creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
In this study, we investigated the performance of SSM
and addressed the effect of the number of hidden variables
on inference accuracy. An intuitive choice is to let
the number of hidden variables equal that of observed
variables, but the SSM estimation may then fail to converge.
To make it feasible to infer a large network from a limited number
of time points, we need to determine the number of
hidden variables in SSM. Previous studies have used the Bayesian
Information Criterion (BIC) [4,6,7], cross-validation [5], and
Principal Component Analysis (PCA) [9,10] to determine
the number of hidden variables. Each of these methods gives a
unique value for the number of hidden variables under
its own definition of optimality. However, since
we are mostly interested in inference of GRNs, the
accuracy of the inferred GRNs should define the optimality
criterion. That is, the optimal number of hidden variables
is the one that leads to the highest accuracy. We found that
the PCA and BIC approaches do not necessarily produce an
optimal number of hidden variables. Instead, simply fixing
the number of hidden variables may give better
or comparable accuracy in SSM. To evaluate the overall
performance of SSM with hidden variables, we inferred
a number of GRNs using synthetic datasets with different
numbers of genes and time points generated from
GeneNetWeaver [11].
Methods
In this section, we briefly present the SSM method and
two approaches (BIC and PCA) for determining the
number of hidden variables in GRN inference.
State Space Model
There are two kinds of variables in SSM [12-14], hidden
variables $x_t$ with dimension $m$ and observed variables $y_t$
with dimension $l$. SSM consists of system and observation equations:
$$x_t = F x_{t-1} + w_t, \qquad y_t = H x_t + v_t. \tag{1}$$
$w_t$ and $v_t$ are Gaussian noise terms. $F$ is the $m \times m$ state
transition matrix and $H$ is the $l \times m$ observation matrix. Matrices $F$ and
$H$ can be used to determine the GRN [7,14]:
$$C = H F (H^{\top} H)^{-1} H^{\top}. \tag{2}$$
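One way to see where Eq. (2) comes from is the following sketch. It neglects the noise terms and assumes that $H$ has full column rank, so that $H^{\top} H$ is invertible; these are simplifying assumptions made here for illustration and are not necessarily the derivation used in [7,14].

```latex
\begin{align*}
y_t &= H x_t + v_t = H F x_{t-1} + H w_t + v_t, \\
\hat{x}_{t-1} &= (H^{\top} H)^{-1} H^{\top} y_{t-1}
  \quad \text{(least-squares estimate of } x_{t-1} \text{ from } y_{t-1}\text{)}, \\
y_t &\approx H F (H^{\top} H)^{-1} H^{\top} y_{t-1} = C\, y_{t-1}.
\end{align*}
```

Under this reading, $C$ is an $l \times l$ matrix that maps the expression levels at time $t-1$ to those at time $t$, so its entry $C_{ij}$ can be interpreted as the influence of gene $j$ on gene $i$.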
We used expectation-maximization (EM) [12,15] to
infer parameters in SSM.
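The model in Eq. (1) and the gene-gene matrix in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this study: the function names `simulate_ssm` and `gene_interaction_matrix`, the isotropic noise level `noise_std`, and the random choices of F and H are all assumptions made for the example, and the EM estimation step is not shown.

```python
import numpy as np

def simulate_ssm(F, H, T, noise_std=0.1, seed=0):
    """Simulate T observations from x_t = F x_{t-1} + w_t, y_t = H x_t + v_t.

    Assumption: w_t and v_t are zero-mean Gaussian with the same
    isotropic standard deviation `noise_std` (chosen for illustration).
    """
    rng = np.random.default_rng(seed)
    m = F.shape[0]          # number of hidden variables
    l = H.shape[0]          # number of observed variables (genes)
    x = rng.normal(size=m)  # arbitrary initial hidden state
    Y = np.empty((T, l))
    for t in range(T):
        x = F @ x + noise_std * rng.normal(size=m)      # system equation
        Y[t] = H @ x + noise_std * rng.normal(size=l)   # observation equation
    return Y

def gene_interaction_matrix(F, H):
    """Compute C = H F (H^T H)^{-1} H^T as in Eq. (2).

    Requires H to have full column rank so that H^T H is invertible.
    """
    # solve(H^T H, H^T) evaluates (H^T H)^{-1} H^T without forming the inverse
    return H @ F @ np.linalg.solve(H.T @ H, H.T)

# Hypothetical example: 3 hidden variables, 10 genes, 20 time points
rng = np.random.default_rng(1)
m, l, T = 3, 10, 20
F = 0.5 * rng.normal(size=(m, m))
H = rng.normal(size=(l, m))
Y = simulate_ssm(F, H, T)
C = gene_interaction_matrix(F, H)
```

In this sketch `Y` has one row per time point and one column per gene, and `C` is an l-by-l matrix whose entries would be thresholded or ranked to propose regulatory edges.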
Bayesian Information Criterion
As mentioned above, the number of hidden variables is an
important factor affecting the accuracy of inferred GRNs.
Previous studies [4,6,7] used BIC to determine it. We will
demonstrate that BIC does not necessarily give the optimal
solution. According to [12], BIC is defined as follows:
$$\mathrm{BIC} = \ln P(x_t, y_t \mid \theta) - \frac{1}{2} N_\theta \ln N. \tag{3}$$
$P(x_t, y_t \mid \theta)$ is the probability of the hidden and observed
variables given the parameters $\theta$; $N_\theta$ is the
number of parameters; $N$ is the number of data points.
BIC can be calculated for any given number of hidden
variables. The number of hidden variables that yields the
largest BIC is adopted as the optimal solution.
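The selection rule based on Eq. (3) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the parameter count includes only the entries of F and H (noise covariances and initial-state parameters are ignored), and the log-likelihood values are made-up placeholders standing in for the output of a fitted SSM, not results from this study.

```python
import math

def bic(log_lik, n_params, n_points):
    """Eq. (3): BIC = ln P - (1/2) * N_theta * ln N (larger is better)."""
    return log_lik - 0.5 * n_params * math.log(n_points)

def count_params(m, l):
    """Assumed parameter count: entries of F (m x m) plus H (l x m)."""
    return m * m + l * m

def select_hidden_dim(log_liks, l, n_points):
    """Return the number of hidden variables with the largest BIC.

    log_liks maps each candidate m to the log-likelihood of an SSM
    fitted with m hidden variables (fitting itself is not shown).
    """
    scores = {m: bic(ll, count_params(m, l), n_points)
              for m, ll in log_liks.items()}
    best_m = max(scores, key=scores.get)
    return best_m, scores

# Placeholder log-likelihoods for m = 1, 2, 3 (illustrative values only)
placeholder_log_liks = {1: -120.0, 2: -90.0, 3: -85.0}
best_m, scores = select_hidden_dim(placeholder_log_liks, l=5, n_points=20)
```

With these placeholder numbers the likelihood keeps improving as m grows, but the penalty term grows faster beyond m = 2, which illustrates how BIC trades fit against model size.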
Principal Component Analysis
Because the number of time points is usually much
smaller than the number of genes, a microarray dataset
yt(t = 1,...T) has redundant information. From another
aspect of view, all measureme (...truncated)