Using Stochastic Causal Trees to Augment Bayesian Networks for Modeling eQTL Datasets
Kyle C Chipman (0), Ambuj K Singh (0, 1)
0 Biomolecular Science and Engineering Program, UC Santa Barbara, Santa Barbara, CA, USA
1 Department of Computer Science, UC Santa Barbara, Santa Barbara, CA, USA
Background: The combination of genotypic and genome-wide expression data arising from segregating populations offers an unprecedented opportunity to model and dissect complex phenotypes. The immense potential offered by these data derives from the fact that genotypic variation is the sole source of perturbation and can therefore be used to reconcile changes in gene expression programs with the parental genotypes. To date, several methodologies have been developed for modeling eQTL data. These methods generally leverage genotypic data to resolve causal relationships among gene pairs implicated as associates in the expression data. In particular, leading studies have augmented Bayesian networks with genotypic data, providing a powerful framework for learning and modeling causal relationships. While these initial efforts have provided promising results, one major drawback associated with these methods is that they are generally limited to resolving causal orderings for transcripts most proximal to the genomic loci. In this manuscript, we present a probabilistic method capable of learning the causal relationships between transcripts at all levels in the network. We use the information provided by our method as a prior for Bayesian network structure learning, resulting in enhanced performance for gene network reconstruction.
Results: Using established protocols to synthesize eQTL networks and corresponding data, we show that our method achieves improved performance over existing leading methods. For the goal of gene network reconstruction, our method achieves improvements in recall ranging from 20% to 90% across a broad range of precision levels and for datasets of varying sample sizes. Additionally, we show that the learned networks can be utilized for expression quantitative trait loci mapping, resulting in upwards of 10-fold increases in recall over traditional univariate mapping.
Conclusions: Using the information from our method as a prior for Bayesian network structure learning yields large improvements in accuracy for the tasks of gene network reconstruction and expression quantitative trait loci mapping. In particular, our method is effective for establishing causal relationships between transcripts located both proximally and distally from genomic loci.
Background
In order to model and dissect the complexity underlying
physiological processes, including diseases,
developmental programs, and responses to pharmacological
treatments, systematic approaches based on genome-wide
data are imperative. Expression profiling technologies,
such as microarray [1,2] and RNA-Seq platforms [3],
provide quantification of mRNA levels on a
genome-wide scale, prompting computational methods aimed at
learning a more holistic perspective of cellular processes.
Parallel advancements in the area of genotypic profiling,
including high-throughput sequencing and SNP
detection, offer information complementary to that of
expression data. These concurrent developments pave the way
for genetical genomic studies, which provide the joint
space of expression and genotypic data corresponding to
offspring that arise from a segregating population [4].
To date, expression quantitative trait loci (eQTL) datasets have been published for several
organisms [5-10], providing ample opportunity to
develop novel computational methodologies. The
tandem existence of expression and genotypic data is
especially powerful in that it allows one to reconcile
changes in expression programs in the context of the
specific genetic combinations represented by the
offspring. Since natural genetic variation is the sole source
of perturbation, it is logical to view genomic loci as
epicenters of phenotypic variation in eQTL-derived causal
networks. Consequently, modeling eQTL datasets
enables one to hypothesize on how genotypic variation
results in phenotypic changes.
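
The idea of a genomic locus acting as the epicenter of variation can be illustrated with a small simulation. The sketch below is only a toy model under stated assumptions and is not the method presented in this manuscript: it assumes a single binary locus, a hypothetical linear chain locus -> T1 -> T2 with Gaussian noise, and illustrative parameter values. The function names `simulate_chain` and `partial_corr` are introduced here for illustration. The sketch shows that the locus signal weakens with causal distance and that the locus carries little information about T2 once T1 is accounted for.

```python
import numpy as np


def simulate_chain(n_samples=2000, effect=1.0, noise=0.5, seed=0):
    """Simulate a toy causal chain: locus -> T1 -> T2 (hypothetical model)."""
    rng = np.random.default_rng(seed)
    # Assumption: one biallelic locus coded 0/1, as in a cross of two parental strains.
    locus = rng.integers(0, 2, size=n_samples).astype(float)
    # T1 is perturbed directly by the locus; T2 is perturbed only through T1.
    t1 = effect * locus + rng.normal(0.0, noise, size=n_samples)
    t2 = effect * t1 + rng.normal(0.0, noise, size=n_samples)
    return locus, t1, t2


def partial_corr(x, y, z):
    """Correlation of x and y after regressing z (plus an intercept) out of both."""
    design = np.column_stack([np.ones_like(z), z])
    resid_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    resid_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(resid_x, resid_y)[0, 1])


locus, t1, t2 = simulate_chain()
r_locus_t1 = float(np.corrcoef(locus, t1)[0, 1])
r_locus_t2 = float(np.corrcoef(locus, t2)[0, 1])
r_locus_t2_given_t1 = partial_corr(locus, t2, t1)

print(f"corr(locus, T1)      = {r_locus_t1:.3f}")
print(f"corr(locus, T2)      = {r_locus_t2:.3f}")
print(f"corr(locus, T2 | T1) = {r_locus_t2_given_t1:.3f}")
```

In this toy setting the marginal association with the locus is stronger for T1 than for T2, while the association between the locus and T2 conditional on T1 is close to zero. Real eQTL data involve many loci, many transcripts, and nonlinear effects, so the sketch should be read only as intuition for why loci can serve as causal anchors.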
Several studies have already provided methodologies
aimed at exploiting the genotypic component of eQTL
data to improve causal modeling in gene networks
[9,11-16]. Bing et al. introduced methodology to build
directed networks starting from a set of candidate
cis-genes for each locus [14], establishing directed edges
from candidate cis-genes to distally located genes. This
approach yields local regulatory models for individual
loci, and the authors also present an innovative
approach based on partial correlations to identify
models where two regulators play complementary roles in
controlling a common set of genes. The methodology of
Bing et al. was later applied to an eQTL dataset
representing Arabidopsis by Keurentjes et al., who also
incorporated information regarding DNA sequence to
improve the estimation of cis-genes [9]. While this
application was successful in providing hypotheses
regarding local regulatory models, it does not resolve
causal orderings amongst distally located transcripts.
Furthermore, modeling local regulatory programs with
respect to individual loci leaves room for improvement
in the sense that the respective models are
disjoint. A worthwhile goal is to produce more holistic and
systematic methodologies capable of modeling the
complex interdependencies between multiple loci and
transcripts. Indeed, it has been estimated that the genetic
basis of many transcripts is extensively complex, with
upwards of 50% of transcripts being linked to five or
more loci [17]. The need for a comprehensive and
systematic approach was addressed by Schadt and
colleagues, who developed a novel method to augment
Bayesian networks with probabilistic measures to direct
causal orderings of gene pairs with respect to genomic
loci [11-13,18]. Their method, which is based on a
conditional bivariate normal model, determines if two
transcripts linked to a common locus are best modeled as
causal or independent [12]. Ultimately, the information
generated by their method is incorporated as a prior for
Bayesian network structure learning [19]. This approach
has yielded promising results when applied to yeast [13]
and mouse [12], providing hypotheses regarding the
architecture of eQTL networks. Furthermore, the authors
published a study on synthetic networks to quantify
the performance gains associated with their method [18]
as compared to standard Bayesian network structure
learning. While their method proved efficacious at
resolving causal orientations between correlated
transcripts in the context of a global network, the scope is
generally limited to the upper echelons of the causal
hierarchy, an attribute that stems from their reliance on
using genomic loci as causal anchors. Ideally, one could
commence at the genomic loci, learn the causal orderings
of the most proximal transcripts, then advance down the
causal hierarchy propagating the structural information
gleaned from the upper levels of th (...truncated)