Inferring Gene Regulatory Networks from a Population of Yeast Segregants
www.nature.com/scientificreports
Received: 31 July 2018
Accepted: 30 November 2018
Chen Chen1, Dabao Zhang1,3, Tony R. Hazbun2,3 & Min Zhang1,3
Constructing gene regulatory networks is crucial to unraveling the genetic architecture of complex
traits and to understanding the mechanisms of diseases. On the basis of gene expression and single
nucleotide polymorphism data in the yeast, Saccharomyces cerevisiae, we constructed gene regulatory
networks using a two-stage penalized least squares method. A large system of structural equations
via optimal prediction of a set of surrogate variables was established at the first stage, followed by
consistent selection of regulatory effects at the second stage. Using this approach, we identified
subnetworks that were enriched in gene ontology categories, revealing directional regulatory
mechanisms controlling these biological pathways. Our mapping and analysis of expression-based
quantitative trait loci uncovered a known alteration of gene expression within a biological pathway that
results in regulatory effects on companion pathway genes in the phosphocholine network. In addition,
we identified nodes in these gene ontology-enriched subnetworks that are coordinately controlled
by transcription factors driven by trans-acting expression quantitative trait loci. Altogether, the
integration of documented transcription factor regulatory associations with subnetworks defined by a
system of structural equations using quantitative trait loci data is an effective means to delineate the
transcriptional control of biological pathways.
Gene expression is a fundamental step in the flow of information from an organism’s genotype to its phenotype. The
genetic information encoded in an organism’s DNA is transferred into a functional gene product (e.g., protein)
via the process of gene expression, which in turn shapes the organism’s phenotype. Gene
expression levels have been found to be associated with a broad range of complex traits and diseases1, and thus play an
important role in determining an organism’s development. Numerous efforts have been made to map phenotypes
to gene expression in order to dissect their genetic basis.
Genes rarely act in isolation; instead, they interact with each other and form gene regulatory networks
that function as a whole2. Studying these networks is crucial for understanding the properties and functions
of genes, which in turn helps reveal the genetic architecture of complex traits and diseases. Although genetic experiments can be conducted to discover interactions among genes, this approach can be costly and time-consuming.
Alternatively, measurements of gene expression levels reveal gene expression patterns in a specific condition
and can be exploited to infer gene regulatory networks. Various approaches have been proposed to infer gene
regulatory networks using gene expression data, such as relevance networks3–7, Bayesian networks8–11, Gaussian
graphical models12–15, and many others.
Recent advances in sequencing technologies make it feasible to obtain both whole-genome genotype and gene
expression for each individual, i.e., genetical genomics data16. Combining genetics with gene expression reveals
additional information on genetic structure and holds great promise for improving the accuracy of gene regulatory network inference. Numerous genetical genomics experiments, such as the Genotype-Tissue Expression
(GTEx) project17, have been conducted to collect genetical genomics data.
1Department of Statistics, Purdue University, West Lafayette, IN, 47907, USA. 2Department of Medicinal Chemistry and Molecular Pharmacology, Purdue University, West Lafayette, IN, 47907, USA. 3Purdue University Center for Cancer Research, Purdue University, West Lafayette, IN, 47907, USA. Correspondence and requests for materials should be addressed to D.Z., T.R.H. or M.Z.
Scientific Reports | (2019) 9:1197 | https://doi.org/10.1038/s41598-018-37667-4
Much effort has been devoted to using genetical genomics data for genome-wide association (GWA) analysis
of gene expression, i.e., expression quantitative trait loci (eQTL) mapping18. eQTL mapping aims to elucidate the variation in expression traits that is attributable to genomic variation, and to identify the chromosomal loci (i.e., eQTL)
of genetic polymorphisms associated with the expression of a gene under investigation. An eQTL located within
the region of the gene under investigation is called a cis-eQTL; otherwise, it is called a trans-eQTL. While the cis
effects of a gene represent direct regulation, the indirect regulation exerted by trans-eQTL is likely caused by interactions
among genes. These eQTL provide insight into the functional sequences underlying gene expression, and thus an indirect interrogation of the functional landscape of gene regulation19.
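The cis/trans distinction described above can be expressed as a simple positional rule. The following is a minimal sketch, not the procedure used in this study: the `Gene` record, the `classify_eqtl` function, and the optional `flank` parameter (which widens the gene region) are hypothetical names introduced only for illustration, and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: chromosome plus inclusive base-pair bounds."""
    chrom: str
    start: int
    end: int


def classify_eqtl(marker_chrom: str, marker_pos: int, gene: Gene, flank: int = 0) -> str:
    """Label a marker 'cis' if it lies within the gene region, else 'trans'.

    `flank` is an assumed convenience parameter (not from the text) that
    extends the gene region by a fixed number of base pairs on each side.
    """
    same_chrom = marker_chrom == gene.chrom
    in_region = gene.start - flank <= marker_pos <= gene.end + flank
    return "cis" if same_chrom and in_region else "trans"


# Illustrative gene with made-up coordinates.
gene_a = Gene(chrom="chrI", start=1000, end=2000)
labels = [
    classify_eqtl("chrI", 1500, gene_a),               # inside the gene region
    classify_eqtl("chrI", 2500, gene_a),               # same chromosome, outside
    classify_eqtl("chrII", 1500, gene_a),              # different chromosome
    classify_eqtl("chrI", 2500, gene_a, flank=1000),   # inside the widened region
]
```

With `flank=0` the rule matches the definition in the text (a marker inside the gene region); any non-zero flank is a choice the analyst would need to justify.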
Gene regulatory networks can be characterized using a system of structural equations20, with each equation
describing the causal effects of cis-eQTL and the regulatory effects of other genes on a given gene. Such a framework makes it feasible to take a genome-wide survey and to directly reveal interactions among genes. The application
of structural equations in genetical genomics studies has been demonstrated previously21–24. Two of these studies are
applicable to constructing gene regulatory networks for a small number of genes21,22. However, genetical genomics experiments usually collect whole-genome gene expression data for a very limited number of samples, so
the number of genes is much larger than the sample size. To address this, another study23 proposed
applying the adaptive lasso25 to construct a sparse gene regulatory network. An additional approach instead proposed maximizing a penalized likelihood to construct a sparse gene regulatory network24.
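To make the structural-equation view concrete, the toy simulation below may help. It is a minimal sketch under stated assumptions: a two-gene system in which gene 1 regulates gene 2, one cis-eQTL per gene, haploid 0/1 genotypes, and an unobserved factor shared by both genes. The estimator is classical, unpenalized two-stage least squares, not any of the methods cited above, and all function names, effect sizes, and the sample size are hypothetical.

```python
import numpy as np


def simulate_two_gene_system(n, gamma=0.5, seed=0):
    """Simulate y1 = 2*x1 + e1 and y2 = gamma*y1 + x2 + e2 (assumed toy model)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n, 2)).astype(float)  # genotypes at two cis-eQTL
    u = rng.normal(size=n)                              # unobserved shared factor
    e1 = u + 0.5 * rng.normal(size=n)
    e2 = u + 0.5 * rng.normal(size=n)                   # e1 and e2 are correlated via u
    y1 = 2.0 * x[:, 0] + e1
    y2 = gamma * y1 + 1.0 * x[:, 1] + e2
    return x, y1, y2


def least_squares(design, response):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(response)), design])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


def two_stage_estimate(x, y1, y2):
    """Estimate the regulatory effect of gene 1 on gene 2 in two stages."""
    # Stage 1: predict the regulator's expression from genotypes only.
    b = least_squares(x, y1)
    y1_hat = b[0] + x @ b[1:]
    # Stage 2: regress the target on the predicted regulator and its own cis-eQTL.
    c = least_squares(np.column_stack([y1_hat, x[:, 1]]), y2)
    return c[1]


x, y1, y2 = simulate_two_gene_system(n=5000, gamma=0.5, seed=0)
gamma_two_stage = two_stage_estimate(x, y1, y2)
gamma_naive = least_squares(np.column_stack([y1, x[:, 1]]), y2)[1]
print(f"two-stage: {gamma_two_stage:.3f}, naive regression: {gamma_naive:.3f}")
```

In this setup the naive regression of gene 2 on gene 1 is biased upward by the shared factor, whereas the two-stage estimate uses the cis-eQTL of gene 1 to isolate the genetically driven part of its expression. The sample size here is far larger than in a typical genetical genomics experiment and was chosen only to make the contrast stable.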
Here we construct gene regulatory networks in yeast via building up a large system of structural equations
with the two-stage penalized least squares (2SPLS) method26. We applied the 2SPLS method to an eQTL dataset derived from a cross between a wild yeast vineyard strain and a laboratory strain27. Fitting one linear model
for each gene at each stage, the 2SPLS method develops optimal prediction of a set of conditional expectations
at the first stage, and consistent selection of regulatory effects from massive candidates at the second stage. It is
computationally fast and allows for parallel implementation, outperforming both the adaptive lasso-based algorithm23
and the sparsity-aware maximum likelihood algorithm24 in terms of accuracy and speed when identifying
regulatory effects in different network structures. This parallel implementation makes it feasible to evaluate the
significance of regulatory effects via the bootstrap method. Using this approach we identified subnetworks that
were enriched in gene ontology (...truncated)