Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight
Chen Z, Zhang W (2013) Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight. PLoS
Comput Biol 9(3): e1002956. doi:10.1371/journal.pcbi.1002956
Zheng Chen
Weixiong Zhang
Editor: Frederick P. Roth, Harvard Medical School, United States of America
1 Department of Computer Science and Engineering, Washington University, St. Louis, Missouri, United States of America, 2 Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America
Complex traits such as obesity are manifestations of intricate interactions among multiple genetic factors. However, such relationships are difficult to identify. Thanks to recent advances in high-throughput technology, a large amount of data has been collected for various complex traits, including obesity. These data often measure different biological aspects of the traits of interest, including genotypic variations at the DNA level and gene expression alterations at the RNA level. Integration of such heterogeneous data provides promising opportunities to understand the genetic components and possibly the genetic architecture of complex traits. In this paper, we propose a machine learning-based method, module-guided Random Forests (mgRF), to integrate genotypic and gene expression data to investigate the genetic factors and molecular mechanisms underlying complex traits. mgRF is an augmented Random Forests method enhanced by a network analysis for identifying multiple correlated variables of different types. We applied mgRF to genetic markers and gene expression data from a cohort of female mice from an F2 intercross. mgRF outperformed several existing methods in our extensive comparison. Our new approach performs better when combining genotypic and gene expression data than when using either type of data alone. The resulting predictive variables identified by mgRF provide information on perturbed pathways that are related to body weight. More importantly, the results uncovered intricate interactions among genetic markers and genes that would have been overlooked had only one type of data been examined. Our results shed light on the genetic mechanisms of obesity, and our approach provides a promising framework, complementary to the "genetics of gene expression" analysis, for integrating genotypic and gene expression information to analyze complex traits.
Funding: This work was supported by the National Institutes of Health (R01GM100364, R01GM086512 and RC1AR058681) and the National Science Foundation
(DBI-0743797). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Most complex traits such as obesity involve a diverse set of
genes, intricate interplay among them and subtle interactions
between genetic and environmental factors. One of the first steps
toward a systematic understanding of the genetic basis of a
complex trait is the identification of causal genetic elements, e.g.
genes, genetic markers and/or single nucleotide polymorphisms
(SNPs), whose variations are responsible for the traits. The
objective of this challenging task is two-fold: effectively identifying
a subset of genetic elements out of a large pool of candidates whose
patterns are characteristic of a trait of interest, and accurately
predicting the phenotype with a model that accommodates
interactions among selected genetic elements. Despite recent
advances in high-throughput technologies that have produced an
enormous amount of biological data, heterogeneous data types,
non-linear relationships among genes and complex phenotypes
have made this task difficult.
Although conventional linkage analyses and association studies
as well as the latest genome-wide association studies (GWAS) have
produced a fruitful collection of genomic susceptibility loci for a
variety of complex traits and diseases [1,2], they have mainly
detected genetic elements with marginal effects while failing to
account for epistatic interactions [3,4]; as a result, they have low
power for predicting phenotypes [5]. As an intermediate between
genotype and phenotype, gene expression has been proven to be a
rich and valuable source of information complementary to
genotype information for dissecting complex traits. At one
extreme, using gene expression data alone, classifiers or regressors
have been built to predict disease types or stages with only a small
number of disease-related genes [6–8]. By integrating genetic and
gene expression information, genetics of gene expression-based
approaches [9–11] and network-based approaches [12–14] have
been independently developed and applied to identify genes
related to complex traits. Recently a few machine learning based (...truncated)