Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight

PLoS Computational Biology, Mar 2013

Complex traits such as obesity are manifestations of intricate interactions among multiple genetic factors. However, such relationships are difficult to identify. Thanks to recent advances in high-throughput technology, a large amount of data has been collected for various complex traits, including obesity. These data often measure different biological aspects of the traits of interest, including genotypic variations at the DNA level and gene expression alterations at the RNA level. Integration of such heterogeneous data provides promising opportunities to understand the genetic components and possibly the genetic architecture of complex traits. In this paper, we propose a machine learning based method, module-guided Random Forests (mgRF), to integrate genotypic and gene expression data to investigate the genetic factors and molecular mechanisms underlying complex traits. mgRF is an augmented Random Forests method enhanced by a network analysis for identifying multiple correlated variables of different types. We applied mgRF to genetic markers and gene expression data from an F2 intercross cohort of female mice. mgRF outperformed several existing methods in our extensive comparison. Our new approach performs better when combining genotypic and gene expression data than when using either type of data alone. The resulting predictive variables identified by mgRF provide information about perturbed pathways that are related to body weight. More importantly, the results uncovered intricate interactions among genetic markers and genes that would have been overlooked if only one type of data had been examined. Our results shed light on genetic mechanisms of obesity, and our approach provides a promising complementary framework to the “genetics of gene expression” analysis for integrating genotypic and gene expression information when analyzing complex traits.


Citation: Chen Z, Zhang W (2013) Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight. PLoS Comput Biol 9(3): e1002956. doi:10.1371/journal.pcbi.1002956

Zheng Chen, Weixiong Zhang

Editor: Frederick P. Roth, Harvard Medical School, United States of America

1 Department of Computer Science and Engineering, Washington University, St. Louis, Missouri, United States of America, 2 Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America

Funding: This work was supported by the National Institutes of Health (R01GM100364, R01GM086512 and RC1AR058681) and the National Science Foundation (DBI-0743797). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Most complex traits such as obesity involve a diverse set of genes, intricate interplay among them, and subtle interactions between genetic and environmental factors. One of the first steps toward a systematic understanding of the genetic basis of a complex trait is the identification of causal genetic elements, e.g. genes, genetic markers and/or single nucleotide polymorphisms (SNPs), whose variations are responsible for the trait. The objective of this challenging task is two-fold: effectively identifying, out of a large pool of candidates, a subset of genetic elements whose patterns are characteristic of a trait of interest, and accurately predicting the phenotype with a model that accommodates interactions among the selected genetic elements.
Despite recent advances in high-throughput technologies that have produced an enormous amount of biological data, heterogeneous data types, non-linear relationships among genes, and complex phenotypes have made this task difficult. Although conventional linkage analyses and association studies, as well as the latest genome-wide association studies (GWAS), have produced a fruitful collection of genomic susceptibility loci for a variety of complex traits and diseases [1,2], they have mainly been able to detect genetic elements with marginal effects while failing to account for epistatic interactions [3,4]; as a result, they have low power for predicting phenotypes [5]. As an intermediate between genotype and phenotype, gene expression has proven to be a rich and valuable source of information, complementary to genotype information, for dissecting complex traits. At one extreme, using gene expression data alone, classifiers or regressors have been built to predict disease types or stages with only a small number of disease-related genes [6–8]. By integrating genetic and gene expression information, genetics of gene expression-based approaches [9–11] and network-based approaches [12–14] have been independently developed and applied to identify genes related to complex traits. Recently a few machine learning based (...truncated)
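
The contrast drawn above between marginal effects and epistatic interactions can be made concrete with a small sketch. It uses a hypothetical toy cohort, not data or code from the study: the two binary markers, the group size of 25, and the purely interaction-driven phenotype are all assumptions chosen for illustration. Each marker taken alone is uncorrelated with the phenotype, while their interaction term predicts it perfectly, which is the kind of signal a single-marker scan would miss.

```python
from itertools import product
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Hypothetical toy cohort: each combination of two binary markers, 25 animals each.
genotypes = [pair for pair in product([0, 1], repeat=2) for _ in range(25)]
marker_a = [a for a, _ in genotypes]
marker_b = [b for _, b in genotypes]

# Assumed purely epistatic trait: elevated only when exactly one marker
# carries the variant (an XOR-like interaction).
phenotype = [float(a != b) for a, b in genotypes]

# Single-marker (marginal) association, as a one-locus-at-a-time scan would see it.
marginal_a = pearson(marker_a, phenotype)
marginal_b = pearson(marker_b, phenotype)

# Association of the interaction term with the phenotype.
interaction = [float(a != b) for a, b in genotypes]
joint = pearson(interaction, phenotype)

print("marker A alone:", marginal_a)
print("marker B alone:", marginal_b)
print("A-B interaction:", joint)
```

Real traits are rarely this clean, but the example shows why a model that can represent interactions among variables may recover predictive signal that single-locus tests cannot.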


This is a preview of a remote PDF: http://www.ploscompbiol.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371/journal.pcbi.1002956&representation=PDF

Zheng Chen, Weixiong Zhang. Integrative Analysis Using Module-Guided Random Forests Reveals Correlated Genetic Factors Related to Mouse Weight, PLoS Computational Biology, 2013, 3, DOI: 10.1371/journal.pcbi.1002956