A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies

Sinoquet BMC Bioinformatics (2018) 19:106
https://doi.org/10.1186/s12859-018-2054-0

METHODOLOGY ARTICLE Open Access

Christine Sinoquet

Abstract

Background: Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of complex phenotypes. However, standard single-SNP GWASs suffer from lack of power. In particular, they do not directly account for linkage disequilibrium, that is, the dependences between SNPs (Single Nucleotide Polymorphisms).

Results: We present a comparative study of two multilocus GWAS strategies within the random forest-based framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379, 2014). We designed the other method, an innovative hybrid method combining T-Trees with the modeling of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks with latent variables, following our earlier work (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the two methods on both simulated and real data. For dominant and additive genetic models, in either of the simulated conditions, the hybrid approach always performs slightly better than T-Trees. We assessed predictive power through the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predictive power observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of SNPs’ scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the intersections of the top 100 SNPs output by any two or all three of the methods among T-Trees, the hybrid approach, and the single-SNP method.
Conclusions: Refining T-Trees through finer linkage disequilibrium modeling is shown to be beneficial. The distributions of SNPs’ scores generated by T-Trees and the hybrid approach are shown to be statistically different, which suggests that the methods are complementary. In particular, for 12 of the 14 real datasets, the tail of the distribution of highest SNPs’ scores shows larger values for the hybrid approach. The hybrid approach thus pinpoints more interesting SNPs than T-Trees does; these can be provided as a short list of prioritized SNPs for further analysis by biologists. Finally, among the 211 top-100 SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified 72 and 38 SNPs present in the top 25 and top 10, respectively, of each method.

Keywords: Genome-wide association study, GWAS, Multilocus approach, Random forest-based approach, Linkage disequilibrium modeling, Forest of latent tree models, Bayesian network with latent variables, Hybrid approach, Integration of biological knowledge to GWAS

Correspondence: LS2N, UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex, France

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The etiology of genetic diseases may be elucidated by localizing genes conferring disease susceptibility and by subsequent biological characterization of these genes.
Searching the genome for small DNA variations that occur more frequently in subjects with a particular disease (cases) than in unaffected individuals is the key to association studies. These DNA variations are observed at characterized locations - or loci - of the genome, also called genetic markers. Nowadays, genotyping technologies allow the description of case and control cohorts (a few thousand to ten thousand individuals) on the genome scale (hundreds of thousands to a few million genetic markers such as Single Nucleotide Polymorphisms (SNPs)). The search for associations (i.e. statistical dependences) between one or several of the markers and the disease is called an association study. Genome-wide association studies (GWASs) are also expected to help identify DNA variations that affect a subject’s response to drugs or influence interactions between genotype and environment in a way that may contribute to the onset of a given disease. Thus, improvement in the prediction of diseases, patient care and achievement of personalized medicine are three major aims of GWASs applied to biomedical research.

Exploiting the existence of statistical dependences between neighboring SNPs is the key to association studies [1, 2]. Statistical dependences within genetic data define linkage disequilibrium (LD). To perform GWASs, geneticists rely on a set of genetic markers, say SNPs, that cover the whole genome and are observed for every genotyped individual of a studied population. However, it is highly unlikely that a causal variant (i.e. a genetic factor) coincides with a SNP. Nevertheless, due to LD, a statistical dependence is expected between any SNP that flanks the unobserved genetic factor and the latter. On the other hand, by definition, a statistical dependence exists between the genetic factor responsible for the disease and this disease. Thus, a statistical dependence is also expected between the flanking SNP and the studied disease.
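The indirect-association argument above can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the article: the allele frequency (0.3), the probability that the flanking allele copies the causal allele (0.9), the additive disease model, the cohort size, and the function names `r_squared`, `chi_square_stat` and `simulate` are all hypothetical choices made for illustration. LD is summarized here by the squared correlation between genotype vectors, and association by a Pearson chi-square statistic on the status-by-genotype table.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

def chi_square_stat(genotypes, status):
    """Pearson chi-square statistic of the 2 x 3 status-by-genotype table."""
    n = len(genotypes)
    counts = [[0, 0, 0], [0, 0, 0]]  # rows: control/case, columns: genotype
    for g, s in zip(genotypes, status):
        counts[s][g] += 1
    row = [sum(r) for r in counts]
    col = [counts[0][g] + counts[1][g] for g in range(3)]
    stat = 0.0
    for s in range(2):
        for g in range(3):
            expected = row[s] * col[g] / n
            if expected > 0:  # skip empty genotype classes
                stat += (counts[s][g] - expected) ** 2 / expected
    return stat

def simulate(n=4000, seed=1):
    """Hypothetical cohort: an unobserved causal variant, a flanking SNP in LD
    with it, and an unlinked SNP. All numeric settings are illustrative."""
    rng = random.Random(seed)
    causal, flank, unlinked, status = [], [], [], []
    for _ in range(n):
        g_c = g_f = g_u = 0
        for _ in range(2):  # two haplotypes per individual
            c = rng.random() < 0.3
            # the flanking allele copies the causal allele 90% of the time
            f = c if rng.random() < 0.9 else (rng.random() < 0.3)
            u = rng.random() < 0.3  # drawn independently of c
            g_c, g_f, g_u = g_c + c, g_f + f, g_u + u
        causal.append(g_c)
        flank.append(g_f)
        unlinked.append(g_u)
        # additive model (assumed): disease risk grows with the causal genotype
        status.append(int(rng.random() < 0.2 + 0.25 * g_c))
    return causal, flank, unlinked, status

causal, flank, unlinked, status = simulate()
print("r2(flanking, causal):", round(r_squared(flank, causal), 3))
print("chi2(flanking SNP)  :", round(chi_square_stat(flank, status), 1))
print("chi2(unlinked SNP)  :", round(chi_square_stat(unlinked, status), 1))
```

In this sketch the causal genotype is used only to generate the disease status. The flanking SNP is expected to show a large statistic purely because of its LD with the causal variant, while the unlinked SNP should not.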
A standard single-SNP GWAS considers each SNP on its own and tests it for association with the disease. GWASs considering binary affected/unaffected phenotypes rely on standard contingency table tests (chi-square test, likelihood ratio test, Fisher’s exact test). Linear regression is broadly used for quantitative phenotypes. The lack of statistical power is one of the limitations of single-SNP GWASs. Thus, multilocus strategies were designed to enhance the identification of a region on the genome where a genetic factor might be present. In the scope of this article, a “multilocus” strategy has to be distinguished from strategies aiming at epistasis detection. Epistatic interactions exist within a given set of SNPs when a dependence is observed between this combination of SNPs and the studied phenotype, whereas no marginal dependence may be evidenced between the phenotype and any SNP within this combination. Underlying epistasis is the concept of biological interactions between loci acting in concert as an organic group. In this article, a multilocus GWAS approach aims at focusing on int (...truncated)