A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies
Sinoquet BMC Bioinformatics (2018) 19:106
https://doi.org/10.1186/s12859-018-2054-0
METHODOLOGY ARTICLE
Open Access
Christine Sinoquet
Abstract
Background: Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of
complex phenotypes. However, standard single-SNP GWASs suffer from a lack of power. In particular, they do not
directly account for linkage disequilibrium, that is, the dependences between SNPs (Single Nucleotide Polymorphisms).
Results: We present the comparative study of two multilocus GWAS strategies, in the random forest-based
framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379,
2014). We designed the other method, which is an innovative hybrid method combining T-Trees with the modeling
of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks
with latent variables, following our former works (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the
two methods on both simulated and real data. For dominant and additive genetic models, in either of the conditions
simulated, the hybrid approach always performs slightly better than T-Trees. We assessed predictive power through
the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predictive power
observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of
SNPs’ scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the
intersections of the top 100 SNPs output by any two or all three methods among T-Trees, the hybrid approach, and the
single-SNP method.
Conclusions: The sophistication of T-Trees through finer linkage disequilibrium modeling is shown to be beneficial. The
distributions of SNPs’ scores generated by T-Trees and the hybrid approach are shown to be statistically different, which
suggests complementarity of the methods. In particular, for 12 of the 14 real datasets, the distribution tail of highest
SNPs’ scores shows larger values for the hybrid approach. Thus, more interesting SNPs are pinpointed than by T-Trees,
to be provided as a short list of prioritized SNPs for further analysis by biologists. Finally, among the 211 top 100
SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified
72 and 38 SNPs present in the top 25 and top 10, respectively, of each method.
Keywords: Genome-wide association study, GWAS, Multilocus approach, Random forest-based approach, Linkage
disequilibrium modeling, Forest of latent tree models, Bayesian network with latent variables, Hybrid approach,
Integration of biological knowledge to GWAS
Correspondence:
LS2N, UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, BP
92208, 44322 Nantes Cedex, France
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
The etiology of genetic diseases may be elucidated by
localizing genes conferring disease susceptibility and by
subsequent biological characterization of these genes.
Searching the genome for small DNA variations that
occur more frequently in subjects with a particular disease
(cases) than in unaffected individuals is the key to association studies. These DNA variations are observed at
characterized locations - or loci - of the genome, also
called genetic markers. Nowadays, genotyping technologies allow the description of case and control cohorts (a
few thousand to ten thousand individuals) on the genome
scale (hundreds of thousands to a few million genetic markers such as Single Nucleotide Polymorphisms (SNPs)).
The search for associations (i.e. statistical dependences)
between one or several of the markers and the disease
is called an association study. Genome-wide association
studies (GWASs) are also expected to help identify DNA
variations that affect a subject’s response to drugs or influence interactions between genotype and environment in
a way that may contribute to the onset of a given disease. Thus, improvement in the prediction of diseases,
patient care and achievement of personalized medicine
are three major aims of GWASs applied to biomedical
research.
Exploiting the existence of statistical dependences
between neighboring SNPs is the key to association studies [1, 2]. Statistical dependences within genetic data
define linkage disequilibrium (LD). To perform GWASs,
geneticists rely on a set of genetic markers, say SNPs,
that cover the whole genome and are observed for any
genotyped individual of a studied population. However,
it is highly unlikely that a causal variant (i.e. a genetic
factor) coincides with a SNP. Nevertheless, due to LD, a
statistical dependence is expected between any SNP that
flanks the unobserved genetic factor and the latter. On
the other hand, by definition, a statistical dependence
exists between the genetic factor responsible for the disease and this disease. Thus, a statistical dependence is
also expected between the flanking SNP and the studied
disease.
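As a rough illustration of how the dependence between two neighboring SNPs might be quantified, the sketch below computes the squared Pearson correlation between two genotype vectors coded as minor-allele counts (0, 1, 2). This genotype-based r² is a commonly used proxy for LD and is an assumption made here for illustration, not a measure taken from this article. The function name `genotype_r2` and the toy genotypes are hypothetical.

```python
def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two genotype vectors.

    Each vector holds one value per individual, coded as the number of
    minor alleles (0, 1 or 2). This is an illustrative proxy for LD,
    not the LD model used in the article.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("genotype vectors must be non-empty and of equal length")

    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n

    # Unnormalized covariance and variances (the 1/n factors cancel in r^2).
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)

    if var_a == 0 or var_b == 0:
        # A monomorphic SNP carries no information about its neighbor.
        return 0.0
    return (cov * cov) / (var_a * var_b)


# Toy genotypes for nine individuals (invented data).
flanking_snp = [0, 0, 0, 1, 1, 1, 2, 2, 2]
tightly_linked_snp = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # identical pattern
unlinked_snp = [0, 1, 2, 0, 1, 2, 0, 1, 2]         # balanced against flanking_snp

print("linked   r2 =", genotype_r2(flanking_snp, tightly_linked_snp))
print("unlinked r2 =", genotype_r2(flanking_snp, unlinked_snp))
```

A value close to 1 indicates that one SNP is a good proxy for the other, which is the situation the paragraph above relies on: a flanking SNP in strong LD with an unobserved causal variant inherits part of its association with the disease.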
A standard single-SNP GWAS considers each SNP on its
own and tests it for association with the disease. GWASs
considering binary affected/unaffected phenotypes rely
on standard contingency table tests (chi-square test, likelihood ratio test, Fisher’s exact test). Linear regression is
broadly used for quantitative phenotypes.
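To make the single-SNP test concrete, here is a minimal sketch of a Pearson chi-square test on a 2x3 contingency table (cases/controls by the three genotypes of one SNP). It assumes that all three genotypes are observed, so the test has 2 degrees of freedom, in which case the chi-square survival function reduces to exp(-x/2). The function name `chi_square_2x3` and the counts are hypothetical and not taken from the article.

```python
from math import exp


def chi_square_2x3(case_counts, control_counts):
    """Pearson chi-square test for one SNP.

    case_counts / control_counts: genotype counts (AA, Aa, aa).
    Returns (statistic, p_value). Assumes every genotype is observed at
    least once, so that the test has exactly 2 degrees of freedom.
    """
    table = [list(case_counts), list(control_counts)]
    if any(len(row) != 3 for row in table):
        raise ValueError("each row must hold three genotype counts")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("empty row or genotype column: df would not be 2")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of genotype and status.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # For 2 degrees of freedom, P(X >= x) = exp(-x / 2).
    p_value = exp(-statistic / 2.0)
    return statistic, p_value


# A single-SNP scan applies the same test to every SNP in turn
# (invented counts for two hypothetical SNPs).
snp_tables = {
    "snp_null": ((50, 30, 20), (50, 30, 20)),
    "snp_associated": ((10, 20, 70), (70, 20, 10)),
}
for name, (cases, controls) in snp_tables.items():
    stat, p = chi_square_2x3(cases, controls)
    print(name, "chi2 =", round(stat, 3), "p =", p)
```

In a genome-wide setting the resulting p-values would additionally need a multiple-testing correction, since the test is repeated once per SNP.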
The lack of statistical power is one of the limitations
of single-SNP GWASs. Thus, multilocus strategies were
designed to enhance the identification of a region on
the genome where a genetic factor might be present.
In the scope of this article, a “multilocus” strategy has
to be distinguished from strategies aiming at epistasis
detection. Epistatic interactions exist within a given set
of SNPs when a dependence is observed between this
combination of SNPs and the studied phenotype, whereas
no marginal dependence may be evidenced between
the phenotype and any SNP within this combination.
Underlying epistasis is the concept of biological interactions between loci acting in concert as an organic
group. In this article, a multilocus GWAS approach
aims at focusing on int (...truncated)