A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies
Mourad et al. BMC Bioinformatics 2011, 12:16
http://www.biomedcentral.com/1471-2105/12/16
RESEARCH ARTICLE
Open Access
Raphaël Mourad1*, Christine Sinoquet2*, Philippe Leray1
Abstract
Background: Discovering the genetic basis of common genetic diseases in the human genome represents a
public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its
complexity make the statistical analysis a challenging task.
Results: We present an accurate modeling of dependences between genetic markers, based on a forest of
hierarchical latent class models which is a particular class of probabilistic graphical models. This model offers an
adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data
dimensionality can be reduced through the latent variables of the model which synthesize the information borne
by genetic markers. In order to tackle the learning of both forest structure and probability distributions, a generic
algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on
benchmarks describing 10^5 variables for 2000 individuals.
Conclusions: The forest of hierarchical latent class models offers several advantages for genome-wide association
studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning
borne by latent variables.
* Corresponding authors
1 LINA, UMR CNRS 6241, Ecole Polytechnique de l’Université de Nantes, rue Christian Pauc, BP 50609, 44306 Nantes Cedex 3, France
2 LINA, UMR CNRS 6241, Université de Nantes, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex, France
Full list of author information is available at the end of the article

Background
Genetic markers such as SNPs are the key to dissecting
the genetic susceptibility of common complex diseases,
such as asthma, diabetes, atherosclerosis and some cancers [1].
The aim is to identify combinations of
genetic determinants which should accumulate among
affected subjects. Generally, in such combinations, each
genetic variant exerts only a modest impact on the
observed phenotype, whereas the interaction between
genetic variants and, possibly, environmental factors is
decisive. Decreasing genotyping costs
now enable the generation of hundreds of thousands of
SNPs, spanning the whole human genome, across
cohorts of cases and controls. This scaling up to
genome-wide association studies (GWASs) makes the
analysis of high-dimensional data a hot topic [2]. Despite
recent technological advances and extensive research
effort, the genetic basis of the aforementioned diseases
remains to a large extent unknown. Yet, the search for
associations between single SNPs and the variable
describing case/control status requires carrying out a
large number of statistical tests. Since SNP patterns,
rather than single SNPs, are likely to be determinant for
complex diseases, a high rate of false positives as well as
a perceptible decrease in statistical power, not to mention
intractability, are severe issues to be overcome.
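
To make the multiple-testing burden described above concrete, the following minimal sketch computes the expected number of false positives when each SNP is tested separately, a Bonferroni-corrected per-test threshold, and the number of SNP pairs that would have to be examined if patterns rather than single SNPs were tested. The Bonferroni correction, the significance level of 0.05 and the figure of 500,000 SNPs are illustrative assumptions, not choices taken from the article.

```python
from math import comb


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false positives if every null hypothesis is true."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def n_snp_patterns(n_snps: int, pattern_size: int) -> int:
    """Number of distinct SNP subsets of the given size."""
    return comb(n_snps, pattern_size)


# Hypothetical study size and significance level (assumptions for illustration).
N_SNPS = 500_000
ALPHA = 0.05

print("expected false positives:", expected_false_positives(ALPHA, N_SNPS))
print("Bonferroni per-test threshold:", bonferroni_threshold(ALPHA, N_SNPS))
print("number of SNP pairs:", n_snp_patterns(N_SNPS, 2))
```

The corrected threshold shrinks linearly with the number of tests, which is one way to see the loss of statistical power, while the number of candidate patterns grows combinatorially with pattern size, which is the tractability issue.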
The simplest type of genetic polymorphism, the single
nucleotide polymorphism (SNP), involves a change in only one
nucleotide, which occurred generations ago
within the DNA sequence. To give a sense of scale, we emphasize
that one single individual can be uniquely defined by
only 30 to 80 independent SNPs and unrelated individuals differ in about 0.1% of their 3.1 billion nucleotides
[3]. Compared with other kinds of DNA markers, SNPs
are appealing because they are abundant, genetically
stable and amenable to high-throughput automated analysis. Consistently, advances in high-throughput SNP
genotyping technologies lead the way to various downstream analyses, including GWASs.

© 2011 Mourad et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.

Table 1 Comparison of running times, dimension reduction rates and entropy compression rates between CFHLC and
other algorithms, for Daly et al.’s dataset: Daly et al.’s method [29], Gerbil [25], HaploBlock [13] and Zhang et al.’s
algorithm [16]

Algorithm                  Running time   Dimension reduction rates   Entropy compression rates
Daly et al.’s method       -              0.107                       0.313
Gerbil                     40 s           0.107                       0.300
HaploBlock                 158 min        0.066                       0.241
Zhang et al.’s algorithm   168 s          0.078                       0.229
CFHLC                      84 s           0.146                       0.231

We ran the last three programs on a standard computer. As we had no access to Daly et al.’s software, we could only compare the dimension reduction rates
and entropy compression rates calculated from their results with the dimension reduction rates and entropy compression rates obtained with the other methods.

Exploiting the existence of statistical dependences
between neighboring SNPs, also called linkage disequilibrium (LD), is the key to successful association studies
[4]. Indeed, a causal variant (i.e. a genetic factor) may
not be a SNP. For instance, insertions, deletions, inversions and copy-number polymorphisms may be causative of disease susceptibility. Nevertheless, a well-designed study will have a good chance of including one
or more SNPs that are in strong LD with a common
causal variant. In the latter case, indirect association
with the phenotype, say affected/unaffected status, will
be revealed (see Additional file 1).
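
As an illustration of what "strong LD" between a genotyped SNP and an untyped causal variant means, the sketch below computes the squared correlation r^2 between two biallelic loci. It assumes phased haplotypes coded as 0/1 alleles, and r^2 is only one common LD measure; the text above does not commit to a particular one. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation r^2 between two biallelic loci.

    Both arguments list the allele (0 or 1) carried at the locus by each
    phased haplotype, in the same haplotype order (an assumption of this sketch).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("need two non-empty haplotype vectors of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n  # frequency of allele 1 at locus B
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Toy data: an untyped causal variant, a SNP that tags it perfectly,
# and a SNP that is statistically independent of it.
causal = [1, 1, 0, 0, 1, 1, 0, 0]
tag_snp = [1, 1, 0, 0, 1, 1, 0, 0]
unlinked_snp = [1, 0, 1, 0, 1, 0, 1, 0]

print("r^2(causal, tag):", ld_r2(causal, tag_snp))
print("r^2(causal, unlinked):", ld_r2(causal, unlinked_snp))
```

A SNP with r^2 close to 1 carries nearly the same information as the causal variant, so testing it recovers the association indirectly; a SNP with r^2 close to 0 does not.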
Interestingly, LD also offers solutions to reduce data
dimensionality in GWASs. In the human genome, LD is
highly structured into the so-called “haplotype block
structure” [5]: regions (called blocks) where statistical dependences
between contiguous markers are high
alternate with shorter regions characterized by low statistical dependences (see Additional file 2). The most
likely explanation of this phenomenon is related to the
presence of large regions with low recombination rates
separated by recombination hotspots (i.e. small specific
regions with high recombination rates) [6]. Relying on
this feature, various approaches were proposed to
achieve data dimensionality reduction: testing association with haplotypes (i.e. inferred data underlying genotypic data) [7], partitioning the genome according to
spatial correlation [8], selecting SNPs informative about
their context, or SNP tags [9] (for more references, see
[10] for example). Recent methods, such as HaploBuild
[11], have made it possible to construct more biologically relevant haplotypes where the “haplotype cluster structure”,
instead of the “haplotype block structure”, is assumed:
haplotypes are not constrained by (...truncated)