Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests
Leray P (2011) Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests. PLoS
ONE 6(12): e27320. doi:10.1371/journal.pone.0027320
Raphaël Mourad
Christine Sinoquet
Christian Dina
Philippe Leray
Konrad Scheffler, University of Stellenbosch, South Africa
1 LINA, UMR CNRS 6241, École Polytechnique de l'Université de Nantes, BP 50609, Nantes, France, 2 LINA, UMR CNRS 6241, Université de Nantes, BP 92208, Nantes, France, 3 Institut du Thorax, UMR INSERM 915, BP 70721, Nantes, France
The study of linkage disequilibrium is a major topic in statistical genetics, as it plays a fundamental role in gene mapping and helps us to learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide the geneticist with a compact view of both pairwise and multilocus linkage disequilibrium spatial structures. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.
Linkage disequilibrium (LD) refers to non-random associations
of alleles at two or more loci, over the human genome [1,2]. LD is
usually present at short range, i.e. for distances less than 10 kb [3].
Nevertheless, long-range LD (i.e. LD with distances greater than
100 kb) [3], and LD between different chromosomes [4], are also
observed. Analyzing the extent and distribution of LD represents a
major topic in statistical genetics. For instance, LD plays a
fundamental role in gene mapping: the observation of a large
number of genetic markers over a chromosomal region ensures a
precise localization of (non-observed) causal mutations. Based on
this property, genome-wide association studies (GWASs) [5,6] aim
to systematically localize causal loci over the genome using
hundreds of thousands of single nucleotide polymorphisms (SNPs),
an abundant and useful class of genetic markers. Besides gene
mapping, LD pattern analysis offers deep insights into the
understanding of human population history. Bottlenecks, natural
selection and migrations are examples of evolutionary events
which can be inferred using coalescent models [7].
At the interface between computer science and artificial
intelligence, data mining (DM) is the process of extracting patterns
from data [8]. DM helps formulate hypotheses worth testing and is
complementary to more conventional statistics. Data visualization,
a branch of DM, aims at providing efficient and intuitive tools to
represent and summarize relevant information underlying data
[9]. Data visualization has been successfully applied to
bioinformatics [10].
The international HapMap project [3], and more recently the
international 1000 Genomes project [11], have made considerable
efforts to deeply characterize the genome sequence variation in
human populations. In this context, the application of visualization
methods in the analysis of LD patterns has been shown to be
essential, most notably to reveal the complex so-called LD block
structure [12]. The simplest but also the most popular method is
the triangular heat map (THM) as implemented in Haploview
[13]. The THM is the triangular matrix of pairwise dependences
between genetic markers, in which the color shading indicates the
LD strength in each matrix cell. The THM generally displays the
Lewontin's D' or the squared correlation coefficient r². Another
dependence measure, the ratio of D' to the logarithm of odds
(denoted LOD), is used as a standard by Haploview. In the THM,
LD blocks are visually apparent. Nevertheless, the THM has the
drawback of displaying only pairwise dependences, thus providing a
restricted view of multilocus patterns. Another popular approach
consists in plotting the fine-scale map of recombination rates
computed along the chromosomal sequence. For this purpose,
PHASE [14], a coalescent-based method, can be used to estimate
recombination rates between adjacent SNPs in the sequence. This
approach helps find recombination hotspots and provides insight
into the underlying block structure of LD, but is computationally
expensive. More advanced techniques, such as those providing
isometric blocks and bifurcation plots [15], or textile plots [16],
can deal with multilocus LD. For instance, the algorithm used to
draw a textile plot is closely related to principal component
analysis. The textile plot strategy consists in assigning the optimal
geometrical configuration to variables and data points in a
low-dimensional linear space.
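As an illustration of the pairwise measures that a THM displays, the following minimal sketch computes D, |D'| and r² for two biallelic loci using their standard textbook definitions. The function name pairwise_ld and its arguments are hypothetical, and the sketch assumes that haplotype and allele frequencies are already known (i.e. phased data); it is not the procedure used by Haploview.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab     : frequency of the haplotype carrying allele A (locus 1)
               together with allele B (locus 2)
    p_a, p_b : frequencies of allele A and of allele B
    All names are illustrative; frequencies are assumed to be known.
    """
    # Raw disequilibrium: deviation from independence of the two alleles.
    d = p_ab - p_a * p_b

    # Largest magnitude D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    # Normalized measure, reported here as an absolute value.
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # Squared correlation coefficient between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies: allele A only ever occurs with allele B.
    d, d_prime, r2 = pairwise_ld(p_ab=0.3, p_a=0.3, p_b=0.6)
    print(f"D = {d:.3f}, |D'| = {d_prime:.3f}, r2 = {r2:.3f}")
```

In the hypothetical example, |D'| reaches its maximum while r² stays well below 1, which is one reason the two measures can yield different heat maps for the same region. Either way, each cell summarizes a single pair of loci.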
At the interface of graph and probability theories, probabilistic
graphical models (PGMs) represent a powerful formalism to
uncover complex networks of interactions. Thanks to their ability
to capture (conditional) independences and dependences between
variables, PGMs offer an accurate modeling of relationships
between variables in an uncertain framework [17]. A PGM is a
probabilistic model that relies on a graph representing conditional
independences within a set of random variables. Essentially, this
model provides a compact and natural representation of the joint
probability distribution of the variable set. PGMs have been
successfully applied to LD modeling, in particular for haplotype
inference and association genetics [18–21]. Recently, Mourad et al.
introduced forests of hierarchical latent class models (FHLCMs) to
model genome-wide LD, together with a scalable algorithm,
named CFHLC (Construction of Forests of Hierarchical Latent
Class models), able to cope with 10⁵ variables and 2000
individuals [22,23]. FHLCMs will be described in detail in the next
section.
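To make the idea of a graph-based factorization of the joint distribution concrete, the sketch below builds a single latent class model: one unobserved variable H with three observed children. This is only a simplified illustration, not the FHLCM or the CFHLC algorithm. The variables are taken to be binary for brevity, and all probabilities are hypothetical values chosen for the example.

```python
from itertools import product

# Hypothetical parameters, chosen for illustration only.
p_h = [0.6, 0.4]          # P(H = h) for a binary latent variable H
p_x_given_h = [           # p_x_given_h[i][h] = P(X_i = 1 | H = h)
    [0.9, 0.2],
    [0.8, 0.1],
    [0.7, 0.3],
]


def joint(x):
    """P(X_1, ..., X_n = x), obtained by summing the latent variable out.

    The graph (H -> X_i for every i) states that the X_i are independent
    given H, so the joint factorizes as sum_h P(h) * prod_i P(x_i | h).
    """
    total = 0.0
    for h, ph in enumerate(p_h):
        term = ph
        for i, xi in enumerate(x):
            p1 = p_x_given_h[i][h]
            term *= p1 if xi == 1 else 1.0 - p1
        total += term
    return total


def marginal(fixed):
    """P(X_i = v for every (i, v) in fixed), summing out the other variables."""
    n = len(p_x_given_h)
    return sum(
        joint(x)
        for x in product((0, 1), repeat=n)
        if all(x[i] == v for i, v in fixed.items())
    )


if __name__ == "__main__":
    both = marginal({0: 1, 1: 1})
    independent = marginal({0: 1}) * marginal({1: 1})
    print(f"P(X1=1, X2=1) = {both:.4f}; product of marginals = {independent:.4f}")
```

With these hypothetical values, P(X1=1, X2=1) is 0.44 whereas the product of the two marginals is 0.62 × 0.52 ≈ 0.32: the observed variables are dependent once H is summed out, even though they are independent given H. This is how a latent variable can stand for the dependence shared by a group of markers.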
In this paper, we describe another attractive use of
FHLCMs (besides LD modeling): as LD visualization tools. We
advocate their use for: (i) short-range, (ii) long-range and (iii)
chromosome-wide LD visualization. Most notably, these models
provide a compact and interpretable view of LD for the geneticist,
thanks to their hierarchical graphical nature and their latent
vari (...truncated)