Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests

Mourad R, Sinoquet C, Dina C, Leray P (2011) Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests. PLoS ONE 6(12): e27320. doi:10.1371/journal.pone.0027320

Raphaël Mourad, Christine Sinoquet, Christian Dina, Philippe Leray

1 LINA, UMR CNRS 6241, Ecole Polytechnique de l'Université de Nantes, BP 50609, Nantes, France
2 LINA, UMR CNRS 6241, Université de Nantes, BP 92208, Nantes, France
3 Institut du Thorax, UMR INSERM 915, BP 70721, Nantes, France

Editor: Konrad Scheffler, University of Stellenbosch, South Africa

The study of linkage disequilibrium is a major topic in statistical genetics: it plays a fundamental role in gene mapping and helps us learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to describe them fully. Probabilistic graphical models are widely recognized as a powerful formalism allowing concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide the geneticist with a compact view of both pairwise and multilocus linkage disequilibrium spatial structures. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium within the clusters of the hierarchy. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms.
The proposed algorithm is fast and does not require phased genotypic data.

Linkage disequilibrium (LD) refers to non-random associations of alleles at two or more loci over the human genome [1,2]. LD is usually present at short range, i.e. for distances less than 10 kb [3]. Nevertheless, long-range LD (i.e. LD over distances greater than 100 kb) [3] and LD between different chromosomes [4] are also observed. Analyzing the extent and distribution of LD is a major topic in statistical genetics. For instance, LD plays a fundamental role in gene mapping: observing a large number of genetic markers over a chromosomal region allows a precise localization of (unobserved) causal mutations. Based on this property, genome-wide association studies (GWASs) [5,6] aim to systematically localize causal loci over the genome using hundreds of thousands of single nucleotide polymorphisms (SNPs), an abundant and useful class of genetic markers. Besides gene mapping, LD pattern analysis offers deep insights into human population history. Bottlenecks, natural selection and migrations are examples of evolutionary events that can be inferred using coalescent models [7].

At the interface between computer science and artificial intelligence, data mining (DM) is the process of extracting patterns from data [8]. DM helps formulate hypotheses worth testing and is complementary to more conventional statistics. Data visualization, a branch of DM, aims at providing efficient and intuitive tools to represent and summarize the relevant information underlying data [9]. Data visualization has been successfully applied to bioinformatics [10]. The international HapMap project [3], and more recently the international 1000 Genomes project [11], have made considerable efforts to characterize in depth the genome sequence variation in human populations.
In this context, the application of visualization methods to the analysis of LD patterns has been shown to be essential, most notably to reveal the complex so-called LD block structure [12]. The simplest but also the most popular method is the triangular heat map (THM) as implemented in Haploview [13]. The THM is the triangular matrix of pairwise dependences between genetic markers, in which the color shading indicates the LD strength in each matrix cell. The THM generally displays Lewontin's D' or the squared correlation coefficient r^2. Another dependence measure, the ratio of D' to the logarithm of odds (noted D'/LOD), is used as a standard by Haploview. In the THM, LD blocks are visually apparent. Nevertheless, the THM has the drawback of displaying only pairwise dependences, thus providing a restricted view of multilocus patterns.

Another popular approach consists in plotting the fine-scale map of recombination rates computed along the chromosomal sequence. For this purpose, PHASE [14], a coalescent-based method, can be used to estimate recombination rates between adjacent SNPs in the sequence. This approach helps find recombination hotspots and provides insight into the underlying block structure of LD, but it is computationally expensive.

More advanced techniques, such as those providing isometric blocks and bifurcation plots [15], or textile plots [16], can deal with multilocus LD. For instance, the algorithm used to draw a textile plot is closely related to principal component analysis. The textile plot strategy consists in assigning the optimal geometrical configuration to variables and data points in a low-dimensional linear space.

At the interface of graph and probability theories, probabilistic graphical models (PGMs) represent a powerful formalism to uncover complex networks of interactions.
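To make the pairwise measures shown in a THM concrete, here is a minimal sketch of how D, Lewontin's D' and r^2 could be computed for every pair of SNPs. It assumes phased haplotypes coded 0/1 (an assumption made only for this illustration; the method proposed in this paper does not require phased data), and the function name `pairwise_ld` is hypothetical. It is not Haploview's implementation.

```python
import numpy as np

def pairwise_ld(haplotypes):
    """Return (D, D_prime, r2) matrices for a 0/1 haplotype matrix.

    haplotypes: array-like of shape (n_haplotypes, n_snps), entries 0 or 1.
    Pairs involving a monomorphic SNP get NaN for D' and r^2.
    """
    h = np.asarray(haplotypes, dtype=float)
    n = h.shape[0]

    p = h.mean(axis=0)           # frequency of allele 1 at each SNP
    q = 1.0 - p                  # frequency of allele 0 at each SNP
    p_ab = (h.T @ h) / n         # frequency of haplotypes carrying allele 1 at both SNPs
    d = p_ab - np.outer(p, p)    # D = p_AB - p_A * p_B

    # Largest |D| attainable given the allele frequencies; depends on the sign of D.
    d_max_pos = np.minimum(np.outer(p, q), np.outer(q, p))
    d_max_neg = np.minimum(np.outer(p, p), np.outer(q, q))
    d_max = np.where(d >= 0, d_max_pos, d_max_neg)

    var = p * q                  # p_A * (1 - p_A) at each SNP
    denom = np.outer(var, var)

    with np.errstate(divide="ignore", invalid="ignore"):
        d_prime = np.where(d_max > 0, np.abs(d) / d_max, np.nan)
        r2 = np.where(denom > 0, d ** 2 / denom, np.nan)
    return d, d_prime, r2

# Toy data: SNP 0 and SNP 1 are identical, SNP 2 is independent of both.
example = [[0, 0, 0],
           [0, 0, 1],
           [1, 1, 0],
           [1, 1, 1]]
d, d_prime, r2 = pairwise_ld(example)

# The THM shows only one triangle of the symmetric matrix.
rows, cols = np.triu_indices(r2.shape[0], k=1)
thm_cells = r2[rows, cols]
```

Each entry of `thm_cells` corresponds to one colored cell of a triangular heat map. Because every cell involves exactly two SNPs, the matrix cannot express dependences that only appear when three or more loci are considered jointly.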
Thanks to their ability to capture (conditional) independences and dependences between variables, PGMs offer an accurate modeling of relationships between variables in an uncertain framework [17]. A PGM is a probabilistic model that relies on a graph representing conditional independences within a set of random variables. Essentially, this model provides a compact and natural representation of the joint probability distribution of the variable set. PGMs have been successfully applied to LD modeling, in particular for haplotype inference and association genetics [18–21]. Recently, Mourad et al. introduced forests of hierarchical latent class models (FHLCMs) to model genome-wide LD, together with a scalable algorithm, named CFHLC (Construction of Forests of Hierarchical Latent Class models), able to cope with 10^5 variables and 2000 individuals [22,23]. FHLCMs will be described in detail in the next section.

In this paper, we describe another attractive property of FHLCMs (besides LD modeling): their use as LD visualization tools. We advocate their use for: (i) short-range, (ii) long-range and (iii) chromosome-wide LD visualization. Most notably, these models provide a compact and interpretable view of LD for the geneticist, thanks to their hierarchical graphical nature and their latent vari (...truncated)