Modelling and visualizing fine-scale linkage disequilibrium structure
Edwards BMC Bioinformatics
Department of Molecular Biology and Genetics, Centre for Quantitative Genetics and Genomics, Blichers Alle 20, Tjele 8830, Denmark
Background: Detailed study of genetic variation at the population level in humans and other species is now possible due to the availability of large sets of single nucleotide polymorphism data. Alleles at two or more loci are said to be in linkage disequilibrium (LD) when they are correlated or statistically dependent. Current efforts to understand the genetic basis of complex phenotypes are based on the existence of such associations, making study of the extent and distribution of linkage disequilibrium central to this endeavour. The objective of this paper is to develop methods to study fine-scale patterns of allelic association using probabilistic graphical models.
Results: An efficient, linear-time forward-backward algorithm is developed to estimate chromosome-wide LD models by optimizing a penalized likelihood criterion, and a convenient way to display these models is described. To illustrate the methods they are applied to data obtained by genotyping 8341 pigs. It is found that roughly 20% of the porcine genome exhibits complex LD patterns, forming islands of relatively high genetic diversity.
Conclusions: The proposed algorithm is efficient and makes it feasible to estimate and visualize chromosome-wide LD models on a routine basis.
Background
Alleles at two loci are said to be in linkage disequilibrium
(LD) when they are correlated or statistically dependent.
The term refers to the idea that in a large
homogeneous population subject to random mating,
recombination between two loci will cause any initial association
between them to vanish over time. In observed data,
however, non-zero allelic associations are pervasive,
particularly at short distances, but also at long distances and
even between chromosomes. These associations arise from
a complex interplay between processes such as mutation,
selection, genetic drift and population admixture, and are
broken down by recombination. The patterns of
association are of interest, partly because they underpin the
relation of genotype to phenotype at the population level,
and partly because they reflect population history.
Patterns of LD may be represented in different ways. A
common method is to display pairwise measures of LD
as triangular heatmaps [1,2]: in these displays, LD blocks
(genomic intervals within which all loci are in high LD)
stand out clearly. Early work in the HapMap project led
researchers to hypothesize that the human genome
consists of a series of disjoint blocks, within which there is
high LD, low haplotype diversity and little recombination,
and that are punctuated by short regions with high
recombination (recombination hotspots) [3-6]. Subsequently
various authors [7,8] reported that genetic variation
follows more complex patterns, for which richer models are
required.
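As an illustration of the pairwise measures shown in such heatmaps, the following minimal Python sketch computes the squared allelic correlation r^2 for every pair of loci from phased haplotypes. The allele coding (1 and 2), the function names and the choice of r^2 as the measure are assumptions made for this example; they are not taken from the methods described in this paper.

```python
from itertools import combinations


def r_squared(haplotypes, i, j):
    """Squared allelic correlation r^2 between biallelic loci i and j.

    haplotypes: list of tuples with alleles coded 1 and 2 (assumed coding).
    Returns None when either locus is monomorphic, since r^2 is then undefined.
    """
    n = len(haplotypes)
    p_i = sum(h[i] == 2 for h in haplotypes) / n   # frequency of allele 2 at locus i
    p_j = sum(h[j] == 2 for h in haplotypes) / n   # frequency of allele 2 at locus j
    p_ij = sum(h[i] == 2 and h[j] == 2 for h in haplotypes) / n
    d = p_ij - p_i * p_j                           # disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return None
    return d * d / denom


def pairwise_ld(haplotypes):
    """Return {(i, j): r^2} for all locus pairs i < j (the heatmap triangle)."""
    n_loci = len(haplotypes[0])
    return {(i, j): r_squared(haplotypes, i, j)
            for i, j in combinations(range(n_loci), 2)}
```

The dictionary returned by `pairwise_ld` holds one value per cell of the upper triangle, which is the quantity a triangular heatmap would colour.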
Discrete graphical models [9] (also known as discrete
Markov networks) provide a rich family of statistical
models to describe the distribution of multivariate discrete
data. They may be represented as undirected graphs in
which the nodes represent variables (here, SNPs) and
absent edges represent conditional independence
relations, in the sense that two variables that are not
connected by an edge are conditionally independent given
some other variables. To motivate this focus on
conditional rather than marginal associations, consider three
loci s1, s2, s3, and suppose that initially s2 is
polymorphic and s1 and s3 are monomorphic, so that two haplotypes
(1, 1, 1) and (1, 2, 1) are initially present. Suppose further
that a mutation subsequently occurs at s1 in the
haplotype (1, 1, 1), and another at s3 in the haplotype (1, 2, 1),
so that the population now contains the four haplotypes
(1, 1, 1), (1, 2, 1), (2, 1, 1) and (1, 2, 2). Observe that in
general s1 and s3 are marginally associated (are in LD), but in
the subpopulations corresponding to s2 = 1 and s2 = 2
they are unassociated: in other words, they are
conditionally independent given s2. More complex mutation
histories give rise to more complex patterns of
conditional independences that can be represented as graphical
models [8].
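The three-locus example can be checked numerically. The sketch below assumes, purely for illustration, that the four haplotypes are equally frequent, and uses the disequilibrium coefficient D as the measure of association; the function name is hypothetical.

```python
def d_coefficient(haplotypes, i, j):
    """Disequilibrium coefficient D = P(both loci carry allele 2) - P(i=2) * P(j=2)."""
    n = len(haplotypes)
    p_i = sum(h[i] == 2 for h in haplotypes) / n
    p_j = sum(h[j] == 2 for h in haplotypes) / n
    p_ij = sum(h[i] == 2 and h[j] == 2 for h in haplotypes) / n
    return p_ij - p_i * p_j


# The four haplotypes (s1, s2, s3) from the text, assumed equally frequent.
haps = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (1, 2, 2)]

# Marginal association between s1 (index 0) and s3 (index 2).
marginal = d_coefficient(haps, 0, 2)

# Association between s1 and s3 within each subpopulation defined by s2.
conditional = {
    allele: d_coefficient([h for h in haps if h[1] == allele], 0, 2)
    for allele in (1, 2)
}
```

Under these assumed frequencies `marginal` is non-zero while both entries of `conditional` are zero: within each s2 stratum one of the two loci is monomorphic, so s1 and s3 are unassociated given s2.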
Other authors have used graphical models for the
joint distributions of allele frequencies. Usually, in
high-dimensional applications, attention is restricted to a
tractable subclass, the decomposable graphical models
[10]. In the first use of decomposable models in this
context [11], models were selected using a greedy algorithm
based on significance tests. In [8,12] methods and
programs for selecting decomposable graphical models using
Markov chain Monte Carlo (MCMC) sampling were
described. These methods are computationally feasible for
modest numbers of markers (say, several hundreds), but
not for modern SNP arrays with hundreds of thousands of
SNPs per chromosome. To improve efficiency, the search
space may be restricted to graphical models whose
dependence graphs are interval graphs [13,14]. These are graphs
for which each vertex may be associated with an interval
of the real line such that two vertices are connected by an
edge if and only if their intervals overlap. In this context
the ordering of SNPs along the real line is their
physical ordering along the chromosome. MCMC sampling
from this model class may be performed more efficiently
[13,14]. This work was extended in [15] to a more general
subclass of decomposable models, namely those in which
distant marker pairs (i.e., with more than a given
number of intervening markers) are conditionally independent
given the intervening markers.
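The definition of an interval graph given above can be sketched in a few lines of Python. The intervals below are hypothetical and closed; how intervals are attached to SNPs in [13,14] is not specified in this section, so the example only illustrates the overlap rule.

```python
from itertools import combinations


def interval_graph_edges(intervals):
    """Edges of the interval graph defined by closed intervals.

    intervals: dict mapping a vertex name to a (start, end) pair.
    Two vertices are joined if and only if their intervals overlap.
    """
    edges = set()
    for (u, (a1, b1)), (v, (a2, b2)) in combinations(sorted(intervals.items()), 2):
        if a1 <= b2 and a2 <= b1:   # closed intervals share at least one point
            edges.add((u, v))
    return edges


# Hypothetical intervals for three SNPs ordered along the chromosome.
example = {"s1": (0, 2), "s2": (1, 3), "s3": (4, 5)}
example_edges = interval_graph_edges(example)
```

Here only s1 and s2 overlap, so the graph has a single edge and s3 is isolated.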
In an alternative approach [16-18] latent mixtures of
forests have been applied, in order to accommodate
short-, medium- and long-range LD patterns. Directed
graphs (Bayesian networks) have also been applied,
selecting edges and their directions using causal discovery
algorithms [19]. There are close links between
decomposable models and Bayesian networks ([10], Sect. 4.5.1).
A rather different approach to modelling the joint
distribution of allele frequencies [20,21] is implemented
in the software package BEAGLE [22], which is widely
used to process data from SNP arrays. The approach is
based on a class of models arising in the machine
learning literature called acyclic probabilistic finite automata
(APFA) [23]. These are related to time-variant variable
length Markov chains. For phase estimation and
imputation, BEAGLE uses an iterative scheme analogous to
the EM algorithm, alternating between sampling from a
haplotype-level model given the observed genotype data
(the E-step) and selecting a haplotype-level model given
the samples (the M-step). A similar computational scheme
for decomposable graphical models has been described
and implemented in the FitGMLD program [15].
Characterization of genetic variation at the popula (...truncated)