Probabilistic graphical models for genetic association studies
BRIEFINGS IN BIOINFORMATICS. VOL 13. NO 1. 20-33
Advance Access published on 30 March 2011
doi:10.1093/bib/bbr015
Raphaël Mourad, Christine Sinoquet and Philippe Leray
Submitted: 16th December 2010; Received (in revised form): 25th February 2011
Abstract
Keywords: machine learning; probabilistic graphical models; genetic association studies; linkage disequilibrium
INTRODUCTION
Most complex genetic disorders, i.e. disorders caused
by a combination of genetic and environmental factors, are common in the human population: asthma,
obesity, diabetes and some cancers, to cite a few examples [1]. To explain this phenomenon, the
common disease-common variant (often abbreviated
CD-CV) hypothesis states that a few common allelic
variants could account for the genetic variation in
disease susceptibility [2]. In this context, population
association studies (PASs) were proposed as a promising approach to discover the genetic basis of these
complex diseases, which are known to be a major public
health issue [3]. The principle of PASs is quite simple:
it consists of testing whether allelic frequencies
differ between affected and non-affected unrelated individuals (N.B.: the disease status indicator is
also called the phenotype). These studies exploit the existence of non-random associations of alleles at two
or more loci across the human genome. These associations are usually observed between close loci on
chromosomes and are called linkage disequilibrium
(LD). Thanks to LD, the observation of a large
number of genetic markers such as single nucleotide
polymorphisms (SNPs) over a chromosomal region
is expected to allow an accurate localization of
the (unobserved) causal mutations involved in the
disease etiology. Two main strategies of PASs
have been developed: hypothesis driven and non-hypothesis driven. Hypothesis-driven methods start
with the assumption that a particular region (in fine
mapping studies) or a gene or a set of genes (in candidate gene studies) may be associated with the disease, and try to closely localize causal mutations.
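The testing principle described above, comparing allelic frequencies between affected and non-affected individuals, can be illustrated with a minimal sketch. It assumes a single biallelic marker and uses a Pearson chi-square test on a 2x2 table of allele counts, which is one common choice and not necessarily the test used by the methods reviewed here. The function name `allelic_chi_square` and the counts are hypothetical and made up for illustration.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    case_counts and control_counts are (count of allele A, count of allele a).
    Returns the chi-square statistic and its p-value.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts, for illustration only (not real data).
chi2, p = allelic_chi_square(case_counts=(60, 40), control_counts=(40, 60))
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

A small p-value suggests that the allele frequencies differ between the two groups at this marker. In a real study, the test would be repeated over many markers and would require a correction for multiple testing.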
Corresponding author. Raphaël Mourad, Ecole Polytechnique de l’Université de Nantes, rue Christian Pauc, BP 50609, 44306 Nantes
Cedex 3, France. Tel: +33 2 40 68 30 49; Fax: +33 2 40 68 30 77; E-mail:
Raphaël Mourad is a PhD student in the Knowledge and Decision group at the Computer Science Laboratory of Nantes-Atlantic,
Polytechnic School of Nantes. He works on the development of probabilistic graphical models applied to linkage disequilibrium
modeling and genome-wide association studies.
Christine Sinoquet is an associate professor in the Knowledge and Decision group at the Computer Science Laboratory of
Nantes-Atlantic, University of Nantes, France. Her research interests include motif discovery in biological sequences, comparative
genomics, imputation of missing genotypic data and dissecting the genetic susceptibility of complex diseases. She has served as the
Head of the Master's degree program in Bioinformatics at the University of Nantes since 2005.
Philippe Leray is a full professor in the Knowledge and Decision group at the Computer Science Laboratory of Nantes-Atlantic,
Polytechnic School of Nantes. He teaches courses ranging from basic statistics to probabilistic graphical models. Since September 2008, he has also been the
Head of the Department of Computer Science. He has worked more intensively on Bayesian networks for the past 10
years, with interests in theory (Bayesian network structure learning, causality) and applications (reliability, intrusion detection,
bioinformatics).
© The Author 2011. Published by Oxford University Press. For Permissions, please email:
Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although they are less well known in association genetics, many successful methods based on them have recently emerged to dissect the genetic architecture of complex diseases. In this review article,
we cover the applications of these models in the context of population association studies, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations,
in particular, to the issue of scalability. Finally, we give promising directions for future research in this field.
Graphical models for genetics
role, for the last few years, in the dissection of the
human genetic heredity.
This article is organized as follows. As a prerequisite to further understanding, the ‘Fundamentals of
PGMs’ section presents an informal introduction to
PGMs. In the ‘LD modeling’ section, we review
PGMs designed to model genetic marker dependences and their applications to data dimension reduction and haplotype inference. The next section
covers PGMs in hypothesis-driven methods, such
as fine mapping and candidate-gene studies. Then,
the ‘GWASs’ section focuses on the approaches investigated to
deal with the very large amount of genetic data and
to efficiently extract main and epistatic effects. In the
next section, all PGM-based methods for PASs are
compared, and their main advantages and drawbacks
are discussed. Finally, the last section points out promising perspectives.
FUNDAMENTALS OF PGMs
In this section, we only focus on the most useful
aspects of PGMs needed to understand this review.
For a complete introduction to PGMs, readers are
referred to Ref. [7]. In this preliminary section, we
will restrict the study to discrete and finite variables.
First, let $X = \{X_1, \ldots, X_n\}$ be a set of $n$ random
variables.
As mentioned before, a central property encompassed in PGMs is conditional independence. Using
this property, it is possible to distinguish direct (or
conditional) dependences between variables from indirect (or marginal) dependences, which are defined
hereafter.
Definition 1
Marginal independence (MI) between two variables $X_i$ and
$X_j$, denoted $X_i \perp X_j$, is defined with reference to the joint probability
distribution (JPD) $P(X_i, X_j)$:
$$P(X_i, X_j) = P(X_i)\,P(X_j).$$
A non-equality implies that $X_i$ and $X_j$ are marginally
dependent.
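As a minimal sketch of Definition 1, the following Python code checks marginal independence for two discrete variables whose joint distribution is given as a table. The function names, the numerical tolerance and the example distributions are assumptions introduced for illustration only.

```python
from itertools import product


def marginals(joint):
    """Compute P(Xi) and P(Xj) from a joint table {(xi, xj): probability}."""
    p_i, p_j = {}, {}
    for (xi, xj), p in joint.items():
        p_i[xi] = p_i.get(xi, 0.0) + p
        p_j[xj] = p_j.get(xj, 0.0) + p
    return p_i, p_j


def is_marginally_independent(joint, tol=1e-9):
    """Return True if P(Xi, Xj) = P(Xi) P(Xj) for every pair of values."""
    p_i, p_j = marginals(joint)
    return all(
        abs(joint.get((xi, xj), 0.0) - p_i[xi] * p_j[xj]) <= tol
        for xi, xj in product(p_i, p_j)
    )


# Hypothetical joint built as a product of P(Xi) = (0.4, 0.6) and P(Xj) = (0.3, 0.7).
independent_joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}

# Hypothetical joint in which Xi and Xj tend to take the same value.
dependent_joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(is_marginally_independent(independent_joint))
print(is_marginally_independent(dependent_joint))
```

With estimated rather than exact probabilities, an exact equality check is not appropriate and a statistical test of independence would be used instead.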
Definition 2
Conditional independence (CI) between two variables $X_i$ and
$X_j$ knowing a subset of variables $S \subseteq X \setminus \{X_i, X_j\}$, denoted
$X_i \perp X_j \mid S$, is:
$$P(X_i, X_j \mid S) = P(X_i \mid S)\,P(X_j \mid S).$$
A non-equality implies that $X_i$ and $X_j$ are conditionally dependent knowing $S$.
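Definition 2 can be sketched in the same spirit. The code below assumes, for simplicity, that the conditioning set S reduces to a single discrete variable, and that the joint distribution over (Xi, Xj, S) is given as a table. All names and numbers are hypothetical. The first example is built from a common cause S, so Xi and Xj are conditionally independent knowing S even though they are marginally dependent.

```python
from itertools import product


def is_conditionally_independent(joint, tol=1e-9):
    """Check P(Xi, Xj | S) = P(Xi | S) P(Xj | S) on a table {(xi, xj, s): p}."""
    p_s, p_is, p_js = {}, {}, {}
    xi_vals, xj_vals = set(), set()
    for (xi, xj, s), p in joint.items():
        xi_vals.add(xi)
        xj_vals.add(xj)
        p_s[s] = p_s.get(s, 0.0) + p
        p_is[(xi, s)] = p_is.get((xi, s), 0.0) + p
        p_js[(xj, s)] = p_js.get((xj, s), 0.0) + p
    for xi, xj, s in product(xi_vals, xj_vals, p_s):
        if p_s[s] == 0.0:
            continue  # Conditioning on a zero-probability value is undefined.
        lhs = joint.get((xi, xj, s), 0.0) / p_s[s]
        rhs = (p_is.get((xi, s), 0.0) / p_s[s]) * (p_js.get((xj, s), 0.0) / p_s[s])
        if abs(lhs - rhs) > tol:
            return False
    return True


# Hypothetical common-cause model: S influences both Xi and Xj.
p_s = {0: 0.5, 1: 0.5}
p_x_given_s = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # used for both Xi and Xj
ci_joint = {
    (xi, xj, s): p_s[s] * p_x_given_s[s][xi] * p_x_given_s[s][xj]
    for xi, xj, s in product((0, 1), (0, 1), (0, 1))
}

# Marginally, P(Xi=0, Xj=0) differs from P(Xi=0) P(Xj=0): indirect dependence.
p_00 = sum(p for (xi, xj, s), p in ci_joint.items() if xi == 0 and xj == 0)
p_i0 = sum(p for (xi, xj, s), p in ci_joint.items() if xi == 0)
p_j0 = sum(p for (xi, xj, s), p in ci_joint.items() if xj == 0)

# Hypothetical counter-example: when S = 0, Xi and Xj are always equal.
not_ci_joint = {
    (0, 0, 0): 0.25, (1, 1, 0): 0.25,
    (0, 0, 1): 0.125, (0, 1, 1): 0.125, (1, 0, 1): 0.125, (1, 1, 1): 0.125,
}

print(is_conditionally_independent(ci_joint))
print(is_conditionally_independent(not_ci_joint))
```

The first example shows the distinction drawn in the text: the dependence between Xi and Xj is indirect (marginal) and vanishes once S is known.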
In contrast, non-hypothesis-driven studies, such as
genome-wide association studies (GWASs) [4], generally use brute-force methods to scan the whole
genome for associations, and have the advantage
of not requiring a priori information. Usually, in
PASs, the geneticist can only observe unphased
data (genotypes), i.e. allelic compositions on
chromosomes. Conversely, phased data (haplotypes),
which carry more natural informati (...truncated)