Probabilistic graphical models for genetic association studies

https://bib.oxfordjournals.org/content/13/1/20.full.pdf

BRIEFINGS IN BIOINFORMATICS. VOL 13. NO 1. 20-33
Advance Access published on 30 March 2011
doi:10.1093/bib/bbr015

Raphaël Mourad, Christine Sinoquet and Philippe Leray

Submitted: 16th December 2010; Received (in revised form): 25th February 2011

Keywords: machine learning; probabilistic graphical models; genetic association studies; linkage disequilibrium

INTRODUCTION

Most complex genetic disorders, i.e. disorders caused by a combination of genetic and environmental factors, are common in the human population: asthma, obesity, diabetes and some cancers, to cite a few examples [1]. To explain this phenomenon, the common disease-common variant (often abbreviated CD-CV) hypothesis states that a few common allelic variants could account for the genetic variation in disease susceptibility [2]. In this context, population association studies (PASs) were proposed as a promising approach to discover the genetic basis of these complex diseases, which are known to be a major public health issue [3].

The principle of PASs is quite simple: it consists of testing whether allelic frequencies differ between unaffected and affected unrelated individuals (N.B.: the disease status indicator is also called the phenotype). These studies exploit the existence of non-random associations of alleles at two or more loci across the human genome. These associations are usually observed between close loci on chromosomes and are called linkage disequilibrium (LD). Thanks to LD, the observation of a large number of genetic markers, such as single nucleotide polymorphisms (SNPs), over a chromosomal region is expected to allow an accurate localization of the (unobserved) causal mutations involved in the disease etiology.

Two main strategies of PASs have been developed: hypothesis driven and non-hypothesis driven.
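The allele-frequency comparison and the notion of LD described above can be made concrete with a small numerical sketch. The code below is illustrative only and is not taken from the review: it assumes a biallelic marker, uses a Pearson chi-square test on a 2x2 table of allele counts as one common way to compare cases with controls, and quantifies LD with the standard coefficients D and r². The function names `allelic_chi2` and `linkage_disequilibrium` and all counts and frequencies are hypothetical.

```python
from math import erfc, sqrt


def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls, columns are the two alleles of one marker.
    Assumes every row and column total is non-zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    table = ((case_a, case_b), (ctrl_a, ctrl_b))
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = rows[r] * cols[c] / n  # count expected under no association
            chi2 += (table[r][c] - expected) ** 2 / expected
    # Upper tail of a chi-square variable with 1 df, written with erfc.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic loci.

    p_ab is the frequency of the haplotype carrying allele A at the first
    locus and allele B at the second; p_a and p_b are the allele frequencies.
    Assumes both loci are polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b  # zero when alleles associate at random
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


if __name__ == "__main__":
    # Hypothetical counts: allele A seen 60 times in cases, 40 times in controls.
    print(allelic_chi2(60, 40, 40, 60))
    # Hypothetical frequencies: the two loci are perfectly correlated.
    print(linkage_disequilibrium(0.5, 0.5, 0.5))
```

A marker in strong LD with an unobserved causal mutation (r² close to 1) would be expected to show nearly the same case-control frequency difference as the mutation itself, which is why genotyped SNPs can act as proxies.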
Hypothesis-driven methods start with the assumption that a particular region (in fine mapping studies) or a gene or a set of genes (in candidate gene studies) may be associated with the disease, and try to closely localize causal mutations.

Corresponding author. Raphaël Mourad, Ecole Polytechnique de l’Université de Nantes, rue Christian Pauc, BP 50609, 44306 Nantes Cedex 3, France. Tel: +33 2 40 68 30 49; Fax: +33 2 40 68 30 77; E-mail:

Raphaël Mourad is a PhD student in the Knowledge and Decision group at the Computer Science Laboratory of Nantes-Atlantic, Polytechnic School of Nantes. He works on the development of probabilistic graphical models applied to linkage disequilibrium modeling and genome-wide association studies.

Christine Sinoquet is an associate professor in the Knowledge and Decision group at the Computer Science Laboratory of Nantes-Atlantic, University of Nantes, France. Her research interests include motif discovery in biological sequences, comparative genomics, imputation of missing genotypic data and dissecting the genetic susceptibility of complex diseases. She has served as Head of the Master's degree program in Bioinformatics at the University of Nantes since 2005.

Philippe Leray is a full professor in the Knowledge and Decision group at the Computer Science Laboratory of Nantes-Atlantic, Polytechnic School of Nantes. He teaches subjects ranging from basic statistics to probabilistic graphical models. Since September 2008, he has also been Head of the Department of Computer Science. He has worked more intensively on Bayesian networks for the past 10 years, with interests in theory (Bayesian network structure learning, causality) and applications (reliability, intrusion detection, bioinformatics).

© The Author 2011. Published by Oxford University Press.
For Permissions, please email:

Abstract

Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although less well known in association genetics, many successful methods have recently emerged to dissect the genetic architecture of complex diseases. In this review article, we cover the applications of these models in the context of population association studies, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations, in particular to the issue of scalability. Finally, we give promising directions for future research in this field.

[...] role, for the last few years, in the dissection of human genetic heredity.

This article is organized as follows. As a prerequisite to further understanding, the ‘Fundamentals of PGMs’ section presents an informal introduction to PGMs. In the ‘LD modeling’ section, we review PGMs designed to model genetic marker dependences and their applications to data dimension reduction and haplotype inference. The next section covers PGMs in hypothesis-driven methods, such as fine mapping and candidate-gene studies. Then, the ‘GWASs’ section focuses on the leads investigated to deal with the very large amount of genetic data and to efficiently extract main and epistatic effects. In the next section, all PGM-based methods for PASs are compared, and their main advantages and drawbacks are discussed. Finally, the last section points out promising perspectives.

FUNDAMENTALS OF PGMs

In this section, we focus only on the aspects of PGMs that are most useful for understanding this review. For a complete introduction to PGMs, readers are referred to Ref. [7].
In this preliminary section, we restrict the study to discrete and finite variables. First, let $X = \{X_1, \ldots, X_n\}$ be a set of $n$ random variables. As mentioned before, a central property encompassed in PGMs is conditional independence. Using this property, it is possible to distinguish direct (or conditional) dependences between variables from indirect (or marginal) dependences, which are defined hereafter.

Definition 1. Marginal independence (MI) between two variables $X_i$ and $X_j$, denoted $X_i \perp X_j$, is defined with reference to the joint probability distribution (JPD) $P(X_i, X_j)$:

$$P(X_i, X_j) = P(X_i)\,P(X_j).$$

If this equality does not hold, $X_i$ and $X_j$ are marginally dependent.

Definition 2. Conditional independence (CI) between two variables $X_i$ and $X_j$ given a subset of variables $S \subseteq X \setminus \{X_i, X_j\}$, denoted $X_i \perp X_j \mid S$, is defined by:

$$P(X_i, X_j \mid S) = P(X_i \mid S)\,P(X_j \mid S).$$

If this equality does not hold, $X_i$ and $X_j$ are conditionally dependent given $S$.

In contrast, non-hypothesis-driven studies, such as genome-wide association studies (GWASs) [4], generally use brute-force methods to scan the whole genome for associations, and have the advantage of not requiring a priori information. Usually, in PASs, the geneticist can only observe unphased data (genotypes), i.e. allelic compositions on chromosomes. Conversely, phased data (haplotypes), which carry more natural informati (...truncated)
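Returning to Definitions 1 and 2, the difference between marginal and conditional independence can be checked numerically on a small joint table. The sketch below is illustrative and not part of the review: it assumes three binary variables stored as a dictionary keyed by (x_i, x_j, s), treats S as a single variable rather than a general subset, and tests the conditional equality in the equivalent product form P(x_i, x_j, s) P(s) = P(x_i, s) P(x_j, s) to avoid division by zero. The helper names and the example probabilities are hypothetical.

```python
from itertools import product
from math import isclose


def marginal(joint, axes):
    """Sum a joint table {(x_i, x_j, s): prob} down to the given axes."""
    out = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[a] for a in axes)
        out[key] = out.get(key, 0.0) + prob
    return out


def marginally_independent(joint, tol=1e-9):
    """Definition 1: P(x_i, x_j) == P(x_i) P(x_j) for every pair of values."""
    p_ij = marginal(joint, (0, 1))
    p_i = marginal(joint, (0,))
    p_j = marginal(joint, (1,))
    return all(
        isclose(p_ij.get((a, b), 0.0), p_i[(a,)] * p_j[(b,)], abs_tol=tol)
        for (a,), (b,) in product(p_i, p_j)
    )


def conditionally_independent(joint, tol=1e-9):
    """Definition 2 with a single conditioning variable S (an assumption here).

    Checks P(x_i, x_j, s) * P(s) == P(x_i, s) * P(x_j, s), which is the
    defining equality multiplied through by P(s)^2.
    """
    p_is = marginal(joint, (0, 2))
    p_js = marginal(joint, (1, 2))
    p_s = marginal(joint, (2,))
    values_i = {k[0] for k in joint}
    values_j = {k[1] for k in joint}
    return all(
        isclose(
            joint.get((a, b, s), 0.0) * p_s[(s,)],
            p_is.get((a, s), 0.0) * p_js.get((b, s), 0.0),
            abs_tol=tol,
        )
        for a, b, (s,) in product(values_i, values_j, p_s)
    )


def common_cause_joint(p_s1=0.5, p_x1_given_s=(0.1, 0.9)):
    """Hypothetical joint in which S influences both X_i and X_j separately."""
    joint = {}
    for xi, xj, s in product((0, 1), repeat=3):
        ps = p_s1 if s else 1 - p_s1
        pi = p_x1_given_s[s] if xi else 1 - p_x1_given_s[s]
        pj = p_x1_given_s[s] if xj else 1 - p_x1_given_s[s]
        joint[(xi, xj, s)] = ps * pi * pj
    return joint


if __name__ == "__main__":
    joint = common_cause_joint()
    print("marginally independent:", marginally_independent(joint))
    print("conditionally independent given S:", conditionally_independent(joint))
```

In this constructed example the two variables are dependent when S is ignored but independent once S is known, which is the pattern of an indirect (marginal) dependence in the sense used above.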