Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives
BIOINFORMATICS
ORIGINAL PAPER
Genetics and population analysis
Vol. 30 no. 15 2014, pages 2189–2196
doi:10.1093/bioinformatics/btu198
Advance Access publication April 16, 2014
Alexandre Bureau1,2,*, Samuel G. Younkin3, Margaret M. Parker4, Joan E. Bailey-Wilson5,
Mary L. Marazita6, Jeffrey C. Murray7, Elisabeth Mangold8, Hasan Albacha-Hejazi9,
Terri H. Beaty4 and Ingo Ruczinski3,*
Associate Editor: Jeffrey Barrett
ABSTRACT
Motivation: Family-based designs are regaining popularity for genomic sequencing studies because they provide a way to test cosegregation with disease of variants that are too rare in the
population to be tested individually in a conventional case–control
study.
Results: Where only a few affected subjects per family are
sequenced, the probability that any variant would be shared by all
affected relatives—given it occurred in any one family member—provides evidence against the null hypothesis of a complete absence of
linkage and association. A P-value can be obtained as the sum of the
probabilities of sharing events as (or more) extreme in one or more
families. We generalize an existing closed-form expression for exact
sharing probabilities to more than two relatives per family. When pedigree founders are related, we show that an approximation of sharing
probabilities based on empirical estimates of kinship among founders
obtained from genome-wide marker data is accurate for low levels of
kinship. We also propose a more generally applicable approach based
on Monte Carlo simulations. We applied this method to a study of 55
multiplex families with apparent non-syndromic forms of oral clefts
from four distinct populations, with whole exome sequences available
for two or three affected members per family. The rare single nucleotide variant rs149253049 in ADAMTS9 shared by affected relatives in
three Indian families achieved significance after correcting for multiple
comparisons ($p = 2 \times 10^{-6}$).
Availability and implementation: Source code and binaries of the R
package RVsharing are freely available for download at http://cran.
r-project.org/web/packages/RVsharing/index.html.
Contact: or
Supplementary information: Supplementary data are available at
Bioinformatics online.
The advent of high-throughput sequencing of whole exomes and
even whole genomes opens the possibility of detecting rare variants (RVs, including those unique to a family, and alleles up to a
frequency of 1% in a population) impacting human health. The
first successful applications of exome sequencing have been with
rare Mendelian traits (Gilissen et al., 2012). A common study
design to discover highly penetrant causal variants that are rare
in families where previous genotyping has not been performed is
to sequence the exome (or increasingly, the whole genome) of
two or three affected subjects, and focus on novel variants predicted to be functional and shared by all sequenced family members as likely causal variants (Gilissen et al., 2012).
Contrary to monogenic Mendelian traits, considerable genetic
heterogeneity must be expected with complex diseases. Familial
forms of numerous common complex diseases are caused by
RVs, supporting the hypothesis that RVs may explain a part
of the so-called ‘missing heritability’ of these diseases, although
the extent of the contribution of RVs to complex disease heritability is an ongoing debate (Gibson, 2012). In a family where
cases cluster, there is a high probability that multiple affected
members carry the same rare disease predisposing variant if such
a variant exists and its penetrance is high (Cirulli and Goldstein,
2010; Wijsman, 2012). This gives an advantage to family samples
over the samples of unrelated individuals, where disease-causing
RVs may be seen only once or twice among tens of thousands of
subjects.
As with Mendelian disorders, it has initially been proposed to
use the RV sharing information to filter out RVs not shared in at
least one family (Feng et al., 2011). For variants sufficiently rare
so copies in the sequenced relatives are almost certainly identical
by descent (IBD), the probability that an RV independent of the
disease and detected in at least one sequenced subject would not
be shared by other sequenced relatives who are affected was
computed by Feng et al. (2011) to quantify the effectiveness of
what they call the ‘concordance filter’ in discarding irrelevant
RVs. We adopt the view that the probability that an RV would
Received on November 20, 2013; revised on March 14, 2014;
accepted on April 9, 2014
*To whom correspondence should be addressed.
INTRODUCTION
Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
1Centre de Recherche de l’Institut Universitaire en Santé Mentale de Québec, G1J 2G3, 2Département de Médecine
Sociale et Préventive, Université Laval, Québec, G1V 0A6 Canada, 3Department of Biostatistics, 4Department of
Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, 5Inherited Disease Research
Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD 21224, 6Department of
Oral Biology, Center for Craniofacial and Dental Genetics, School of Dental Medicine, University of Pittsburgh, PA 15219,
7Department of Pediatrics, School of Medicine, University of Iowa, IA 52242, USA, 8Institute of Human Genetics,
University of Bonn, Bonn D-53127, Germany and 9Dr. Hejazi Clinic, P.O. Box 2519, Riyadh 11461, Saudi Arabia
2
founders are unrelated and we assume the variant is rare enough that a
single copy exists among all the alleles present among the $n_f$ founders of
the pedigree linking the sequenced subjects. In a generalization, we allow
founders to be related, and allow for up to two copies of the RV to be
introduced into the pedigree by related founders. We finally demonstrate
how RV sharing probabilities computed in a single family can be combined across multiple families where the same variant is seen, and how to
derive the P-value for the hypothesis test.
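The combination across families can be illustrated with a minimal Python sketch. It assumes that families are independent, that the per-family sharing probabilities are already known, and that "as (or more) extreme" means every sharing pattern whose probability under the null does not exceed that of the observed pattern. The function name `sharing_p_value` and its arguments are hypothetical and are not taken from the RVsharing package.

```python
from fractions import Fraction
from itertools import product
from math import prod


def sharing_p_value(share_probs, observed):
    """Sum the null probabilities of all sharing patterns that are at most
    as probable as the observed one (illustrative assumption, see lead-in).

    share_probs: per-family probability that all sequenced affected
                 relatives share the RV, given it was seen in the family.
    observed:    per-family booleans, True if the RV was shared by all.
    """
    def pattern_prob(pattern):
        # Families are assumed independent, so probabilities multiply.
        return prod(p if shared else 1 - p
                    for p, shared in zip(share_probs, pattern))

    p_obs = pattern_prob(observed)
    total = 0
    # Enumerate every shared / not-shared pattern across the families.
    for pattern in product((True, False), repeat=len(share_probs)):
        p = pattern_prob(pattern)
        if p <= p_obs:
            total += p
    return total


# Hypothetical example: three families, each with sharing probability 1/15,
# and the variant shared by all sequenced relatives in every family.
p_value = sharing_p_value([Fraction(1, 15)] * 3, [True, True, True])
print(p_value)  # 1/3375
```

Exact fractions are used so that ties between equally probable patterns are compared without rounding error. The enumeration grows as 2 to the power of the number of families, so this sketch is only practical for a small number of families carrying the same variant.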
2.1 Rare variant sharing probability assuming unrelated founders
We define the following random variables:
$C_i$: number of copies of the RV received by sequenced subject $i$,
$F_j$: indicator variable that founder $j$ introduced one copy of the RV into the pedigree,
$D_{ij}$: number of generations (meioses) between subject $i$ and his or her ancestor $j$.
For a set of n sequenced subjects, we want to compute the probability
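\[
\begin{aligned}
P[\text{RV shared}] &= P[C_1 = \dots = C_n = 1 \mid C_1 + \dots + C_n \geq 1] \\
&= \frac{P[C_1 = \dots = C_n = 1]}{P[C_1 + \dots + C_n \geq 1]} \\
&= \frac{\sum_{j=1}^{n_f} P[C_1 = \dots = C_n = 1 \mid F_j]\, P[F_j]}{\sum_{j=1}^{n_f} P[C_1 + \dots + C_n \geq 1 \mid F_j]\, P[F_j]}
\end{aligned}
\]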
where the expression on the third line results from our assumption of a
single copy of that RV among all alleles present in the $n_f$ founders. The
probabilities P[F (...truncated)
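The conditional probability defined in this section can be checked numerically on small pedigrees. The following Python sketch enumerates every pattern of allele transmission exactly. It assumes that each founder is equally likely to be the one introducing the single copy of the RV (a uniform $P[F_j]$, which is an assumption of this sketch), and that the pedigree dictionary lists parents before their children. The function name and the data layout are hypothetical and do not reproduce the closed-form expression of the paper or the RVsharing implementation.

```python
from fractions import Fraction
from itertools import product


def sharing_probability(pedigree, sequenced):
    """Exact P[all sequenced subjects carry one copy | at least one does].

    pedigree:  dict mapping each non-founder to its (father, mother) pair,
               with parents listed before their children.
    sequenced: identifiers of the sequenced subjects.
    """
    founders = sorted({p for pair in pedigree.values() for p in pair}
                      - set(pedigree))
    # One meiosis per (child, parent) pair.
    meioses = [(child, parent)
               for child, pair in pedigree.items() for parent in pair]
    weight = Fraction(1, 2 ** len(meioses))  # each pattern equally likely
    p_founder = Fraction(1, len(founders))   # assumed uniform P[F_j]
    numerator = Fraction(0)
    denominator = Fraction(0)
    for introducer in founders:
        for bits in product((0, 1), repeat=len(meioses)):
            passes = dict(zip(meioses, bits))
            copies = {f: 0 for f in founders}
            copies[introducer] = 1  # single copy among all founder alleles
            for child, pair in pedigree.items():
                # A parent with one copy transmits it when the bit is 1;
                # a parent with two copies always transmits one.
                copies[child] = sum(
                    1 if copies[parent] == 2
                    else copies[parent] * passes[(child, parent)]
                    for parent in pair
                )
            counts = [copies[s] for s in sequenced]
            if all(c == 1 for c in counts):
                numerator += weight * p_founder
            if sum(counts) >= 1:
                denominator += weight * p_founder
    return numerator / denominator


# Hypothetical pedigree: two first cousins descended from GF and GM.
cousins = {
    "A": ("GF", "GM"), "B": ("GF", "GM"),
    "C1": ("SA", "A"), "C2": ("SB", "B"),
}
print(sharing_probability(cousins, ["C1", "C2"]))  # 1/15
```

The enumeration visits 2 to the power of the number of meioses for each founder, so it is meant only as a check on small pedigrees. Because the founder probabilities are taken to be equal, they cancel between the numerator and the denominator.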