A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data
Hokeun Sun 1
Shuang Wang 0
Associate Editor: Janet Kelso
0 Department of Biostatistics, Mailman School of Public Health, Columbia University , New York, NY 10032 , USA
1 Department of Statistics, Pusan National University , Pusan 609-735 , Korea
Motivation: Existing association methods for rare variants from sequencing data have focused on aggregating variants in a gene or a genetic region because analysing individual rare variants is underpowered. However, these methods cannot identify which of the rare variants in a gene or a genetic region are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in the association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region.
Results: In this article, we propose a power set-based statistical selection procedure that is able to identify the locations of the potentially susceptible rare variants within a disease-related gene or a genetic region. The selection performance of the proposed procedure was evaluated through simulation studies, in which we demonstrated its feasibility and its superior power over several comparable existing methods. In particular, the proposed method is able to handle mixed effects, when both risk and protective variants are present in a gene or a genetic region. The proposed selection procedure was also applied to the sequence data on the ANGPTL gene family from the Dallas Heart Study to identify potentially susceptible rare variants within the trait-related genes.
Availability and implementation: An R package 'rvsel' can be downloaded from http://www.columbia.edu/ sw2206/ and http://statsun.pusan.ac.kr.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
The fundamental problem with rare variants [with minor allele
frequency (MAF) < 1%] is their low frequency, i.e. the limited
number of observed carriers lowers the statistical power to detect
phenotypic association with any single rare variant. Thus, almost
all existing statistical methods to detect disease/trait-associated
rare variants follow the framework of aggregating and testing all
rare variants in a gene or a candidate genomic region thereby
boosting the association signal
(Bhatia et al., 2010; Chen et al.,
2012; Cheung et al., 2012; Ionita-Laza et al., 2011; Lee et al.,
2012; Lin and Tang, 2011; Liu and Leal, 2008, 2010; Madsen and
Browning, 2009; Neale et al., 2011; Price et al., 2010; Wu et al.,
2011)
. The existing methods try to improve on this basic idea of
aggregating signals in two respects: first, by more potent
extraction of signals from individual rare variants; and second, by
better aggregation of the signals extracted from multiple rare variants
in a gene or a genetic region of interest. Improvements of the first
kind include upweighting variants as they become rarer
(Madsen
and Browning, 2009)
along with flexibility of the threshold for a
rare variant
(Price et al., 2010)
and accommodation of both risk
and protective variants in a genetic region of interest
(Ionita-Laza et al., 2011)
. Methods for more powerfully aggregating
statistical signals include kernel-based adaptive clustering,
which assigns weights to multi- rather than single-site genotypes
(Liu and Leal, 2010)
, the C-alpha test statistic, which contrasts the
observed versus expected variance of binomially distributed allele
counts
(Neale et al., 2011)
, and regression-based models that
extract and aggregate signals
(Lee et al., 2012; Lin and Tang,
2011; Wu et al., 2011)
.
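To make two of the ideas above concrete, the following is a minimal sketch and not the implementation of any cited method. The first function upweights rarer variants using a simplified weight, 1/sqrt(q(1 - q)), where q is a smoothed allele frequency estimated in unaffected individuals; this is in the spirit of Madsen and Browning (2009) but omits details of their statistic. The second function computes a variance-contrast statistic of the C-alpha form, summing over variants the squared deviation of the case allele count from its binomial expectation minus the binomial variance. The function names, argument layout and toy data are assumptions made for illustration only.

```python
from math import sqrt


def rarity_weights(control_genotypes):
    """Return one weight per variant; rarer variants get larger weights.

    control_genotypes: list of rows, one per unaffected individual, each row
    holding minor-allele counts (0, 1 or 2) for every variant.
    Simplified assumption: weight = 1 / sqrt(q * (1 - q)).
    """
    n_u = len(control_genotypes)
    n_variants = len(control_genotypes[0])
    weights = []
    for j in range(n_variants):
        m_u = sum(row[j] for row in control_genotypes)
        # Add-one smoothing keeps q strictly between 0 and 1.
        q = (m_u + 1) / (2 * n_u + 2)
        weights.append(1.0 / sqrt(q * (1.0 - q)))
    return weights


def variance_contrast(case_counts, total_counts, p0):
    """Sum over variants of (y - n*p0)^2 - n*p0*(1 - p0).

    case_counts[i]:  copies of variant i observed in cases (y)
    total_counts[i]: copies of variant i observed in cases and controls (n)
    p0:              expected share of copies in cases under no association
    Large positive values indicate more dispersion than the binomial predicts.
    """
    return sum(
        (y - n * p0) ** 2 - n * p0 * (1.0 - p0)
        for y, n in zip(case_counts, total_counts)
    )


# Toy data (hypothetical): 4 controls, 2 variants.
controls = [[0, 0], [1, 0], [0, 0], [1, 0]]
w = rarity_weights(controls)

# Toy data (hypothetical): one risk-like, one protective-like, one near-neutral variant.
t = variance_contrast([4, 0, 3], [4, 4, 4], 0.5)
```

In the toy example the variant unobserved in controls receives the larger weight, and the statistic is positive because the first variant appears only in cases and the second only in controls. Both deviations contribute with the same sign, which is why a variance contrast remains sensitive when risk and protective variants are mixed.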
However, these existing rare variant detection methods are not
able to identify which rare variants in a gene or a genetic region
out of all variants are associated with the complex diseases or
traits. Once phenotypic associations of a gene or a genetic region
are identified, the natural next step in the association study with
sequencing data is to locate the susceptible rare variants within
the gene or the genetic region. There have been a few testing
procedures based on the subset selection of rare variants such
as the variable thresholding (VT)
(Price et al., 2010)
and
RARECOVER (Rcover)
(Bhatia et al., 2010)
methods.
However, VT is not designed to select potentially causal variants
within a gene or a genetic region. Rcover collapses multiple rare
variants within a gene or a genetic region using the combined
multivariate and collapsing test (CMC) proposed by
Liu and
Leal (2008)
. It has low power to identify causal variants when
both risk and protective variants are present within a gene or a
genetic region. Moreover, Rcover applies Pearson’s χ²
statistic in the test (...truncated)