A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data

Bioinformatics, Aug 2014

Motivation: Existing association methods for rare variants from sequencing data have focused on aggregating variants in a gene or a genetic region because of the fact that analysing individual rare variants is underpowered. However, these existing rare variant detection methods are not able to identify which rare variants in a gene or a genetic region of all variants are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in the association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region.

Results: In this article, we propose a power set-based statistical selection procedure that is able to identify the locations of the potentially susceptible rare variants within a disease-related gene or a genetic region. The selection performance of the proposed selection procedure was evaluated through simulation studies, where we demonstrated the feasibility and superior power over several comparable existing methods. In particular, the proposed method is able to handle the mixed effects when both risk and protective variants are present in a gene or a genetic region. The proposed selection procedure was also applied to the sequence data on the ANGPTL gene family from the Dallas Heart Study to identify potentially susceptible rare variants within the trait-related genes.

Availability and implementation: An R package 'rvsel' can be downloaded from http://www.columbia.edu/~sw2206/ and http://statsun.pusan.ac.kr.

Contact: sw2206{at}columbia.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Hokeun Sun (1), Shuang Wang (0)

(0) Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
(1) Department of Statistics, Pusan National University, Pusan 609-735, Korea

Associate Editor: Janet Kelso
*To whom correspondence should be addressed.

1 INTRODUCTION

The fundamental problem with rare variants [with minor allele frequency (MAF) <1%] is their low frequency, i.e. the limited number of observed carriers lowers the statistical power to detect phenotypic association with any single rare variant. Thus, almost all existing statistical methods to detect disease/trait-associated rare variants follow the framework of aggregating and testing all rare variants in a gene or a candidate genomic region, thereby boosting the association signal (Bhatia et al., 2010; Chen et al., 2012; Cheung et al., 2012; Ionita-Laza et al., 2011; Lee et al., 2012; Lin and Tang, 2011; Liu and Leal, 2008, 2010; Madsen and Browning, 2009; Neale et al., 2011; Price et al., 2010; Wu et al., 2011). The existing methods try to improve this basic idea of aggregating signals in two respects: first, by more potent extraction of signals from individual rare variants; and second, by better aggregation of the signals extracted from multiple rare variants in a gene or a genetic region of interest. Improvements of the first kind include upweighting variants as they become rarer (Madsen and Browning, 2009), a flexible frequency threshold for defining a rare variant (Price et al., 2010) and accommodation of both risk and protective variants in a genetic region of interest (Ionita-Laza et al., 2011). Methods for more powerfully aggregating statistical signals include kernel-based adaptive clustering, which assigns weights to multi-site rather than single-site genotypes (Liu and Leal, 2010); the C-alpha test statistic, which contrasts the observed versus expected variance of binomially distributed allele counts (Neale et al., 2011); and regression-based models that extract and aggregate signals (Lee et al., 2012; Lin and Tang, 2011; Wu et al., 2011).
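To make the aggregation idea concrete, the following minimal Python sketch contrasts a simple carrier-collapsing count with the unstandardized score term of the C(alpha) statistic of Neale et al. (2011), T = sum_j [(y_j - n_j p0)^2 - n_j p0 (1 - p0)], where n_j is the number of minor alleles observed at variant j, y_j is the number of those seen in cases and p0 is the proportion of cases. The function names and the toy data are assumptions made for illustration; this is not code from the 'rvsel' package, and it omits the variance standardization and singleton handling of the published test.

```python
from typing import Sequence, Tuple

Genotypes = Sequence[Sequence[int]]  # rows = individuals, columns = variants (minor-allele counts)


def collapsed_carrier_counts(genotypes: Genotypes, is_case: Sequence[bool]) -> Tuple[int, int]:
    """Count cases and controls carrying at least one minor allele at any variant."""
    case_carriers = 0
    control_carriers = 0
    for row, case in zip(genotypes, is_case):
        if any(g > 0 for g in row):
            if case:
                case_carriers += 1
            else:
                control_carriers += 1
    return case_carriers, control_carriers


def c_alpha_score(genotypes: Genotypes, is_case: Sequence[bool]) -> float:
    """Unstandardized C(alpha) score term: observed minus expected binomial variance, summed over variants."""
    p0 = sum(1 for c in is_case if c) / len(is_case)  # proportion of cases
    total = 0.0
    for j in range(len(genotypes[0])):
        n_j = sum(row[j] for row in genotypes)  # all minor alleles at variant j
        y_j = sum(row[j] for row, case in zip(genotypes, is_case) if case)  # those in cases
        total += (y_j - n_j * p0) ** 2 - n_j * p0 * (1 - p0)
    return total


# Hypothetical toy data: variant 0 appears only in cases (risk-like),
# variant 1 appears only in controls (protective-like).
genotypes = [[1, 0], [1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [0, 1], [0, 0]]
is_case = [True, True, True, True, False, False, False, False]

print(collapsed_carrier_counts(genotypes, is_case))  # (3, 3): collapsing sees no imbalance
print(c_alpha_score(genotypes, is_case))             # 3.0: excess variance is still visible
```

In this toy example the collapsed carrier counts are identical in cases and controls because the two opposite effects cancel, while the variance-based score remains positive. Neither quantity indicates which individual variant drives the signal.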
However, these existing rare variant detection methods are not able to identify which rare variants, out of all variants in a gene or a genetic region, are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in the association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region. There have been a few testing procedures based on the subset selection of rare variants, such as the variable thresholding (VT) (Price et al., 2010) and RARECOVER (Rcover) (Bhatia et al., 2010) methods. However, VT is not designed to select potentially causal variants within a gene or a genetic region. Rcover collapses multiple rare variants within a gene or a genetic region using the combined multivariate and collapsing test (CMC) proposed by Liu and Leal (2008). It has low power to identify causal variants when both risk and protective variants are present within a gene or a genetic region. Moreover, Rcover applies the Pearson's χ² statistic in the test (...truncated)


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/30/16/2317.full.pdf

Hokeun Sun, Shuang Wang. A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data, Bioinformatics, 2014, pp. 2317-2323, 30/16, DOI: 10.1093/bioinformatics/btu207