Screening properties of trend tests in genetic association studies

https://www.nature.com/articles/s41598-023-35929-4.pdf

www.nature.com/scientificreports OPEN

Zhenzhen Jiang1,2, Hongping Guo3 & Jinjuan Wang4*

In genome-wide association studies, extracting disease-associated genetic variants from among millions of single nucleotide polymorphisms is of great importance. When the response is a binary variable, the Cochran-Armitage trend tests and the associated MAX test are among the most widely used methods for association analysis. However, theoretical guarantees for applying these methods to variable screening have not been established. To fill this gap, we propose screening procedures based on adjusted versions of these methods and prove their sure screening and ranking consistency properties. Extensive simulations are conducted to compare the performance of different screening procedures and to demonstrate the robustness and efficiency of the MAX test-based screening procedure. A case study on a type 1 diabetes dataset further verifies their effectiveness.

With the development of high-throughput sequencing techniques, hundreds of thousands of single nucleotide polymorphisms (SNPs) in the genome are recorded, which enables researchers to investigate and treat diseases from the perspective of genetic variants. To identify the disease-related genes or genetic markers among all these SNPs, the genome-wide association study (GWAS) is a widely used strategy. To date, more than one hundred thousand SNPs have been identified as related to various traits1–7. A typical GWAS tests the association between the phenotype and each SNP in turn, obtains a series of test statistics or p-values, and selects the associated SNPs by comparing these statistics or p-values with a given threshold. When the phenotype is binary, the Cochran-Armitage trend test (CATT)8 is commonly used to detect the associated SNPs.
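The per-SNP testing workflow described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard form of the CATT statistic with an additive score vector (0, 1/2, 1), a standard normal reference distribution for the p-value, and a threshold `alpha` that is purely hypothetical. The function names `catt`, `two_sided_p` and `scan` and the toy genotype counts are assumptions introduced here.

```python
import math


def catt(case_counts, control_counts, scores=(0.0, 0.5, 1.0)):
    """Cochran-Armitage trend statistic for one SNP.

    case_counts / control_counts: genotype counts in the order (aa, Aa, AA).
    scores: genotype scores; the additive coding is an assumption of this sketch.
    """
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    numerator = math.sqrt(n) * sum(
        x * (s * a - r * b)
        for x, a, b in zip(scores, case_counts, control_counts)
    )
    variance = r * s * (
        n * sum(x * x * t for x, t in zip(scores, totals))
        - sum(x * t for x, t in zip(scores, totals)) ** 2
    )
    if variance <= 0:
        return 0.0  # no genotype variation under this coding
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def scan(snps, alpha):
    """Test each SNP in turn; return indices whose p-value falls below alpha."""
    return [
        j
        for j, (cases, controls) in enumerate(snps)
        if two_sided_p(catt(cases, controls)) < alpha
    ]


# Toy genotype counts (cases, controls), invented for illustration only.
snps = [
    ((10, 20, 70), (40, 40, 20)),  # strong trend
    ((30, 50, 20), (30, 50, 20)),  # identical distributions
    ((25, 50, 25), (30, 50, 20)),  # weak trend
]
selected = scan(snps, alpha=1e-4)  # alpha is a hypothetical threshold
```

In a real GWAS the threshold would have to account for the number of SNPs tested; the value used here only serves to make the selection step visible.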
It has been shown that when the underlying genetic model is known, the commonly used ones being the recessive, additive and dominant models, CATT has an optimal form9,10. However, the true genetic model is usually unknown and may be very complicated. For the sake of robustness, an omnibus test called MAX has been proposed11,12, which uses the maximum of the CATTs under different genetic models as a measure of association. The asymptotic distribution of MAX is given in the work of Zheng et al.13. Since its introduction, MAX has been widely used and investigated. Li et al.14 introduced a selection procedure based on the rank of MAX. Kim et al.15 proposed a SNP selection method based on MAX and a penalized support vector machine strategy. Though CATTs and MAX have concise forms and are extensively used, the theoretical properties of applying CATTs and MAX to GWAS have not been investigated.

To control the false discovery rate (FDR) in GWAS, the Bonferroni correction and FDR control procedures, such as the Benjamini–Hochberg procedure, are two widely used strategies. But they both assume that all the SNPs are independent, which is clearly inappropriate since linkage disequilibrium usually exists among SNPs, and this may lead to related SNPs being missed. Considering these drawbacks, feature screening methods are sensible alternatives. Rather than selecting the associated SNPs directly, feature screening approaches aim to eliminate most of the irrelevant SNPs first. After a screening procedure, only a small number of SNPs remain, and researchers can concentrate on these, which saves much time and work.

In the last few years, feature screening methods have been proposed for various situations. Fan and Lv16 first proposed a screening method called the sure independence screening approach for Gaussian responses and predictors under linear regression.
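To make the MAX statistic and the screening idea above concrete, here is a minimal sketch under stated assumptions. It takes MAX as the largest absolute CATT over the recessive, additive and dominant score vectors (0, 0, 1), (0, 1/2, 1) and (0, 1, 1), which are the conventional codings when A is the risk allele, and keeps the `d` SNPs with the largest MAX values. It ranks by the unadjusted statistic; the adjusted statistics that the paper actually analyses are not reproduced here. The names `MODEL_SCORES`, `max_statistic` and `screen`, the cutoff `d`, and the toy counts are all hypothetical.

```python
import math

# Conventional genotype scores (aa, Aa, AA) per genetic model; an assumption here.
MODEL_SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt(case_counts, control_counts, scores):
    """Standard Cochran-Armitage trend statistic for one SNP and one score vector."""
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    numerator = math.sqrt(n) * sum(
        x * (s * a - r * b)
        for x, a, b in zip(scores, case_counts, control_counts)
    )
    variance = r * s * (
        n * sum(x * x * t for x, t in zip(scores, totals))
        - sum(x * t for x, t in zip(scores, totals)) ** 2
    )
    if variance <= 0:
        return 0.0  # no genotype variation under this coding
    return numerator / math.sqrt(variance)


def max_statistic(case_counts, control_counts):
    """Largest absolute CATT across the three genetic models."""
    return max(
        abs(catt(case_counts, control_counts, scores))
        for scores in MODEL_SCORES.values()
    )


def screen(snps, d):
    """Return the indices of the d SNPs with the largest MAX statistic."""
    stats = [max_statistic(cases, controls) for cases, controls in snps]
    order = sorted(range(len(snps)), key=lambda j: stats[j], reverse=True)
    return order[:d]


# Toy genotype counts (cases, controls), invented for illustration only.
snps = [
    ((30, 50, 20), (30, 50, 20)),  # identical distributions
    ((10, 20, 70), (40, 40, 20)),  # strong effect
    ((25, 50, 25), (30, 50, 20)),  # weak effect
]
kept = screen(snps, d=2)  # d is a user-chosen cutoff in this sketch
```

Screening keeps a ranked shortlist rather than declaring significance, so the choice of `d` trades off the risk of dropping a relevant SNP against the amount of follow-up work.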
Since then, the sure screening property, under which all the important predictors are retained with high probability as the sample size goes to infinity, has been regarded as a criterion for feature screening. Many screening procedures have been developed for diverse models, such as the generalized linear model17 and the additive model18, among others. Although many procedures can be directly applied to GWAS with corresponding models and data types, only PC-SIS, proposed in the work of Huang et al.19, is applicable to the situation considered here, where both the outcome and the predictors are categorical. However, PC-SIS does not take information on the genetic model into consideration. As mentioned above, CATTs and the MAX test do use this information in the association analysis, but their screening properties have not yet been studied. To fill this gap, we propose feature screening methods based on CATTs under different genetic models and on the MAX test, and investigate their sure screening and ranking consistency properties.

1Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China. 2University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China. 3School of Mathematics and Statistics, Hubei Normal University, Huangshi 435002, People’s Republic of China. 4School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, People’s Republic of China. *email:
Scientific Reports | (2023) 13:9139 | https://doi.org/10.1038/s41598-023-35929-4

The rest of the paper is organised as follows. In “Trend test”, we briefly describe the trend tests that can be used to evaluate the relationship between a binary variable and a genotype variable. “Independence screening procedure” introduces the independence screening procedures based on the adjusted trend test statistics, and presents their sure screening and ranking consistency properties.
Simulation studies are conducted in “Simulation studies”, and a case study on type 1 diabetes is presented in “Application to a real dataset”. A conclusion for this work is presented in “Conclusion”. All proofs of theorems are provided in the Supplemental Materials.

Trend test

CATT evaluates the association between a binary variable and a SNP, and is widely used in case-control genetic data analysis. Compared with the Pearson chi-square test, it makes use of the underlying genetic model. Its specific form is as follows. Suppose $r$ cases and $s$ controls are enrolled in the study. For a given SNP, the genotypes can be expressed as aa, Aa and AA, with A being a high-risk candidate allele. In the sample of cases, the counts of aa, Aa and AA are $r_0$, $r_1$ and $r_2$, respectively, and the corresponding counts in the control sample are $s_0$, $s_1$ and $s_2$. Thus we have $r = r_0 + r_1 + r_2$ and $s = s_0 + s_1 + s_2$. Denote $n = r + s$ and $n_i = r_i + s_i$ for $i = 0, 1, 2$. All these counts are displayed in Table 1. Then CATT can be written as

$$
Z = \frac{\sqrt{n}\,\sum_{i=0}^{2} X_i \left( s r_i - r s_i \right)}{\sqrt{r s \left[ n \sum_{i=0}^{2} X_i^2 n_i - \left( \sum_{i=0}^{2} X_i n_i \right)^2 \right]}}, \tag{1}
$$

where $(X_0, X_1, X_2)$ is a pre-defined genotype score vector. Note that the optimal score vector for CATT varies across dif (...truncated)