Screening properties of trend tests in genetic association studies
www.nature.com/scientificreports
Zhenzhen Jiang 1,2, Hongping Guo 3 & Jinjuan Wang 4*
In genome-wide association studies, extracting disease-associated genetic variants from among millions of
single nucleotide polymorphisms is of great importance. When the response is a binary variable, the
Cochran-Armitage trend tests and the associated MAX test are among the most widely used methods
for association analysis. However, theoretical guarantees for applying these methods to variable
screening have not been established. To fill this gap, we propose screening procedures based on adjusted
versions of these methods and prove their sure screening and ranking consistency
properties. Extensive simulations are conducted to compare the performance of different screening
procedures and demonstrate the robustness and efficiency of the MAX test-based screening procedure. A
case study on a type 1 diabetes dataset further verifies their effectiveness.
With the development of high-throughput sequencing techniques, hundreds of thousands of single nucleotide
polymorphisms (SNPs) in the genome can be recorded, which enables researchers to investigate and treat diseases
from the perspective of genetic variants. To identify the disease-related genes or genetic markers among all these
SNPs, the genome-wide association study (GWAS) is a widely used strategy. To date, more than one hundred
thousand SNPs have been identified as being related to many traits1–7.
A typical GWAS tests the association between the phenotype and each SNP in turn, obtains
a series of test statistics or p-values, and selects the associated SNPs by comparing these statistics or p-values
with a given threshold. When the phenotype is binary, the Cochran-Armitage trend test (CATT)8 is commonly used
to detect the associated SNPs. It has been shown that when the underlying genetic model is known, the
commonly used ones being the recessive, additive and dominant models, CATT has an optimal form9,10. However, the
true genetic model is usually unknown and may be very complicated. For the sake of robustness, an omnibus
test called MAX was proposed11,12, which uses the maximum of the CATTs under different genetic models as a measure
of association. The asymptotic distribution of MAX is given in the work of Zheng et al.13. Since its introduction,
MAX has been widely used and investigated. Li et al.14 introduced a selection procedure based on the rank of
MAX. Kim et al.15 proposed an SNP selection method based on MAX and a penalized support vector machine
strategy.
Although CATTs and MAX have concise forms and are extensively used, the theoretical properties of their
applications to GWAS have not been investigated. To control the false discovery rate (FDR) in
GWAS, the Bonferroni correction and FDR control procedures, such as the Benjamini–Hochberg procedure, are
two widely used strategies. However, both assume that all SNPs are independent, which is inappropriate
because linkage disequilibrium usually exists among SNPs, and this may lead to related SNPs being missed. Given
these drawbacks, feature screening methods are sensible alternatives. Rather than selecting the associated SNPs
directly, feature screening approaches aim to first eliminate most of the irrelevant SNPs. After a screening
procedure, only a small number of SNPs remain and researchers can concentrate on them,
which saves much time and work.
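To make the screening idea concrete, the following is a minimal Python sketch, not the procedure analysed in this paper. It assumes that each SNP already has a marginal association statistic and that a hypothetical tuning parameter d fixes how many SNPs are retained; the function name screen_top_d and the SNP labels are illustrative only.

```python
def screen_top_d(statistics, d):
    """Return the labels of the d SNPs with the largest |statistic|.

    `statistics` maps a SNP label to its marginal association statistic;
    `d` (an assumed tuning parameter) is the number of SNPs to keep.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    # Rank SNPs by absolute statistic, largest first; ties keep input order.
    ranked = sorted(statistics, key=lambda snp: abs(statistics[snp]), reverse=True)
    return ranked[:d]


# Hypothetical statistics for four SNPs (made-up numbers for illustration).
example = {"snp_a": 0.4, "snp_b": -3.2, "snp_c": 2.1, "snp_d": -0.1}
kept = screen_top_d(example, 2)
```

Ranking by absolute value treats strong positive and strong negative associations alike; how many SNPs to retain is left to the analyst in this sketch.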
In the last few years, feature screening methods have been proposed for various situations. Fan and Lv16
first proposed a screening method called the sure independence screening approach for Gaussian response
and predictors under linear regression. Since then, the sure screening property, under which all the important
predictors are retained with high probability as the sample size goes to infinity, has been regarded as a feature screening
criterion. Many screening procedures have been developed for diverse models, such as the generalized linear
model17 and the additive model18, among others. Although many procedures can be directly applied to GWAS with
corresponding models and data types, only PC-SIS, proposed in the work of Huang et al.19, is applicable to the
considered situation where both the outcome and predictors are categorical. However, PC-SIS does not take
information on the genetic model into consideration. As mentioned above, CATTs and the MAX test do use this
information in association analysis, but their screening properties have not yet been studied. To fill this gap,
we propose feature screening methods based on CATTs under different genetic models and on the MAX test, and investigate
their sure screening and ranking consistency properties.
1Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, People’s Republic
of China. 2University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China. 3School of
Mathematics and Statistics, Hubei Normal University, Huangshi 435002, People’s Republic of China. 4School of
Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, People’s Republic of China. *email:
Scientific Reports | (2023) 13:9139 | https://doi.org/10.1038/s41598-023-35929-4
The rest of the paper is organised as follows. In “Trend test”, we briefly describe the trend tests that can be
used to evaluate the relationship between a binary variable and a genotype variable. “Independence screening
procedure” introduces the independence screening procedures based on the adjusted trend test statistics and
presents their sure screening and ranking consistency properties. Simulation studies are conducted in “Simulation
studies”, and a case study on type 1 diabetes is presented in “Application to a real dataset”. Conclusions
are given in “Conclusion”. All proofs of theorems are provided in the Supplemental Materials.
Trend test
CATT evaluates the association between a binary variable and a SNP, and is widely used in case-control genetic
data analysis. Compared with the Pearson chi-square test, it makes use of the underlying genetic model. Its specific
form is as follows. Suppose r cases and s controls are enrolled in the study. For a given SNP, the genotypes are
denoted by aa, Aa and AA, with A being the candidate high-risk allele. Among the cases, the
counts of aa, Aa and AA are r0, r1 and r2, respectively, and the corresponding counts among the controls are
s0, s1 and s2. Thus we have r = r0 + r1 + r2 and s = s0 + s1 + s2. Denote n = r + s and ni = ri + si for i = 0, 1, 2.
All these counts are displayed in Table 1. Then CATT can be written as
$$
Z = \frac{\sqrt{n}\,\sum_{i=0}^{2} X_i (s r_i - r s_i)}{\sqrt{r s \left[ n \sum_{i=0}^{2} X_i^2 n_i - \left( \sum_{i=0}^{2} X_i n_i \right)^2 \right]}}, \tag{1}
$$
where (X0 , X1 , X2 ) is a pre-defined genotype score vector. Note that the optimal score vector for CATT varies
across dif (...truncated)
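As a numerical illustration of Eq. (1), the following Python sketch computes Z from the genotype counts of a single SNP. The function names catt and max_stat are hypothetical, and the score vectors (0, 0, 1), (0, 1/2, 1) and (0, 1, 1) for the recessive, additive and dominant models are the conventional choices, assumed here rather than taken from the text above. The sketch covers the unadjusted statistic only, not the adjusted versions used for screening, and the example counts are made up.

```python
import math

# Conventional genotype scores (X0, X1, X2); an assumption of this sketch.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt(case_counts, control_counts, scores):
    """Trend statistic Z of Eq. (1) for counts ordered as (aa, Aa, AA)."""
    r_i, s_i = list(case_counts), list(control_counts)
    r, s = sum(r_i), sum(s_i)
    n = r + s
    n_i = [a + b for a, b in zip(r_i, s_i)]
    # Numerator: sqrt(n) * sum_i X_i (s r_i - r s_i)
    numerator = math.sqrt(n) * sum(
        x * (s * a - r * b) for x, a, b in zip(scores, r_i, s_i)
    )
    # Denominator: sqrt(r s [n sum_i X_i^2 n_i - (sum_i X_i n_i)^2])
    sum_x2 = sum(x * x * m for x, m in zip(scores, n_i))
    sum_x = sum(x * m for x, m in zip(scores, n_i))
    inside = r * s * (n * sum_x2 - sum_x ** 2)
    if inside <= 0:
        raise ValueError("degenerate table: Z is undefined")
    return numerator / math.sqrt(inside)


def max_stat(case_counts, control_counts):
    """Largest |Z| over the three assumed genetic models."""
    return max(abs(catt(case_counts, control_counts, x)) for x in SCORES.values())


# Made-up counts of (aa, Aa, AA) among cases and controls.
cases, controls = (10, 20, 30), (30, 20, 10)
z_additive = catt(cases, controls, SCORES["additive"])
z_max = max_stat(cases, controls)
```

When cases and controls have identical genotype counts, every term s r_i − r s_i vanishes and Z is zero; under the dominant scores the three genotypes collapse into two groups and Z squared coincides with the Pearson chi-square statistic of the resulting 2 × 2 table.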