Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort

https://bioinformatics.oxfordjournals.org/content/28/2/229.full.pdf

Hua Wang (1), Feiping Nie (1), Heng Huang (1), Sungeun Kim (0), Kwangsik Nho (0), Shannon L. Risacher (0), Andrew J. Saykin (0), Li Shen (0), for the Alzheimer's Disease Neuroimaging Initiative

(0) Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 950 W. Walnut St, R2 E124F, Indianapolis, IN 46202, USA
(1) Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA

Associate Editor: Jeffrey Barrett

Motivation: Recent advances in high-throughput genotyping and brain imaging techniques enable new approaches to study the influence of genetic variation on brain structures and functions. Traditional association studies typically employ independent and pairwise univariate analysis, which treats single nucleotide polymorphisms (SNPs) and quantitative traits (QTs) as isolated units and ignores important underlying interacting relationships between the units. New methods are proposed here to overcome this limitation.

Results: Taking into account the interlinked structure within and between SNPs and imaging QTs, we propose a novel Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci for multiple disease-relevant QTs and apply it to a study in mild cognitive impairment and Alzheimer's disease. Built upon regression analysis, our model uses a new form of regularization, group ℓ2,1-norm (G2,1-norm), to incorporate the biological group structures among SNPs induced from their genetic arrangement. The new G2,1-norm considers the regression coefficients of all the SNPs in each group with respect to all the QTs together and enforces sparsity at the group level. In addition, an ℓ2,1-norm regularization is utilized to couple feature selection across multiple tasks to make use of the shared underlying mechanism among different brain regions.
The effectiveness of the proposed method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected SNP predictors relevant to the imaging QTs.

Availability: Software is publicly available at: http://ranger.uta.edu/%7eheng/imaging-genetics/

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

1 INTRODUCTION

Imaging genetics is an emergent transdisciplinary research field, where the associations between genetic variations and imaging measures as quantitative traits (QTs) or continuous phenotypes are evaluated. Compared to case-control status, the QTs have increased statistical power and are closer to the underlying biological etiology of the disease, making it easier to identify underlying genes (Braskie et al., 2011; Potkin et al., 2009; Shen et al., 2010; Stein et al., 2010; Yip and Lange, 2011; Zhan et al., 2011). Genome-wide association studies (GWAS) have been increasingly performed to correlate high-throughput single nucleotide polymorphism (SNP) data to large-scale image data. While many studies employed a hypothesis-driven approach by making a significant reduction in one or both data types (Glahn et al., 2007), some recent studies examined these associations at the whole-genome, entire-brain level (Shen et al., 2010; Stein et al., 2010).
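As a concrete reading of the two regularizers named in the abstract, the following is a minimal sketch, not the authors' released software. It assumes the usual definitions: the ℓ2,1-norm sums the Euclidean norms of the rows of a coefficient matrix W (one row per SNP, one column per QT), and the G2,1-norm sums, over SNP groups, the Frobenius norm of the rows in each group. The function names, the toy matrix and the grouping are all hypothetical.

```python
import math


def l21_norm(W):
    """Sum over rows (SNPs) of the Euclidean norm of each row taken across tasks (QTs)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def group_l21_norm(W, groups):
    """Sum over groups of the Frobenius norm of the sub-matrix formed by each group's rows."""
    total = 0.0
    for rows in groups:
        total += math.sqrt(sum(w * w for i in rows for w in W[i]))
    return total


# Hypothetical coefficient matrix: 4 SNPs (rows) x 2 QTs (columns).
W = [
    [3.0, 4.0],
    [0.0, 12.0],
    [0.0, 0.0],
    [0.0, 0.0],
]

# Hypothetical grouping, e.g. SNPs 0-1 in one gene or LD block, SNPs 2-3 in another.
groups = [[0, 1], [2, 3]]

row_penalty = l21_norm(W)                  # 5 + 12 + 0 + 0 = 17.0
group_penalty = group_l21_norm(W, groups)  # sqrt(9 + 16 + 144) + 0 = 13.0

print(row_penalty, group_penalty)
```

Under these assumed definitions, a group whose coefficients are all zero (SNPs 2 and 3 here) contributes nothing to the group penalty, which is how minimizing such a term can switch off whole groups of SNPs at once, while the row-wise term treats each SNP's coefficients across all QTs as a single unit.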
Pairwise univariate analysis was typically used in traditional association studies to quickly provide important association information between SNPs and QTs. However, it treated the SNPs and the QTs as independent and isolated units, and therefore the underlying interacting relationships between the units might be lost. Multivariate methods to examine the joint effect of multilocus genotypes on a single phenotype were studied in general genetic association studies (Ballard et al., 2010; Wu et al., 2010) as well as several recent imaging genetic studies (Bralten et al., 2011; Hibar et al., 2011). This paradigm did not consider the relationship between interlinked brain phenotypes and thus still had limited power in revealing complex imaging genetic associations.

In this work, taking into account the interrelated structure within and between SNPs and QTs, we propose a new framework for effectively identifying quantitative trait loci, which addresses the following challenges in imaging genetics association studies. First, traditional association studies consider all the SNPs evenly distributed and assess each SNP individually. However, certain SNPs are naturally connected via different pathways. Multiple SNPs from one gene often jointly carry out genetic functionalities. Moreover, linkage disequilibrium (LD) (Barrett et al., 2005) describes the nonrandom association between alleles at different loci, through which the SNPs in high LD are linked together in meiosis. Thus, instead of treating SNPs in an isolated manner, it would be beneficial to exploit the group structure among SNPs. Second, because the functionality of the human brain typically involves more than one cerebral component, investigating each individual regional brain phenotype separately will inevitably lose the interacting relationships between them.
For example, the brain's episodic memory network, including medial temporal lobe (MTL) structures, medial and lateral parietal, and prefrontal cortical areas, is normally engaged as a whole during episodic recall (Walhovd et al., 2010). In addition, accurate prediction of disease status and progression typically relies on multiple brain regions coupled with other biomarkers (Hinrichs et al., 2011; Zhang et al., 2011). Therefore, jointly analyzing all the imaging phenotypes via one single integral regression model is desirable to elucidate the shared mechanism that may be hidden otherwise.

By recognizing the interrelated nature of these genotypes and phenotypes, in this study, we propose a novel Group-Sparse Multitask Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci in a mild cognitive impairment (MCI) and Alzheimer's disease (AD) study using a few important imaging QTs relevant to AD. We consider each SNP as a feature and each QT as a response variable (i.e. a learning task), and formulate a multitask regression framework including multiple features (SNPs) and multiple responses (QTs). Our goal is to reveal the relationships between these genetic features and imaging phenotypes. The proposed model consists of three major components. First, it is built upon regression analysis due to the continuous responses (...truncated)