Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort
Hua Wang (1), Feiping Nie (1), Heng Huang (1), Sungeun Kim (0), Kwangsik Nho (0), Shannon L. Risacher (0), Andrew J. Saykin (0), Li Shen (0), for the Alzheimer's Disease Neuroimaging Initiative (0)
(0) Department of Radiology and Imaging Sciences, Indiana University School of Medicine, 950 W. Walnut St, R2 E124F, Indianapolis, IN 46202, USA
(1) Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
Associate Editor: Jeffrey Barrett
Motivation: Recent advances in high-throughput genotyping and brain imaging techniques enable new approaches to study the influence of genetic variation on brain structures and functions. Traditional association studies typically employ independent and pairwise univariate analysis, which treats single nucleotide polymorphisms (SNPs) and quantitative traits (QTs) as isolated units and ignores important underlying interacting relationships between the units. New methods are proposed here to overcome this limitation.
Results: Taking into account the interlinked structure within and between SNPs and imaging QTs, we propose a novel Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci for multiple disease-relevant QTs and apply it to a study in mild cognitive impairment and Alzheimer's disease. Built upon regression analysis, our model uses a new form of regularization, group ℓ2,1-norm (G2,1-norm), to incorporate the biological group structures among SNPs induced from their genetic arrangement. The new G2,1-norm considers the regression coefficients of all the SNPs in each group with respect to all the QTs together and enforces sparsity at the group level. In addition, an ℓ2,1-norm regularization is utilized to couple feature selection across multiple tasks to make use of the shared underlying mechanism among different brain regions. The effectiveness of the proposed method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected SNP predictors relevant to the imaging QTs.
Availability: Software is publicly available at: http://ranger.uta.edu/%7eheng/imaging-genetics/
Contact: ;
To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
1 INTRODUCTION
Imaging genetics is an emergent transdisciplinary research field,
where the associations between genetic variations and imaging
measures as quantitative traits (QTs) or continuous phenotypes are
evaluated. Compared to case-control status, the QTs have increased
statistical power and are closer to the underlying biological etiology
of the disease, making it easier to identify underlying genes (Braskie
et al., 2011; Potkin et al., 2009; Shen et al., 2010; Stein et al.,
2010; Yip and Lange, 2011; Zhan et al., 2011). Genome-wide
association studies (GWAS) have been increasingly performed to
correlate high-throughput single nucleotide polymorphism (SNP)
data to large-scale image data. While many studies employed a
hypothesis-driven approach by substantially reducing one
or both data types (Glahn et al., 2007), some recent studies examined
these associations at the whole-genome, entire-brain level (Shen
et al., 2010; Stein et al., 2010). Pairwise univariate analysis was
typically used in traditional association studies to quickly provide
important association information between SNPs and QTs. However,
it treated the SNPs and the QTs as independent and isolated units, and
therefore the underlying interacting relationships between the units
might be lost. Multivariate methods to examine the joint effect of
multilocus genotypes on a single phenotype were studied in general genetic
association studies (Ballard et al., 2010; Wu et al., 2010) as well as
several recent imaging genetic studies (Bralten et al., 2011; Hibar
et al., 2011). This paradigm did not consider the relationship between
interlinked brain phenotypes and thus still had limited power in
revealing complex imaging genetic associations. In this work, taking
into account the interrelated structure within and between SNPs
and QTs, we propose a new framework for effectively identifying
quantitative trait loci, which addresses the following challenges in
imaging genetics association studies.
First, traditional association studies consider all the SNPs evenly
distributed and assess each SNP individually. However, certain SNPs
are naturally connected via different pathways. Multiple SNPs from
one gene often jointly carry out genetic functionalities. Moreover,
linkage disequilibrium (LD) (Barrett et al., 2005) describes the
nonrandom association between alleles at different loci, through which
the SNPs in high LD are linked together in meiosis. Thus, instead
of treating SNPs in an isolated manner, it would be beneficial to
exploit the group structure among SNPs.
Second, because the functionality of the human brain typically
involves more than one cerebral component, investigating each
individual regional brain phenotype separately will inevitably lose
the interacting relationships between them. For example, the components
of the brain's episodic memory network, including medial temporal lobe (MTL)
structures, medial and lateral parietal, and prefrontal cortical areas,
are normally engaged together during episodic recall (Walhovd
et al., 2010). In addition, accurate prediction of disease status
and progression typically relies on multiple brain regions
coupled with other biomarkers (Hinrichs et al., 2011; Zhang et al.,
2011). Therefore, jointly analyzing all the imaging phenotypes via
one single integral regression model is desirable to elucidate the
shared mechanism that may be hidden otherwise.
By recognizing the interrelated nature of these genotypes and
phenotypes, in this study, we propose a novel Group-Sparse
Multitask Regression and Feature Selection (G-SMuRFS) method to
identify quantitative trait loci in a mild cognitive impairment (MCI)
and Alzheimer's disease (AD) study using a few important imaging
QTs relevant to AD. We consider each SNP as a feature and each
QT as a response variable (i.e. a learning task), and formulate a
multitask regression framework including multiple features (SNPs)
and multiple responses (QTs). Our goal is to reveal the relationships
between these genetic features and imaging phenotypes.
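
To make the setup concrete, the following is a minimal NumPy sketch of the
two regularizers described in the abstract, not the authors' released
software. It assumes a coefficient matrix W with one row per SNP and one
column per QT, and SNP groups given as lists of row indices. The squared-error
loss, the samples-by-SNPs layout of X, and the weight names gamma1 and gamma2
are assumptions made here for illustration only.

```python
import numpy as np


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (one row per SNP).

    A row is zero only if the SNP is unused for every QT, so penalizing
    this quantity couples feature selection across tasks.
    """
    return float(np.sum(np.linalg.norm(W, axis=1)))


def group_l21_norm(W, groups):
    """Sum over SNP groups of the Frobenius norm of each group's block.

    `groups` is a list of row-index lists (e.g. SNPs of one gene or one
    LD block); each block spans all QT columns of W.
    """
    return float(sum(np.linalg.norm(W[idx, :]) for idx in groups))


def objective(X, Y, W, groups, gamma1, gamma2):
    """Assumed objective: squared error plus the two weighted penalties.

    X: (n_samples, n_snps), Y: (n_samples, n_qts), W: (n_snps, n_qts).
    """
    residual = X @ W - Y
    loss = float(np.sum(residual ** 2))
    return loss + gamma1 * group_l21_norm(W, groups) + gamma2 * l21_norm(W)


# Hypothetical toy example: 4 SNPs in 2 groups, 2 QTs.
W_demo = np.array([[3.0, 0.0],
                   [0.0, 4.0],
                   [0.0, 0.0],
                   [0.0, 0.0]])
groups_demo = [[0, 1], [2, 3]]
print(l21_norm(W_demo), group_l21_norm(W_demo, groups_demo))
```

In the toy example the row-wise penalty charges the two active SNPs
separately (3 + 4), whereas the group penalty charges their shared group once
(the Frobenius norm of the block, 5) and the all-zero second group nothing,
which is the sense in which sparsity is enforced at the group level.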
The proposed model consists of three major components.
First, it is built upon regression analysis due to the continuous
responses (...truncated)