Computational strategies for alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals


Fernando et al. Genet Sel Evol (2016) 48:96
DOI 10.1186/s12711-016-0273-2
Genetics Selection Evolution

RESEARCH ARTICLE, Open Access

Rohan L. Fernando¹*, Hao Cheng¹, Bruce L. Golden² and Dorian J. Garrick¹,³

Abstract

Background: Two types of models have been used for single-step genomic prediction and genome-wide association studies that include phenotypes from both genotyped animals and their non-genotyped relatives. The two types are breeding value models (BVM) that fit breeding values explicitly and marker effects models (MEM) that express the breeding values in terms of the effects of observed or imputed genotypes. MEM can accommodate a wider class of analyses, including variable selection or mixture model analyses. The order of the equations that need to be solved and the inverses required in their construction vary widely, and thus the computational effort required depends upon the size of the pedigree, the number of genotyped animals and the number of loci.

Theory: We present computational strategies to avoid storing large, dense blocks of the mixed model equations (MME) that involve imputed genotypes. Furthermore, we present a hybrid model that fits a MEM for animals with observed genotypes and a BVM for those without genotypes. The hybrid model is computationally attractive for pedigree files containing millions of animals with a large proportion of those being genotyped.

Application: We demonstrate the practicality of both the original MEM and the hybrid model using real data with 6,179,960 animals in the pedigree with 4,934,101 phenotypes and 31,453 animals genotyped at 40,214 informative loci.
Completing a single-trait analysis on a desktop computer with four graphics cards required about 3 h using the hybrid model to obtain both preconditioned conjugate gradient solutions and 42,000 Markov chain Monte Carlo (MCMC) samples of breeding values, which allowed inferences to be made from posterior means, variances and covariances. MCMC sampling with the hybrid model required one quarter of the effort needed with the published MEM.

Conclusions: We present a hybrid model that fits a MEM for animals with genotypes and a BVM for those without genotypes. Its practicality and considerable reduction in computing effort were demonstrated. This model can readily be extended to accommodate multiple traits, multiple breeds, maternal effects, and additional random effects such as polygenic residual effects.

*Correspondence: ¹ Department of Animal Science, Iowa State University, Ames, IA 50011, USA. Full list of author information is available at the end of the article.

© The Author(s) 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Two types of equivalent mixed linear models are used for whole-genome analyses in livestock [1]. The first type, which we refer to as marker effects models (MEM), includes random effects (α) of marker genotype covariates (M_g) in the model [2, 3]. The second type, which we refer to as breeding value models (BVM), includes the breeding values of the animals, u_g = M_g α, as a random effect that has a covariance computed from M_g [1, 2, 4–6] rather than from the pedigree. It was shown that the BVM can be adapted for what is known as single-step genomic best linear unbiased prediction (SS-GBLUP), which combines information from animals with genotypes and from those without genotypes in a single BLUP analysis [7–9]. However, the SS-GBLUP analysis requires computing the inverse of G, which is the matrix of genomic relationships of the animals with genotypes [8, 9]. When the number N_g of genotyped animals exceeds the number of markers, G is singular, but a full-rank matrix such as G* = 0.95G + 0.05A, with A being the pedigree-based relationship matrix, might be used in its place. Single-step analyses based on the MEM do not require computing G or its inverse [10]. Furthermore, Bayesian regression analyses based on the MEM are not limited to assuming a normal prior for α, which is implicit in SS-GBLUP; Bayesian regression models can accommodate various priors, including the t distribution as in BayesA [3, 11], the double exponential distribution as in Bayesian LASSO [12], or mixtures of the t distribution or the normal distribution [3, 11, 13] as in BayesB or BayesC. However, the mixed model equations (MME) that correspond to single-step MEM (SS-MEM) types of models contain dense blocks that correspond to the imputed genotypes of animals with missing genotypes [10], and those blocks can be large if many animals have missing genotypes.

Liu et al. [14] developed a single-step method based on the BVM with direct estimation of marker effects (SSME-GBLUP). An advantage of that method over SS-GBLUP is that it does not require computing G or its inverse. Also, their method can be used for Bayesian regression models [14]. However, the MME for SSME-GBLUP contain expressions that involve the inverse of the pedigree-based relationship matrix, A_gg, for the animals with genotypes.
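
The remark earlier in this section that G is singular when N_g exceeds the number of markers, while a blend such as G* = 0.95G + 0.05A is full rank, can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the toy dimensions, the random genotype covariates, the scaling of G by the number of markers, and the use of an identity matrix for A (unrelated, non-inbred animals) are all hypothetical choices and are not taken from the article.

```python
import numpy as np


def blended_relationship(G, A, w=0.95):
    """Return the blend G* = w*G + (1 - w)*A described in the text."""
    return w * G + (1.0 - w) * A


rng = np.random.default_rng(0)
n_animals, n_markers = 8, 3  # toy case: more genotyped animals than markers

# Hypothetical genotype covariates coded 0/1/2, centred by column.
M = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
M -= M.mean(axis=0)

# Assumed scaling for illustration only: G = M M' / (number of markers).
G = M @ M.T / n_markers

# Assumption: unrelated, non-inbred animals, so A is the identity matrix.
A = np.eye(n_animals)

G_star = blended_relationship(G, A)

rank_G = np.linalg.matrix_rank(G)            # cannot exceed n_markers
rank_G_star = np.linalg.matrix_rank(G_star)  # full rank after blending
min_eig = np.linalg.eigvalsh(G_star).min()   # bounded below by 0.05 here

print(rank_G, rank_G_star, min_eig)
```

Because G is a product of an N_g × (number of markers) matrix with its transpose, its rank cannot exceed the number of markers. Adding a positive multiple of a positive-definite A shifts every eigenvalue away from zero, which is why the blend can be inverted.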
A_gg is a dense matrix, and therefore a computational strategy was proposed to avoid computing its inverse, but it requires solving a dense system of equations of order N_g within each round of Jacobi or preconditioned conjugate gradient (PCG) iteration for solution of the MME, or within each round of MCMC sampling for Bayesian inference with models such as BayesA or BayesB [3]. Legarra and Ducrocq [15] also present, in their Equation (A1), a set of similar MME with marker effects for genotyped animals and breeding values for non-genotyped animals. As with the MME in Liu et al. [14], the advantage of the MME of Legarra and Ducrocq [15] is that they do not require the computation of G or its inverse, but they do require computing the inverse of A_gg. Recently, in some livestock such as dairy cattle, N_g has increased towards a million or more, and thus solving a dense system of equations of order N_g within each round of iteration will place a heavy burden on SSME-GBLUP in computing time and storage requirements.

The objective of this paper is to present computational strategies for whole-genome analyses based on the SS-MEM that avoid storing large, dense blocks of the MME that involve imputed genotypes. First, we will show this for the MME given in [10]. Second, we will present what we refer to as a (...truncated)