Computational strategies for alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals
Fernando et al. Genet Sel Evol (2016) 48:96
DOI 10.1186/s12711-016-0273-2
Genetics Selection Evolution
Open Access
RESEARCH ARTICLE
Rohan L. Fernando1*, Hao Cheng1, Bruce L. Golden2 and Dorian J. Garrick1,3
Abstract
Background: Two types of models have been used for single-step genomic prediction and genome-wide association studies that include phenotypes from both genotyped animals and their non-genotyped relatives. The two types
are breeding value models (BVM) that fit breeding values explicitly and marker effects models (MEM) that express the
breeding values in terms of the effects of observed or imputed genotypes. MEM can accommodate a wider class of
analyses, including variable selection or mixture model analyses. The order of the equations that need to be solved
and the inverses required in their construction vary widely, and thus the computational effort required depends upon
the size of the pedigree, the number of genotyped animals and the number of loci.
Theory: We present computational strategies to avoid storing large, dense blocks of the MME that involve imputed
genotypes. Furthermore, we present a hybrid model that fits a MEM for animals with observed genotypes and a BVM
for those without genotypes. The hybrid model is computationally attractive for pedigree files containing millions of
animals with a large proportion of those being genotyped.
Application: We demonstrate the practicality of both the original MEM and the hybrid model using real data with
6,179,960 animals in the pedigree, 4,934,101 phenotypes and 31,453 animals genotyped at 40,214 informative
loci. Using the hybrid model, a single-trait analysis on a desktop computer with four graphics cards took about 3 h
to obtain both preconditioned conjugate gradient solutions and 42,000 Markov chain Monte Carlo
(MCMC) samples of breeding values, which allowed inferences to be made from posterior means, variances and covariances. With the hybrid model, MCMC sampling required one quarter of the effort needed with the
published MEM.
Conclusions: We present a hybrid model that fits a MEM for animals with genotypes and a BVM for those without
genotypes. Its practicality and considerable reduction in computing effort were demonstrated. This model can readily
be extended to accommodate multiple traits, multiple breeds, maternal effects, and additional random effects such
as polygenic residual effects.
*Correspondence:
1 Department of Animal Science, Iowa State University, Ames, IA 50011, USA
Full list of author information is available at the end of the article
Background
Two types of equivalent mixed linear models are used
for whole-genome analyses in livestock [1]. The first
type, which we refer to as marker effects models (MEM),
includes random effects (α) of marker genotype covariates (Mg) in the model [2, 3]. The second type, which we
refer to as breeding value models (BVM), includes the
breeding values of the animals, ug = Mg α, as a random
effect that has a covariance computed from Mg [1, 2, 4–6]
rather than from the pedigree.
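The link between the two model types can be illustrated with a minimal sketch. It assumes independent marker effects with mean zero and common variance σ², in which case the covariance of ug = Mg α is σ² Mg Mg'. The toy matrix `M`, the value of `sigma`, and the two-point prior (each effect equal to −σ or +σ with equal probability) are hypothetical choices made only so that the covariance can be enumerated exactly; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy genotype covariates: 2 animals (rows) x 3 markers (columns).
M = [[1, 0, 2],
     [0, 1, 1]]
sigma = Fraction(1, 2)  # assumed standard deviation of every marker effect

# Enumerate every equally likely draw of alpha from the two-point prior.
# Each effect has mean 0 and variance sigma**2, and effects are independent.
alpha_draws = list(product((-sigma, sigma), repeat=len(M[0])))

# Breeding values implied by each draw: u = M alpha.
u_draws = [[sum(m * a for m, a in zip(row, alpha)) for row in M]
           for alpha in alpha_draws]

n = len(M)
# Exact covariance of u over all draws (u has mean zero, so no centering).
cov_u = [[sum(u[i] * u[j] for u in u_draws) / len(u_draws) for j in range(n)]
         for i in range(n)]

# Covariance "computed from M": sigma^2 * M M'.
expected = [[sigma ** 2 * sum(a * b for a, b in zip(M[i], M[j]))
             for j in range(n)] for i in range(n)]

print(cov_u == expected)
```

The same identity holds for any prior with independent, zero-mean effects of equal variance, including the normal prior; only the second moments matter here.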
© The Author(s) 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/
publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
It was shown that the BVM can be adapted for what
is known as single-step genomic best linear unbiased
prediction (SS-GBLUP) that combines information
from animals with genotypes and from those without genotypes in a single BLUP analysis [7–9]. However, the SS-GBLUP analysis requires computing the
inverse of G, which is the matrix of genomic relationships of the animals with genotypes [8, 9]. When the
number Ng of genotyped animals exceeds the number
of markers, G is singular, but a full-rank matrix such as
G* = 0.95G + 0.05A, with A being the pedigree-based
relationship matrix, might be used in its place. Single-step analyses based on the MEM do not require computing G or its inverse [10]. Furthermore, Bayesian
regression analyses based on the MEM are not limited
to assuming a normal prior for α, which is implicit in
SS-GBLUP; Bayesian regression models can accommodate various priors including the t distribution
as in BayesA [3, 11], the double exponential distribution as in Bayesian LASSO [12] or mixtures of the t
distribution or the normal distribution [3, 11, 13] as
in BayesB or BayesC. However, the MME that correspond to single-step MEM (SS-MEM) types of models
contain dense blocks that correspond to the imputed
genotypes of animals with missing genotypes [10], and
those blocks can be large if many animals have missing
genotypes.
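The singularity of G and the effect of blending it with A can be checked on a toy example. The sketch below assumes an unscaled G = M M' (in practice G is usually centered and scaled, which does not change the rank argument), three genotyped animals with two markers, and a hypothetical pedigree of two unrelated, non-inbred parents and their offspring. Exact fractions are used so that the zero determinant is not blurred by rounding.

```python
from fractions import Fraction

def times_transpose(m):
    """Return m m' for a matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]

def det3(a):
    """Determinant of a 3 x 3 matrix by cofactor expansion along row 0."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

half = Fraction(1, 2)

# Hypothetical genotype covariates: 3 animals x 2 markers, so Ng > number of markers.
M = [[Fraction(1), Fraction(0)],
     [Fraction(0), Fraction(1)],
     [Fraction(1), Fraction(1)]]

# Assumed unscaled genomic relationship matrix; its rank is at most 2.
G = times_transpose(M)

# Pedigree relationships: animals 1 and 2 are unrelated parents of animal 3.
A = [[Fraction(1), Fraction(0), half],
     [Fraction(0), Fraction(1), half],
     [half, half, Fraction(1)]]

# Blended matrix G* = 0.95 G + 0.05 A from the text.
w_g, w_a = Fraction(95, 100), Fraction(5, 100)
G_star = [[w_g * G[i][j] + w_a * A[i][j] for j in range(3)] for i in range(3)]

det_G = det3(G)            # zero: G is singular
det_G_star = det3(G_star)  # positive: G* is invertible

print(det_G, det_G_star > 0)
```

Because G is positive semi-definite and A is positive definite, any blend with a positive weight on A is positive definite, which is why G* can be inverted even when G cannot.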
Liu et al. [14] developed a single-step method based on
the BVM with direct estimation of marker effects (SSME-GBLUP). An advantage of that method over SS-GBLUP is
that it does not require computing G or its inverse. Also,
their method can be used for Bayesian regression models [14]. However, the MME for SSME-GBLUP contains
expressions that involve the inverse of the pedigree-based
relationship matrix, Agg, for the animals with genotypes.
This is a dense matrix, and therefore a computational
strategy was proposed to avoid computing its inverse, but
it requires solving a dense system of equations of order
Ng within each round of Jacobi or preconditioned conjugate gradient (PCG) iteration for solution of the MME
or within each round of MCMC sampling for Bayesian inference with models such as BayesA or BayesB [3].
Equation (A1) in Legarra and Ducrocq [15] also presents
a set of similar MME with marker effects for genotyped
animals and breeding values for non-genotyped animals.
As with the MME in Liu et al. [14], the advantage of the
MME of Legarra and Ducrocq [15] is that they do not
require the computation of G or its inverse but require
computing the inverse of Agg . Recently, in some livestock
such as dairy cattle, Ng has increased towards a million
or more, and thus, solving a dense system of equations of
order Ng within each round of iteration will place a heavy
burden on SSME-GBLUP in computing time and storage
requirements.
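To make the cost argument concrete, the following is a generic textbook sketch of PCG with a Jacobi (diagonal) preconditioner on a small dense symmetric positive definite system. It is not the implementation of Liu et al. [14] or of this paper; the function name `pcg` and the toy system are hypothetical. The point to note is the dense matrix-vector product inside the loop, which costs on the order of n² operations per iteration for a system of order n.

```python
def pcg(a, b, tol=1e-10, max_iter=100):
    """Solve a x = b for symmetric positive definite a with Jacobi-preconditioned CG."""
    n = len(b)

    def matvec(v):
        # Dense product a v: about n * n multiplications per call.
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(x, y):
        return sum(p * q for p, q in zip(x, y))

    inv_diag = [1.0 / a[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                   # residual b - a x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = matvec(p)
        step = rz / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Hypothetical 3 x 3 system whose exact solution is [1, 2, 3].
a = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
solution = pcg(a, b)
print(solution)
```

If an inner dense system of order Ng had to be solved in this way within every outer round, both the storage of the dense matrix and the repeated products would grow roughly with the square of Ng, which is the burden described above.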
The objective of this paper is to present computational
strategies for whole-genome analyses based on the SS-MEM that avoid storing large, dense blocks of the MME
that involve imputed genotypes. First, we will show this
for the MME given in [10]. Second, we will present what
we refer to as a (...truncated)