A comparison of statistical methods for genomic selection in a mice population
Haroldo HR Neves
Roberto Carvalheiro
Sandra A Queiroz
Background: The availability of high-density panels of SNP markers has opened new perspectives for marker-assisted selection strategies, in which genotypes for these markers are used to predict the genetic merit of selection candidates. Because the number of markers is often much larger than the number of phenotypes, marker effect estimation is not a trivial task. The objective of this research was to compare the predictive performance of ten different statistical methods employed in genomic selection, by analyzing data from a heterogeneous stock mouse population.
Results: For the five traits analyzed (W6W: weight at six weeks, WGS: growth slope, BL: body length, %CD8+: percentage of CD8+ cells, CD4+/CD8+: ratio between CD4+ and CD8+ cells), within-family predictions were more accurate than across-family predictions, although this superiority in accuracy varied markedly across traits. For within-family prediction, two kernel methods, Reproducing Kernel Hilbert Spaces Regression (RKHS) and Support Vector Regression (SVR), were the most accurate for W6W, while a polygenic model had comparable performance. A form of ridge regression assuming that all markers contribute to the additive variance (RR_GBLUP) was among the most accurate for WGS and BL, while two variable selection methods (LASSO and Random Forest, RF) had the greatest predictive abilities for %CD8+ and CD4+/CD8+. RF, RKHS, SVR and RR_GBLUP outperformed the remaining methods in terms of bias and inflation of predictions.
Conclusions: Methods with large conceptual differences reached very similar predictive abilities, and a clear re-ranking of methods was observed depending on the trait analyzed. Variable selection methods were more accurate than the others for %CD8+ and CD4+/CD8+, and these traits are likely to be influenced by a smaller number of QTL than the other traits. Judged by their overall performance across traits and computational requirements, RR_GBLUP, RKHS and SVR are particularly appealing for application in genomic selection.
Background
The availability of high-density panels of single
nucleotide polymorphisms (SNP) containing thousands of
markers has opened new perspectives for the study of complex
diseases and has enhanced marker-assisted selection
strategies in animal and plant breeding.
The possibility of accurately predicting the genetic merit
of selection candidates from their genotypes for SNP
markers, a process known as genomic selection [1], is
revolutionizing breeding schemes. The rationale for this
process is that, whenever marker density is high enough,
most QTL will be in high linkage disequilibrium (LD)
with some markers, so that estimates of marker effects will
lead to accurate predictions of genetic merit for a trait.
Nevertheless, the amount of information to be analyzed
in this situation poses new statistical and computational
challenges. As the number of predictor variables (markers)
is generally much higher than the number of observations
(phenotypes), there are not enough degrees of freedom to
estimate all marker effects simultaneously. This problem is
aggravated by the fact that models may suffer from
multicollinearity, especially because markers in close
positions are expected to be highly correlated.
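The lack of degrees of freedom described above can be illustrated with a minimal sketch. It uses simulated genotypes, not the mouse data, and the function name, sample sizes and allele frequency are assumptions chosen only for illustration. When markers outnumber animals, two very different vectors of marker effects give exactly the same fitted values, so ordinary least squares cannot choose between them.

```python
import numpy as np

def rank_deficiency_demo(n_animals=50, n_markers=500, seed=0):
    """Show that least-squares marker effects are not unique when p > n."""
    rng = np.random.default_rng(seed)
    # Simulated genotypes coded as 0/1/2 allele copies (illustrative data only).
    X = rng.binomial(2, 0.3, size=(n_animals, n_markers)).astype(float)
    Xc = X - X.mean(axis=0)          # centre each marker column
    y = rng.normal(size=n_animals)   # arbitrary phenotypes

    # After centring, the rank cannot exceed n_animals - 1.
    rank = np.linalg.matrix_rank(Xc)

    # Minimum-norm least-squares solution.
    beta_min, *_ = np.linalg.lstsq(Xc, y, rcond=None)

    # The last right singular vector lies in the null space of Xc,
    # so adding any multiple of it leaves the fitted values unchanged.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=True)
    beta_alt = beta_min + 10.0 * Vt[-1]

    same_fit = bool(np.allclose(Xc @ beta_min, Xc @ beta_alt))
    distance = float(np.linalg.norm(beta_alt - beta_min))
    return rank, same_fit, distance

rank, same_fit, distance = rank_deficiency_demo()
print(rank, same_fit, distance)
```

The two solutions are far apart in the space of marker effects yet indistinguishable from the data, which is why some form of regularization or variable selection is needed.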
According to the review in [2], alternatives that have been
employed to overcome these issues include fitting
markers as random effects (e.g. shrinkage estimation and
Bayesian regression) or applying a dimensionality
reduction technique or machine learning method, although
there is no consensus on the most appropriate method for
genomic prediction. It has been argued that shrinkage
methods with assumptions close to the infinitesimal
model (i.e. GBLUP and its variants) are robust with
respect to the underlying genetic architecture of the
traits, while methods based on some sort of variable
selection are more sensitive to the genetic background of
traits [3,4].
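As a hedged illustration of shrinkage estimation, the sketch below fits all markers jointly by ridge regression on simulated genotypes. It is not the RR_GBLUP implementation used in this study. The function names, data and penalty value `lam` are assumptions, and in practice the penalty would be derived from variance components or cross-validation. The sketch also shows the standard algebraic identity that lets the marker-based solution be obtained from an n x n system instead of a p x p one, which is convenient when markers greatly outnumber animals.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Primal ridge solution (X'X + lam*I)^-1 X'y: a p x p linear system."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_marker_effects_dual(X, y, lam):
    """Equivalent dual form X'(XX' + lam*I)^-1 y: an n x n linear system."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Simulated, centred genotypes (illustrative only; not the mouse data).
rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(30, 200)).astype(float)
X -= X.mean(axis=0)
y = rng.normal(size=30)
lam = 10.0  # hypothetical penalty value

beta_primal = ridge_marker_effects(X, y, lam)
beta_dual = ridge_marker_effects_dual(X, y, lam)

# Predicted genetic merit is the sum of marker effects weighted by genotype.
predicted_merit = X @ beta_dual
print(np.allclose(beta_primal, beta_dual), predicted_merit.shape)
```

Both functions return the same marker effects; the dual form only changes which matrix is inverted.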
There are still few extensive studies comparing the
predictive performance of such methods in plants or
in animals [5]. In the present study, we analyze a
publicly available dataset that includes pedigree, genotypic and
phenotypic information on a mouse population. Although
this dataset has been analyzed previously
[6,7,8], we focus on a broader comparison of statistical
methods employed for genomic prediction, by studying
five traits that probably differ considerably in
genetic architecture.
Thus, the objective of this research was to compare
the predictive performance of ten different statistical
methods employed in genomic selection by using data
from a heterogeneous stock mouse population, aiming to
provide some insight into the scope of statistical methods
useful for genomic selection and into the interplay
between the genetic background of traits and the
performance of these methods.
Methods
Data
The data came from a heterogeneous stock mouse
population kept by the Wellcome Trust Centre for Human
Genetics (WTCHG) (data are available at
http://gscan.well.ox.ac.uk). Briefly, this population was generated
from the crossing of eight inbred lines, followed by 50
generations of random mating. As a result, this
population exhibits a high level of linkage disequilibrium, even
for pairs of markers separated by up to 2 Mb [9]. With
genotypic information obtained from a panel
of 11,558 SNP markers with an average inter-marker
distance of 204 kb, the average r² between adjacent markers
was about 0.62 [6]. This amount of LD enhanced QTL
mapping for complex traits in mice [10] and would be
equally helpful in the context of genomic selection.
In addition, knowledge of the origin of this
population could improve the interpretability of the results.
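To make the LD statistic concrete, the following minimal sketch computes r² between adjacent markers as the squared correlation of allele counts. This genotype-based r² is an assumption made for illustration: the source does not state how r² was computed in [6], and haplotype-based estimates can differ. The function name and the toy matrix are hypothetical, and markers are assumed to be polymorphic and ordered by map position.

```python
import numpy as np

def adjacent_r2(genotypes):
    """Squared correlation of allele counts between neighbouring markers.

    genotypes: (n_animals, n_markers) array coded 0/1/2, columns in map order.
    Assumes every marker is polymorphic (non-zero variance).
    """
    G = np.asarray(genotypes, dtype=float)
    Gc = G - G.mean(axis=0)                      # centre each marker
    sd = Gc.std(axis=0)                          # per-marker standard deviation
    cov = (Gc[:, :-1] * Gc[:, 1:]).mean(axis=0)  # covariance of neighbours
    r = cov / (sd[:-1] * sd[1:])
    return r ** 2

# Toy example: markers 1 and 2 are identical, markers 2 and 3 are uncorrelated.
toy = np.array([[0, 0, 0],
                [1, 1, 2],
                [2, 2, 0],
                [1, 1, 2]])
r2 = adjacent_r2(toy)
print(r2, r2.mean())
```

Averaging the returned vector over all adjacent pairs gives a summary comparable in spirit to the average r² quoted above.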
Only animals with both genotypes and phenotypes
were considered; details of sampling and genotyping
are described in Valdar et al. [11]. The raw data included
genotypes of 1,940 animals for 12,226 autosomal SNP
markers. Data were edited so that only
polymorphic markers that passed a minor allele frequency
(MAF) threshold of 5% and showed no evidence
of departure from Hardy-Weinberg equilibrium were
considered in the analyses.
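The marker editing step above can be sketched as follows. The source gives neither the Hardy-Weinberg test nor the significance level that was used, so the one-degree-of-freedom chi-square test, the cutoff of 3.841 (its 5% critical value) and the inclusive MAF comparison are all assumptions, as are the function names.

```python
import numpy as np

def minor_allele_frequency(G):
    """MAF per marker from genotypes coded as 0/1/2 allele copies."""
    p = np.asarray(G, dtype=float).mean(axis=0) / 2.0
    return np.minimum(p, 1.0 - p)

def hwe_chi2(G):
    """Chi-square statistic comparing observed and expected genotype counts."""
    G = np.asarray(G)
    n = G.shape[0]
    obs = np.stack([(G == k).sum(axis=0) for k in (0, 1, 2)]).astype(float)
    p = (obs[1] + 2.0 * obs[2]) / (2.0 * n)   # frequency of the counted allele
    q = 1.0 - p
    exp = n * np.stack([q ** 2, 2.0 * p * q, p ** 2])
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = ((obs - exp) ** 2 / exp).sum(axis=0)
    # Monomorphic markers give 0/0; treat them as failing the test.
    return np.nan_to_num(chi2, nan=np.inf)

def qc_mask(G, maf_min=0.05, chi2_max=3.841):
    """True for markers kept. Both thresholds are assumed, not from the paper."""
    return (minor_allele_frequency(G) >= maf_min) & (hwe_chi2(G) <= chi2_max)

# Toy data: marker 1 is in HWE, marker 2 is all heterozygotes, marker 3 is monomorphic.
toy = np.array([[0, 1, 0],
                [1, 1, 0],
                [1, 1, 0],
                [2, 1, 0]])
keep = qc_mask(toy)
print(keep)
```

With real data an exact test is often preferred for markers with small expected counts; the chi-square form is used here only because it is short.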
Missing genotypes (0.1%) were imputed using
probabilistic PCA (PPCA, [12]). Although the accuracy of this
procedure is slightly lower than that of other methods,
its computing time is much shorter. In addition, the
proportion of missing genotypes is small enough for the
effects of imputation to be negligible. After data editing, the
dataset used for marker effect estimation comprised
1,884 animals genotyped for 9,917 markers, representing 168
full-sib families with an average size of 11.
Five traits with quite different heritabilities were
analyzed: percentage of CD8+ cells (%CD8+, h² = 0.89),
ratio between CD4+ and CD8+ cells (CD4+/CD8+,
h² = 0.80), body weight at 6 weeks (W6W, h² = 0.74),
growth slope (WGS, h² = 0.30) and body length (BL, h² = 0.13)
[11]. Aim (...truncated)