A comparison of statistical methods for genomic selection in a mice population
Research article Open Access
Haroldo HR Neves†1, Roberto Carvalheiro†2 and Sandra A Queiroz†1
†Contributed equally
BMC Genetics 2012, 13:100
https://doi.org/10.1186/1471-2156-13-100
© Neves et al.; licensee BioMed Central Ltd. 2012
Received: 14 August 2012. Accepted: 31 October 2012. Published: 8 November 2012
Abstract
Background
The availability of high-density panels of SNP markers has opened new perspectives for marker-assisted selection strategies, such that genotypes for these markers are used to predict the genetic merit of selection candidates. Because the number of markers is often much larger than the number of phenotypes, marker effect estimation is not a trivial task. The objective of this research was to compare the predictive performance of ten different statistical methods employed in genomic selection, by analyzing data from a heterogeneous stock mice population.
Results
For the five traits analyzed (W6W: weight at six weeks, WGS: growth slope, BL: body length, %CD8+: percentage of CD8+ cells, CD4+/CD8+: ratio between CD4+ and CD8+ cells), within-family predictions were more accurate than across-family predictions, although the size of this advantage varied markedly across traits. For within-family prediction, two kernel methods, Reproducing Kernel Hilbert Spaces Regression (RKHS) and Support Vector Regression (SVR), were the most accurate for W6W, while a polygenic model had comparable performance. A form of ridge regression assuming that all markers contribute to the additive variance (RR_GBLUP) was among the most accurate for WGS and BL, while two variable selection methods (LASSO and Random Forest, RF) had the greatest predictive abilities for %CD8+ and CD4+/CD8+. RF, RKHS, SVR and RR_GBLUP outperformed the remaining methods in terms of bias and inflation of predictions.
Conclusions
Methods with large conceptual differences reached very similar predictive abilities, and a clear re-ranking of methods was observed depending on the trait analyzed. Variable selection methods were more accurate than the other methods for %CD8+ and CD4+/CD8+, and these traits are likely to be influenced by a smaller number of QTL than the other traits. Judged by their overall performance across traits and computational requirements, RR_GBLUP, RKHS and SVR are particularly appealing for application in genomic selection.
Keywords
Kernel regression, LASSO, Random forest, Ridge regression, SNP, Subset selection
Background
The availability of high-density panels of single nucleotide polymorphisms (SNP) containing thousands of markers has opened new perspectives for the study of complex diseases and has enhanced marker-assisted selection strategies in animal and plant breeding.
The possibility of accurately predicting the genetic merit of selection candidates from their genotypes for SNP markers, a process known as genomic selection [1], is revolutionizing breeding schemes. The rationale is that, whenever marker density is high enough, most QTL will be in high linkage disequilibrium (LD) with some markers, so that estimates of marker effects will lead to accurate predictions of genetic merit for a trait.
Nevertheless, the amount of information to be analyzed in this situation poses new statistical and computational challenges. Because the number of predictor variables (markers) is generally much higher than the number of observations (phenotypes), there are not enough degrees of freedom to estimate all marker effects simultaneously. This is aggravated by the fact that models may suffer from multicollinearity, especially because markers in close positions are expected to be highly correlated.
According to the review in [2], some of the alternatives that have been employed to overcome these issues are fitting markers as random effects (e.g. shrinkage estimation and Bayesian regression) or applying a dimensionality reduction technique or machine learning method, although there is no consensus on the most appropriate method for genomic predictions. It has been argued that shrinkage methods with assumptions close to the infinitesimal model (i.e. GBLUP and its variants) are robust with respect to the underlying genetic architecture of the traits, while methods based on some sort of variable selection are more sensitive to the genetic background of traits [3, 4].
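To make the shrinkage idea concrete, the following is a minimal sketch of ridge regression on marker genotypes when markers outnumber phenotypes. It is a generic illustration, not the RR_GBLUP implementation used in this study: the function names, the toy data and the fixed penalty `lam` are all assumptions (in practice the penalty would typically be derived from estimated variance components). It relies on the identity (X'X + λI)⁻¹X'y = X'(XX' + λI)⁻¹y, so only an n × n system has to be solved.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_marker_effects(X, y, lam):
    """Ridge estimates of marker effects: beta = X'(XX' + lam*I)^-1 y.

    X: n x p list of lists (n animals, p markers), y: n phenotypes,
    lam: positive penalty (hypothetical fixed value, chosen by the caller).
    """
    n, p = len(X), len(X[0])
    K = [[sum(X[i][k] * X[j][k] for k in range(p)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)
    return [sum(X[i][k] * alpha[i] for i in range(n)) for k in range(p)]


# Toy example with more markers (3) than phenotypes (2).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]
beta = ridge_marker_effects(X, y, lam=1.0)
```

Without the penalty, X'X would be singular here and ordinary least squares would have no unique solution; the penalty makes the system solvable and shrinks all marker effects towards zero.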
There are still few extensive studies comparing the predictive performance of such methods in plants or in animals [5]. In the present study, we analyze a publicly available dataset that includes pedigree, genotypic and phenotypic information from a mice population. Although this dataset has been analyzed previously [6–8], we focus on a broader comparison of statistical methods employed for genomic prediction, by studying five traits that probably differ considerably in genetic architecture.
Thus, the objective of this research was to compare the predictive performance of ten different statistical methods employed in genomic selection by using data from a heterogeneous stock mice population, aiming to provide some insight into the range of statistical methods useful for genomic selection and into the interplay between the genetic background of traits and the performance of these methods.
Methods
Data
The data came from a heterogeneous stock mice population kept by the Wellcome Trust Centre for Human Genetics (WTCHG) (data are available at http://gscan.well.ox.ac.uk). Briefly, this population was generated from the crossing of eight inbred lines, followed by 50 generations of random mating. As a result, this population exhibits a high level of linkage disequilibrium, even for pairs of markers separated by up to 2 Mb [9]. With genotypic information from a panel of 11,558 SNP markers and an average inter-marker distance of 204 kb, the average r² between adjacent markers was about 0.62 [6]. This amount of LD enhanced QTL mapping for complex traits in mice [10] and should be equally helpful in the context of genomic selection; in addition, knowledge of the origin of this population could improve the interpretability of the results.
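As an illustration of the LD statistic quoted above, the sketch below computes the average r² between adjacent markers as the squared Pearson correlation of genotype codes (0, 1 or 2 copies of an allele). This genotype-based r² is an assumption made for simplicity; the text does not state how the value in [6] was obtained, and haplotype-based estimates may differ. Function names and the toy genotypes are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined for a monomorphic marker
    return cov * cov / (var_a * var_b)


def mean_adjacent_r2(markers):
    """Average r2 over consecutive marker pairs.

    markers: list of genotype vectors, ordered by map position
    within a single chromosome (one vector per marker).
    """
    values = [r_squared(markers[i], markers[i + 1])
              for i in range(len(markers) - 1)]
    values = [v for v in values if v == v]  # drop undefined (NaN) pairs
    return sum(values) / len(values)


# Toy example: three markers genotyped in four animals.
toy_markers = [[0, 1, 2, 1],
               [0, 1, 2, 1],
               [2, 1, 0, 1]]
toy_mean_r2 = mean_adjacent_r2(toy_markers)
```

In a real analysis the averaging would be done chromosome by chromosome, so that the last marker of one chromosome is not paired with the first marker of the next.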
Only animals with both genotypes and phenotypes were considered; details of sampling and genotyping are described in Valdar et al. [11]. The raw data included genotypes for 12,226 SNP markers located on autosomes of 1,940 animals. Data were edited such that only polymorphic markers with MAF ≥ 5% and with no evidence of departure from Hardy-Weinberg equilibrium were considered in analyses.
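The marker editing described above can be sketched as follows. This is only an illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for missing calls, Hardy-Weinberg equilibrium is checked with a chi-square goodness-of-fit test (1 degree of freedom), and the significance threshold `hwe_alpha` is a hypothetical value, since the threshold actually used is not given in the text.

```python
import math


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded 0/1/2; None marks a missing call."""
    obs = [g for g in genotypes if g is not None]
    if not obs:
        return 0.0
    p = sum(obs) / (2 * len(obs))
    return min(p, 1 - p)


def hwe_pvalue(genotypes):
    """P-value of a chi-square (1 df) test for Hardy-Weinberg proportions."""
    obs = [g for g in genotypes if g is not None]
    n = len(obs)
    counts = [obs.count(k) for k in (0, 1, 2)]
    p = (counts[1] + 2 * counts[2]) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:
        return 1.0  # monomorphic marker: the test is not informative
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def keep_marker(genotypes, min_maf=0.05, hwe_alpha=1e-6):
    """True if the marker passes both filters (hwe_alpha is an assumed value)."""
    return (minor_allele_frequency(genotypes) >= min_maf
            and hwe_pvalue(genotypes) >= hwe_alpha)


# Toy markers, 100 animals each.
balanced = [0] * 25 + [1] * 50 + [2] * 25   # common allele, HWE proportions
rare = [0] * 98 + [1] * 2                   # MAF of 1%
no_hets = [0] * 50 + [2] * 50               # no heterozygotes at all
```

With these toy inputs, only `balanced` passes: `rare` fails the MAF filter and `no_hets` shows a strong departure from Hardy-Weinberg proportions.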
Missing genotypes (0.1%) were imputed using probabilistic PCA (PPCA, [12]). Although the accuracy of this procedure is slightly lower than that of other methods, computing time is much lower. In addition, the proportion of missing genotypes is small enough to neglect the effects of imputation. After data (...truncated)