Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP Data
Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM (2008) Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP
Data. PLoS Genet 4(10): e1000231. doi:10.1371/journal.pgen.1000231
Sang Hong Lee
Julius H. J. van der Werf
Ben J. Hayes
Michael E. Goddard
Peter M. Visscher
Bret A. Payseur, University of Wisconsin, Madison, United States of America
1 School of Environmental and Rural Science, University of New England, Armidale, New South Wales, Australia, 2 National Institute of Animal Science, Rural Development Administration, Cheon An, Korea, 3 Department of Primary Industry, Victoria, Australia, 4 Faculty of Land and Food Resources, University of Melbourne, Melbourne, Australia, 5 Queensland Institute of Medical Research, Brisbane, Australia
Genome-wide association studies (GWAS) for quantitative traits and disease in humans and other species have shown that there are many loci that contribute to the observed resemblance between relatives. GWAS to date have mostly focussed on discovery of genes or regulatory regions harbouring causative polymorphisms, using single-SNP analyses and setting stringent type-I error rates. Genome-wide marker data can also be used to predict genetic values and therefore predict phenotypes. Here, we propose a Bayesian method that utilises all marker data simultaneously to predict phenotypes. We apply the method to three traits: coat colour, %CD8 cells, and mean cell haemoglobin, measured in a heterogeneous stock mouse population. We find that a model that contains both additive and dominance effects, estimated from genome-wide marker data, is successful in predicting unobserved phenotypes and is significantly better than a prediction based upon the phenotypes of close relatives. Correlations between predicted and actual phenotypes were in the range of 0.4 to 0.9 when half of the families were used to estimate effects and the other half for prediction. Posterior probabilities of SNPs being associated with coat colour were high for regions that are known to contain loci for this trait. The prediction of phenotypes using large samples, high-density SNP data, and appropriate statistical methodology is feasible and can be applied in human medicine, forensics, or artificial selection programs.
Results from linkage analyses and, more recently, genome-wide
association studies (GWAS) imply that a large number of loci
underlie the genetic architecture of complex traits [1–15]. GWAS
are usually multi-staged, have mostly focused on gene discovery
and typically set very stringent type-I error rates in the first stage to
avoid false positives. Analysis is most frequently performed one
SNP at a time. Consequently, these studies may not properly
capture all of the genetic variation that is present in the samples.
The initial wave of GWAS has found many genetic variants that
are robustly associated with disease or quantitative traits, but these
variants typically explain only a small fraction of the genetic
variance, and so the utility of predictions made using this
information can be limited.
An alternative to gene discovery is to focus on the prediction of
phenotypes using all genotypic (SNP) information across the whole
genome simultaneously. The prediction of phenotypes is useful in
a range of fields, from artificial selection programs [16] to risk
prediction in human medicine [17] and forensics. To predict
phenotypes, identification or genotyping of causal variants is not
necessary, as long as there are variants genotyped that are in
linkage disequilibrium (LD) with the causal variants [16,17].
To predict phenotypes from genomic data, the relationship
between genome-wide marker data and phenotypes needs to be
modeled. The single SNP regression approach that is often applied
in conjunction with stringent thresholds would be expected to
inaccurately estimate the proportion of variance that can be
explained from genotypic data. Instead, model selection
approaches are required to find the set of SNPs that best explains and
predicts variation in phenotype. Such approaches have already
been proposed for mapping multiple quantitative trait loci (QTL)
[18–23], and recently a method was suggested for the simultaneous
analysis of all SNPs in a GWAS [24].
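The contrast between the two strategies can be written schematically. The notation below is an assumption made for illustration and is not taken from the model definitions of this paper: y_i is the phenotype of individual i, x_ij a genotype covariate for SNP j, b_j the SNP effect, e_i a residual, and \delta_j a hypothetical indicator of whether SNP j is included in the model.

```latex
% Single-SNP analysis: SNP j is tested on its own, for j = 1, ..., m
y_i = \mu + x_{ij} b_j + e_i

% Multi-SNP analysis with model selection: all m SNPs are considered jointly,
% and \delta_j \in \{0, 1\} indicates whether SNP j is in the model
y_i = \mu + \sum_{j=1}^{m} \delta_j x_{ij} b_j + e_i
```

In the second form, choosing the set of SNPs amounts to choosing the vector of indicators, which is the model selection problem referred to above.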
In this study, we use statistical modeling to fit multiple SNP
effects from a GWAS and derive the best model with a Bayesian
model selection approach termed Reversible Jump Markov Chain
Monte Carlo (RJMCMC) [25]. We predict unobserved phenotypes
for individuals based on genome-wide SNP data only, family
information (without genetic data) only, or on a combination of
the two.
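The evaluation idea, estimating effects in one set of families and correlating predicted with observed phenotypes in the remaining families, can be sketched as follows. This is a minimal illustration on simulated toy data, not the analysis of this study: the function names are hypothetical, and the estimator (one least-squares slope per SNP, summed into a score) is a crude placeholder that stands in for the Bayesian RJMCMC method.

```python
import random


def split_by_family(family_ids, rng):
    """Put half of the families in the training set, the rest in validation."""
    families = sorted(set(family_ids))
    rng.shuffle(families)
    train_fams = set(families[: len(families) // 2])
    train = [i for i, f in enumerate(family_ids) if f in train_fams]
    valid = [i for i, f in enumerate(family_ids) if f not in train_fams]
    return train, valid


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def marginal_slopes(X, y):
    """Placeholder estimator (an assumption): one least-squares slope per SNP."""
    my = mean(y)
    slopes = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = mean(col)
        sxx = sum((x - mx) ** 2 for x in col)
        sxy = sum((x - mx) * (yi - my) for x, yi in zip(col, y))
        slopes.append(sxy / sxx if sxx > 0 else 0.0)
    return slopes


# Simulated toy data: 20 families of 10, 20 SNPs, the first 5 SNPs are causal.
rng = random.Random(1)
n_fam, fam_size, n_snp = 20, 10, 20
family_ids = [f for f in range(n_fam) for _ in range(fam_size)]
X = [[rng.choice((0, 1, 2)) for _ in range(n_snp)] for _ in family_ids]
y = [sum(row[:5]) + rng.gauss(0.0, 1.0) for row in X]

# Estimate SNP effects in the training families only.
train, valid = split_by_family(family_ids, rng)
b = marginal_slopes([X[i] for i in train], [y[i] for i in train])

# Predict phenotypes in the validation families and compare with observations.
y_hat = [sum(bj * xj for bj, xj in zip(b, X[i])) for i in valid]
r = pearson(y_hat, [y[i] for i in valid])
print(f"correlation between predicted and observed: {r:.2f}")
```

Splitting by family rather than by individual keeps close relatives out of the training set, so the correlation reflects prediction from marker data alone.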
Author Summary
Results from recent genome-wide association studies
indicate that for most complex traits, there are many loci
that contribute to variation in observed phenotype and
that the effect of a single variant (single nucleotide
polymorphism, SNP) on a phenotype is small. Here, we
propose a method that combines the effects of multiple
SNPs to make a prediction of a phenotype that has not
been observed. We apply the method to data on mice,
using phenotypic and genomic data from some individuals
to predict phenotypes in other, either related or unrelated,
individuals. We find that correlations between predicted
and actual phenotypes are in the range of 0.4 to 0.9. The
method also shows that the SNPs used in the prediction
appear in regions that are known to contain genes
associated with the traits studied. The prediction of
unobserved phenotypes from high-density SNP data and
appropriate statistical methodology is feasible and can be
applied in human medicine, forensics, or artificial breeding
programs.
Data
Publicly available data including pedigree, genotypic and
phenotypic information on heterogeneous stock mice were used
([26]; http://gscan.well.ox.ac.uk/). The total number of animals
was 2,296 from 85 unrelated families. The available pedigree
spanned four generations, generating complex relationships. In the
last generation, there were 172 full-sib families with an average size of
~11 (SD ~8). Genotypes were available for 12,112 SNPs on most
animals in the pedigree, and we used the 11,730 SNPs on the
autosomal chromosomes. Phenotypes were already adjusted for the
environmental fixed effects, e.g. sex, age, year and season [26,27].
We chose three phenotypes: coat colour as a complex trait with a
number of known causal loci (estimated h² ≈ 0.72), percentage of
CD8+ cells (%CD8) as a quantitative trait with high heritability
(estimated h² ≈ 0.99), and mean cellular haemoglobin (MCH) as a
quantitative trait with moderate heritability (estimated h² ≈ 0.55).
Coat colour, as used here, is a measure of the darkness of the coat
from white to black. For more detail about the data, see [26,27].
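Before SNP genotypes such as these can enter a linear model with additive and dominance effects, they need a numeric coding. The sketch below shows one common convention and is an assumption, not necessarily the coding used in this study: the function name is hypothetical, genotypes are taken to be counts (0, 1 or 2) of an arbitrarily chosen reference allele, and missing genotypes are represented by None.

```python
def code_genotype(allele_count):
    """Return (additive, dominance) covariates for one SNP genotype.

    allele_count: copies of the reference allele (0, 1 or 2), or None if missing.
    Additive covariate: -1, 0, 1 for the three genotypes (assumed convention).
    Dominance covariate: 1 for the heterozygote, 0 for either homozygote.
    """
    if allele_count is None:
        return None  # missing genotype: left for the caller to handle
    if allele_count not in (0, 1, 2):
        raise ValueError(f"unexpected allele count: {allele_count!r}")
    additive = allele_count - 1
    dominance = 1 if allele_count == 1 else 0
    return additive, dominance


genotypes = [0, 1, 2, None, 1]
coded = [code_genotype(g) for g in genotypes]
print(coded)  # [(-1, 0), (0, 1), (1, 0), None, (0, 1)]
```

With this coding the dominance covariate captures any departure of the heterozygote from the midpoint of the two homozygotes.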
Models
We fitted a range of linear mixed models, with multiple SNPs as
fixed effects and, in some models, a polygenic effect to account for
additive genetic effects no (...truncated)