Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP Data

PLoS Genetics, Oct 2008

Genome-wide association studies (GWAS) for quantitative traits and disease in humans and other species have shown that there are many loci that contribute to the observed resemblance between relatives. GWAS to date have mostly focussed on discovery of genes or regulatory regions harbouring causative polymorphisms, using single SNP analyses and setting stringent type-I error rates. Genome-wide marker data can also be used to predict genetic values and therefore predict phenotypes. Here, we propose a Bayesian method that utilises all marker data simultaneously to predict phenotypes. We apply the method to three traits: coat colour, %CD8 cells, and mean cell haemoglobin, measured in a heterogeneous stock mouse population. We find that a model that contains both additive and dominance effects, estimated from genome-wide marker data, is successful in predicting unobserved phenotypes and is significantly better than a prediction based upon the phenotypes of close relatives. Correlations between predicted and actual phenotypes were in the range of 0.4 to 0.9 when half of the families were used to estimate effects and the other half for prediction. Posterior probabilities of SNPs being associated with coat colour were high for regions that are known to contain loci for this trait. The prediction of phenotypes using large samples, high-density SNP data, and appropriate statistical methodology is feasible and can be applied in human medicine, forensics, or artificial selection programs.

Citation: Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM (2008) Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP Data. PLoS Genet 4(10): e1000231. doi:10.1371/journal.pgen.1000231

Authors: Sang Hong Lee, Julius H. J. van der Werf, Ben J. Hayes, Michael E. Goddard, Peter M. Visscher

Affiliations: 1 School of Environmental and Rural Science, University of New England, Armidale, New South Wales, Australia; 2 National Institute of Animal Science, Rural Development Administration, Cheon An, Korea; 3 Department of Primary Industry, Victoria, Australia; 4 Faculty of Land and Food Resources, University of Melbourne, Melbourne, Australia; 5 Queensland Institute of Medical Research, Brisbane, Australia

Editor: Bret A. Payseur, University of Wisconsin, Madison, United States of America

Results from linkage analyses and, more recently, genome-wide association studies (GWAS) imply that a large number of loci underlie the genetic architecture of complex traits [1–15]. GWAS are usually multi-staged, have mostly focused on gene discovery, and typically set very stringent type-I error rates in the first stage to avoid false positives. Analysis is most frequently performed one SNP at a time. Consequently, these studies may not properly capture all of the genetic variation that is present in the samples. The initial wave of GWAS has found many genetic variants that are robustly associated with disease or quantitative traits, but these variants typically explain only a small fraction of the genetic variance, and so the utility of predictions made using this information can be limited.

An alternative to gene discovery is to focus on the prediction of phenotypes using all genotypic (SNP) information across the whole genome simultaneously. The prediction of phenotypes is useful in a range of fields, from artificial selection programs [16] to risk prediction in human medicine [17] and forensics. To predict phenotypes, identification or genotyping of causal variants is not necessary, as long as some genotyped variants are in linkage disequilibrium (LD) with the causal variants [16,17]. To predict phenotypes from genomic data, the relationship between genome-wide marker data and phenotypes needs to be modeled.
The single SNP regression approach that is often applied in conjunction with stringent thresholds would be expected to estimate inaccurately the proportion of variance that can be explained from genotypic data. Instead, model selection approaches are required to find the set of SNPs that best explains and predicts variation in phenotype. Such approaches have already been proposed for mapping multiple quantitative trait loci (QTL) [18–23], and recently a method was suggested for the simultaneous analysis of all SNPs in a GWAS [24].

In this study, we use statistical modeling to fit multiple SNP effects from a GWAS and derive the best model with a Bayesian model selection approach termed Reversible Jump Markov Chain Monte Carlo (RJMCMC) [25]. We predict unobserved phenotypes for individuals based on genome-wide SNP data only, family information (without genetic data) only, or on a combination of the two.

Author Summary

Results from recent genome-wide association studies indicate that for most complex traits, there are many loci that contribute to variation in observed phenotype and that the effect of a single variant (single nucleotide polymorphism, SNP) on a phenotype is small. Here, we propose a method that combines the effects of multiple SNPs to make a prediction of a phenotype that has not been observed. We apply the method to data on mice, using phenotypic and genomic data from some individuals to predict phenotypes in other, either related or unrelated, individuals. We find that correlations between predicted and actual phenotypes are in the range of 0.4 to 0.9. The method also shows that the SNPs used in the prediction appear in regions that are known to contain genes associated with the traits studied. The prediction of unobserved phenotypes from high-density SNP data and appropriate statistical methodology is feasible and can be applied in human medicine, forensics, or artificial breeding programs.

Data

Publicly available data including pedigree, genotypic and phenotypic information on heterogeneous stock mice were used ([26]; http://gscan.well.ox.ac.uk/). The total number of animals was 2,296 from 85 unrelated families. The available pedigree spanned four generations, generating complex relationships. In the last generation, there were 172 full sib families with an average size of ~11 (SD ~8). Genotypes were available for 12,112 SNPs on most animals in the pedigree, and we used the 11,730 SNPs on the autosomal chromosomes. Phenotypes were already adjusted for the environmental fixed effects, e.g. sex, age, year and season [26,27].

We chose three phenotypes: coat colour, a complex trait with a number of known causal loci (estimated h² ≈ 0.72); percentage of CD8+ cells (%CD8), a quantitative trait with high heritability (estimated h² ≈ 0.99); and mean cellular haemoglobin (MCH), a quantitative trait with moderate heritability (estimated h² ≈ 0.55). Coat colour, as used here, is a measure of the darkness of the coat from white to black. For more detail about the data, see [26,27].

Models

We fitted a range of linear mixed models, with multiple SNPs as fixed effects and, in some models, a polygenic effect to account for additive genetic effects no (...truncated)
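The evaluation design described in the abstract (estimate SNP effects in half of the families, then predict phenotypes in the other half and report the correlation with the observed values) can be illustrated with a small simulation. The sketch below is not the authors' RJMCMC method: it substitutes plain ridge regression over all markers as a simple stand-in for fitting every SNP simultaneously. The genotype coding (additive covariate = allele count minus 1, dominance covariate = heterozygote indicator), the simulated data, the penalty value, and all function names are assumptions made for illustration only.

```python
import numpy as np


def code_additive_dominance(genotypes):
    """Expand 0/1/2 allele counts into additive and dominance covariates.

    Assumed coding (not taken from the paper): additive = count - 1,
    dominance = 1 for heterozygotes and 0 for homozygotes.
    """
    g = np.asarray(genotypes, dtype=float)
    additive = g - 1.0
    dominance = (g == 1.0).astype(float)
    return np.hstack([additive, dominance])


def split_by_family(family_ids, rng):
    """Put half of the families in the training set, the rest in the test set."""
    families = rng.permutation(np.unique(family_ids))
    train_families = families[: len(families) // 2]
    train_mask = np.isin(family_ids, train_families)
    return train_mask, ~train_mask


def fit_ridge(X, y, lam):
    """Ridge regression on centred data, solved in its dual (n x n) form."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc @ Xc.T
    alpha = np.linalg.solve(gram + lam * np.eye(len(y)), y - y_mean)
    beta = Xc.T @ alpha  # marker effects, one per covariate column
    return beta, x_mean, y_mean


def predict(X, beta, x_mean, y_mean):
    """Predict phenotypes from marker covariates and fitted effects."""
    return y_mean + (X - x_mean) @ beta


def run_demo(seed=0):
    """Simulate data, fit on half of the families, score on the other half."""
    rng = np.random.default_rng(seed)
    n_families, family_size, n_snps = 60, 10, 100
    family_ids = np.repeat(np.arange(n_families), family_size)
    n = family_ids.size

    # Genotypes are drawn independently per individual, so the family labels
    # here only drive the split; they carry no simulated relatedness.
    freqs = rng.uniform(0.2, 0.8, size=n_snps)
    genotypes = rng.binomial(2, freqs, size=(n, n_snps))
    X = code_additive_dominance(genotypes)

    effects = np.zeros(2 * n_snps)
    effects[:10] = 1.0                    # additive effects at 10 SNPs
    effects[n_snps:n_snps + 3] = 0.5      # dominance effects at 3 of them
    y = X @ effects + rng.normal(0.0, 1.0, size=n)

    train, test = split_by_family(family_ids, rng)
    beta, x_mean, y_mean = fit_ridge(X[train], y[train], lam=10.0)
    y_hat = predict(X[test], beta, x_mean, y_mean)
    return np.corrcoef(y_hat, y[test])[0, 1]


print(f"correlation in held-out families: {run_demo():.2f}")
```

Splitting by family rather than by individual keeps close relatives of the training animals out of the test set, which is why the split is done on family labels. With real data, the genotype matrix, phenotypes, and family identifiers would replace the simulated arrays, and the penalty would need to be chosen by some form of validation.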


Article home page: http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000231

Sang Hong Lee, Julius H. J. van der Werf, Ben J. Hayes, Michael E. Goddard, Peter M. Visscher. Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP Data, PLoS Genetics, 2008, 10, DOI: 10.1371/journal.pgen.1000231