Genotype–phenotype associations: substitution models to detect evolutionary associations between phenotypic variables and genotypic evolutionary rate
Timothy D. O'Connor
Nicholas I. Mundy
Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK
Motivation: Mapping between genotype and phenotype is one of the primary goals of evolutionary genetics, but one that has received little attention at the interspecies level. Recent developments in phylogenetics and statistical modelling have typically been used to examine molecular and phenotypic evolution separately. We have used this background to develop phylogenetic substitution models to test for associations between evolutionary rate of genotype and phenotype. We do this by creating hybrid rate matrices between genotype and phenotype.
Results: Simulation results show our models to be accurate in detecting genotype–phenotype associations and robust to various factors that typically affect maximum likelihood methods, such as the number of taxa, level of relevant signal, proportion of sites affected and length of evolutionary divergence. Further, simulations show that our method is robust to homogeneity assumptions. We apply the models to datasets of male reproductive system genes in relation to mating systems of primates. We show that evolution of semenogelin II is significantly associated with mating systems, whereas two negative control genes (cytochrome b and peptidase inhibitor 3) show no significant association. To our knowledge, this is the first hybrid substitution model to test directly the association between genotype and phenotype within a phylogenetic framework.
Availability: Perl and HYPHY scripts are available upon request from the authors.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
One of the major issues in evolutionary genetics research is the
relationship between genotype and phenotype. Natural selection
acts on phenotypes and indirectly leaves a signal at the molecular
level. The connection between the two levels is important because
it ties together the effects of natural selection. Thus, selection for
a phenotype can change the genetic variation for specific genes or
genomic regions.
Within the field of molecular evolution, the study of adaptation
has focused on methods for detecting selection in coding sequences,
with any inferences about phenotypic evolution being indirect. At the
forefront of this enquiry, Yang, Nei, Goldman and others (Goldman
and Yang, 1994; Nei and Gojobori, 1986; Yang, 2007) developed
computational models of molecular evolution to distinguish between
neutral mutation and selection. These codon models focus on the
ratio (dN/dS) of the rate of non-synonymous (protein-altering)
changes to the rate of synonymous (silent) changes, the latter being
assumed to estimate the neutral rate of evolution (Goldman and Yang, 1994;
Muse and Gaut, 1994).
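The distinction underlying the dN/dS ratio can be illustrated with a minimal Python sketch. It is not one of the cited codon models: it only classifies a single-nucleotide codon change as synonymous or non-synonymous under the standard genetic code, and forms the ratio from rates supplied by the caller. The function names classify_change and omega are hypothetical and introduced here for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_a: str, codon_b: str) -> str:
    """Classify a one-nucleotide codon change (hypothetical helper)."""
    differences = sum(x != y for x, y in zip(codon_a, codon_b))
    if len(codon_a) != 3 or len(codon_b) != 3 or differences != 1:
        raise ValueError("expected two codons differing at exactly one position")
    same_residue = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    return "synonymous" if same_residue else "non-synonymous"


def omega(dn: float, ds: float) -> float:
    """Return dN/dS given already-estimated rates (hypothetical helper)."""
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    return dn / ds


if __name__ == "__main__":
    print(classify_change("CTT", "CTC"))  # Leu -> Leu
    print(classify_change("CTT", "CCT"))  # Leu -> Pro
```

Under the usual reading of such models, a ratio near 1 is consistent with neutral evolution, a ratio below 1 with purifying selection and a ratio above 1 with positive selection.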
At the intraspecies level, and occasionally between closely related
species, quantitative trait locus (QTL) analyses have been
designed to detect specific regions of the genome associated with
a given trait (Slate, 2005). These methods typically use pedigree
information or known population structure to make specific crosses
for particular phenotypes (Lynch and Walsh, 1998). The crosses
are then genotyped using SNP or other markers across the whole
genome and statistical associations of the linkage disequilibrium
between genotype and phenotype are identified. Other studies
use association mapping to identify genomic regions involved in
phenotypic differences, or perform candidate gene associations, e.g.
MC1R in relation to colouration differences (Nachman et al., 2003;
Theron et al., 2001).
A few studies have looked for associations at the interspecies level
using phylogenetics. The two main approaches used are regression
analysis between evolutionary rate and phenotypic variation and
codon branch-site models with phenotypes assigned to branches.
In the regression analyses published to date, dN/dS ratios are
calculated for each branch in the tree using the free-ratios model
(Yang, 1998) and a regression is performed by (i) pairing the dN/dS
ratio for each terminal branch with the phenotype value for its
terminal node or (ii) pairing the dN/dS ratio for every branch
with the reconstructed phenotype on that branch. Using the first
approach in primates, Dorus et al. (2004) found a positive correlation
between levels of sperm competition (mean number of partners
in a periovulatory period) and the dN/dS ratio of semenogelin II
(SEMG2), a gene encoding a protein involved in primate semen.
Later, Hurle et al. (2007) added further taxa and performed a
similar analysis but found no significant trend.
In a similar approach, Herlyn and Zischler (2007) found a negative
correlation between the dN/dS ratio of the sperm ligand zonadhesin (ZAN)
and primate body weight dimorphism. In birds, Nadeau et al. (2007)
employed this method to study correlations between pigmentation
genes and sexually dimorphic colour variation in galliforms. They also
used approach (ii), correlating dN/dS ratios for internal
and terminal branches with ancestral reconstructions of sexual
dimorphism in colouration over the phylogenetic tree. Both approaches
showed a correlation between dN/dS and dimorphic colouration for MC1R,
but not for other pigmentation genes (Nadeau et al., 2007).
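The terminal-branch regression described above, approach (i), can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of any cited study: per-branch dN/dS ratios are assumed to have been estimated already (for example, under a free-ratios model), the function name regress_omega_on_phenotype is hypothetical, and the numbers in the usage example are invented toy values, not data from the literature.

```python
from math import sqrt


def regress_omega_on_phenotype(phenotype, omega):
    """Ordinary least squares of branch dN/dS on phenotype value.

    Returns (pearson_r, slope, intercept). Hypothetical helper: each
    terminal branch contributes one (phenotype, dN/dS) pair.
    """
    n = len(phenotype)
    if n != len(omega) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(phenotype) / n
    mean_y = sum(omega) / n
    sxx = sum((x - mean_x) ** 2 for x in phenotype)
    syy = sum((y - mean_y) ** 2 for y in omega)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(phenotype, omega))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in one of the variables")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return pearson_r, slope, intercept


if __name__ == "__main__":
    # Invented toy values for four terminal branches (NOT real data).
    toy_phenotype = [1.0, 2.0, 3.0, 4.0]
    toy_omega = [0.3, 0.5, 0.6, 0.9]
    print(regress_omega_on_phenotype(toy_phenotype, toy_omega))
```

For approach (ii), the same function would be applied to every branch, with the reconstructed phenotype for that branch in place of the observed terminal value.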
The second main approach uses branch-site codon tests,
which test for changes in selection pressure on particular branches
with phenotypes of interest. This method tests for positive selection
by comparing a null model of neutral evolution to a model of positive
selection on those branches (Zhang et al., 2005). Ramm et al. (2008)
reanalysed SEMG2 as well as SEMG1 in primates using the codon
models. They found that branches leading to species with high levels
of sperm competition (multimale mating systems) show significant
evidence of positive selection in SEMG2 but not SEMG1. Branches
leading to species with low levels of sperm competition show no
evidence for positive selection at either locus. In addition, they
tested seven rodent semen proteins and found that Svs2, the rodent
orthologue of SEMG2, showed significant evidence for positive
selection on branches leading to taxa with high relative testis size.
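Comparisons of a null model with a model allowing positive selection, as in the branch-site tests above, are typically carried out as likelihood ratio tests. The sketch below assumes that the two models are nested, that the alternative has one additional free parameter, and that a chi-square distribution with one degree of freedom is an acceptable reference distribution; the function name and the log-likelihood values in the usage example are hypothetical.

```python
from math import erfc, sqrt


def likelihood_ratio_test_df1(lnl_null: float, lnl_alt: float):
    """Likelihood ratio test with a chi-square (1 d.f.) reference.

    Returns (statistic, p_value). Assumes nested models differing by
    one free parameter (an assumption of this sketch).
    """
    # A nested alternative cannot fit worse in theory; clamp small
    # negative values that arise from optimisation error.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of chi-square with 1 d.f.: P(Z^2 > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    print(likelihood_ratio_test_df1(lnl_null=-1002.0, lnl_alt=-1000.0))
```

Using the full chi-square reference here is a simplifying choice; other reference distributions can be substituted by replacing the p-value line.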
All of these tests can be criticized on theoretical grounds. For tests
using phenotypic states derived from terminal taxa, the phenotypic
state is applied to a whole branch without regard to its evolution.
This creates a problem because the phenotype may be assigned to a
portion of the branch on which it was not present, since the timing of
the evolutionary gain or loss of the phenotype is ignored. For
tests relying on phenotypic character reconstruction for internal
assignment, error in reconstruction is not taken into account in
downstream analyses.
One way around these difficulties is the maximum likelihood
approach, which assigns characters to terminal nodes and probability
distributions for those characters to internal nodes (Felsenstein,
1981). Thus, it estimates th (...truncated)