Mutational Bias Affects Protein Evolution in Flowering Plants
Huai-chun Wang, Gregory A. C. Singer, and Donal A. Hickey
Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
Amino acid sequences from several thousand homologous gene pairs were compared for two plant genomes, Oryza
sativa and Arabidopsis thaliana. The Arabidopsis genes all have similar G+C (guanine plus cytosine) contents, whereas
their homologs in rice span a wide range of G+C levels. The results show that those rice genes that display increased
divergence in their nucleotide composition (specifically, increased G+C content) show a corresponding, predictable
change in the amino acid compositions of the encoded proteins relative to their Arabidopsis homologs. This trend was
not seen in a “control” set of rice genes that had nucleotide contents closer to those of their Arabidopsis homologs. In
addition to showing an overall difference in the amino acid composition of the homologous proteins, we were also able to
investigate the biased patterns of amino acid substitution that have arisen since the divergence of these two species. We
found that the amino acid exchange matrix was highly asymmetric when comparing the high G+C rice genes with their
Arabidopsis homologs. Finally, we investigated the possible causes of this biased pattern of sequence evolution. Our
results indicate that the biased pattern of protein evolution is the consequence, rather than the cause, of the
corresponding changes in nucleotide content. In fact, there is an even more marked asymmetry in the patterns of
substitution at synonymous nucleotide sites. Surprisingly, there is a very strong negative correlation between the level
of nucleotide bias and the length of the coding sequences within the rice genome. This difference in gene length may
provide important clues about the underlying mechanisms.
Introduction
Differences in G+C content among genomes have
been intensively studied, and wide variations have been
noted both among entire genomes and among genes within
genomes (Li 1997; Karlin, Campbell, and Mrazek 1998;
Gautier 2000). The differences in nucleotide content
between genomes have been shown to cause concomitant
changes in the amino acid compositions of the encoded
proteins (Collins and Jukes 1993; Foster, Jermiin, and
Hickey 1997; Lobry 1997; Wilquet and Van de Casteele
1999; Singer and Hickey 2000; Kreil and Ouzounis 2001).
Most of these previous studies were based primarily on
prokaryotic genomes because large-scale genomic data
for plants and animals were lacking. Such data are now
becoming available. They not only allow us to extend
previous studies to the genomes of multicellular
eukaryotes but also enable us to trace the
patterns of nucleotide and amino acid substitution between
lineages that have well-defined evolutionary relationships.
Therefore, we can not only see the end results of evolutionary
changes between genomes but also trace the paths by
which these changes took place.
In this study, we compared homologous gene pairs
from two species of flowering plants, Oryza sativa (rice)
and Arabidopsis thaliana. Because these two species
diverged less than 200 MYA, many homologous
sequences from the two genomes are unambiguously
alignable. Moreover, the level of amino acid sequence
divergence between homologous proteins is relatively
low, allowing us to gauge the patterns of amino acid
substitution. Finally, there is a wide variation in the
nucleotide contents of the rice genes: some closely
resemble their Arabidopsis homologs in G+C content,
whereas others have significantly elevated levels of G+C
relative to their homologs (Carels and Bernardi 2000).
Because all of the genes diverged from their common
ancestral sequences at the same point in evolutionary
time, this provides us with a “controlled” evolutionary
experiment, enabling us to do a comparative study of two
sets of rice genes that are evolving under contrasting
evolutionary constraints.
Key words: comparative genomics, angiosperm, nucleotide, amino
acid.
Mol. Biol. Evol. 21(1):90–96. 2004
DOI: 10.1093/molbev/msh003
Molecular Biology and Evolution vol. 21 no. 1
© Society for Molecular Biology and Evolution 2004; all rights reserved.
Materials and Methods
Sources of Sequence Data
Protein sequences from O. sativa were obtained
from the Gramene database (Ware et al. 2002)
(ftp://www.gramene.org/pub/gramene/protein/sequence/rice_sptrembl.fa).
This database contained 8,985 sequences as
of May 2002. From the protein sequence identifiers, we
first obtained the corresponding EMBL accession numbers by
searching SwissProt (Bairoch and Apweiler 2000) and then
extracted the corresponding EMBL sequence records
(Stoesser et al. 2002). We wrote a program to extract
coding sequences from the EMBL records, which yielded
9,916 coding sequences. A total of 443 sequences were
shorter than 75 codons and were excluded from the
analysis. The remaining sequences were subjected to
a codon integrity check using CodonW
(http://www.molbiol.ox.ac.uk/cu/), and we further cleaned the data by
removing redundant sequences. The final data set of O.
sativa coding sequences contains 7,886 nonredundant
sequences. Using EMBOSS/transeq (Rice, Longden, and
Bleasby 2000) to translate the file, we generated a corresponding nonredundant protein sequence file.
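The filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' program, and it does not reproduce the rules applied by CodonW: the integrity check shown here (length divisible by three, only A/C/G/T, an ATG start, no internal stop codon) and the definition of redundancy as exact sequence identity are assumptions made for illustration. The function names `passes_codon_check` and `clean_coding_sequences` are hypothetical. Only the 75-codon threshold is taken from the text.

```python
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 75  # threshold stated in the text


def passes_codon_check(cds: str) -> bool:
    """Rough stand-in for a codon integrity check (assumed rules, not CodonW's)."""
    seq = cds.upper()
    # Reject sequences with partial codons or ambiguous bases.
    if len(seq) % 3 != 0 or set(seq) - set("ACGT"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Require an ATG start and no stop codon before the final position.
    return codons[0] == "ATG" and not any(c in STOP_CODONS for c in codons[:-1])


def clean_coding_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Drop short, malformed, and duplicate coding sequences, in that order."""
    kept: Dict[str, str] = {}
    seen = set()
    for name, cds in records.items():
        seq = cds.upper()
        if len(seq) // 3 < MIN_CODONS:
            continue  # shorter than 75 codons
        if not passes_codon_check(seq):
            continue  # fails the assumed integrity rules
        if seq in seen:
            continue  # redundant: identical to a sequence already kept
        seen.add(seq)
        kept[name] = seq
    return kept
```

Whether the terminal stop codon counts toward the 75-codon minimum is not stated in the text; the sketch counts every complete codon in the sequence.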
A total of 26,178 protein-coding sequences from A.
thaliana (from five chromosomes) were downloaded from the
National Center for Biotechnology Information (NCBI)
FTP server (ftp://ftp.ncbi.nih.gov/genbank/genomes/A_thaliana/).
After passing the sequences to CodonW for a
codon integrity check and removing genes shorter than 75
codons, a total of 25,625 Arabidopsis coding sequences
remained for analysis. Protein sequences of Arabidopsis
Results
Compositional Distribution of Rice and Arabidopsis
Homologous Genes
FIG. 1. Distribution of G+C contents among rice and Arabidopsis
genes. Only homologous gene pairs were used for this analysis. This data
set included 8,894 genes: 4,447 rice genes and their 4,447 homologs from
Arabidopsis.
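As a minimal sketch of how a G+C distribution of this kind could be tabulated, the Python functions below compute the G+C fraction of a coding sequence and count genes per G+C bin. The function names, the choice of 20 bins, and the decision to ignore ambiguous bases are assumptions for illustration; the text does not state how the authors binned their data.

```python
from typing import Dict, List


def gc_content(cds: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = cds.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(seqs: Dict[str, str], n_bins: int = 20) -> List[int]:
    """Count genes per G+C bin; bin i covers [i/n_bins, (i+1)/n_bins)."""
    counts = [0] * n_bins
    for cds in seqs.values():
        # A G+C fraction of exactly 1.0 is placed in the last bin.
        idx = min(int(gc_content(cds) * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts
```

Calling `gc_histogram` once on the rice sequences and once on their Arabidopsis homologs would give two count vectors that can be plotted side by side.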
Identification and Comparison of Homologous
Sequences
Homologous protein pairs between O. sativa and A.
thaliana were identified by performing BlastP searches
(Altschul et al. 1990) of the rice protein sequences against
Arabidopsis sequences with a cutoff expect score of 1e-20.
When a rice protein had more than one Arabidopsis
protein hit, the pair having the most significant expect
score was retained. In all, 4,447 homologous pairs were
identified.
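The best-hit selection described above can be sketched in Python as follows. The sketch assumes tab-separated BLAST output with the query in column 1, the subject in column 2, and the expect value in column 11, as in the 12-column tabular format of current BLAST+; the output format the authors actually parsed is not stated. The function name `best_hits` is hypothetical, and treating the cutoff as inclusive is also an assumption.

```python
import csv
import io
from typing import Dict, Tuple

E_CUTOFF = 1e-20  # cutoff expect score stated in the text


def best_hits(tabular: str, cutoff: float = E_CUTOFF) -> Dict[str, Tuple[str, float]]:
    """Map each query to its lowest-E-value subject at or below the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue  # hit is not significant enough
        # Keep the hit only if it is more significant than the current best.
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

When two hits for the same query have identical expect values, this sketch keeps the first one encountered; the text does not say how such ties were resolved.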
After the homologous protein sequences had been
identified, the corresponding nucleotide sequences were
scored for nucleotide content. In this study, we ranked the
rice homologs by their G+C content. We then compared
the group of 1,000 rice genes with the highest G+C
content (the “high G+C” class) to their homologs in the
Arabidopsis genome. We also performed a par (...truncated)