Mutational Bias Affects Protein Evolution in Flowering Plants

Huai-chun Wang, Gregory A. C. Singer, and Donal A. Hickey
Department of Biology, University of Ottawa, Ottawa, Ontario, Canada

Amino acid sequences from several thousand homologous gene pairs were compared for two plant genomes, Oryza sativa and Arabidopsis thaliana. The Arabidopsis genes all have similar G+C (guanine plus cytosine) contents, whereas their homologs in rice span a wide range of G+C levels. The results show that those rice genes that display increased divergence in their nucleotide composition (specifically, increased G+C content) also show a corresponding, predictable change in the amino acid compositions of the encoded proteins relative to their Arabidopsis homologs. This trend was not seen in a "control" set of rice genes that had nucleotide contents closer to their Arabidopsis homologs. In addition to showing an overall difference in the amino acid composition of the homologous proteins, we were also able to investigate the biased patterns of amino acid substitution since the divergence of these two species. We found that the amino acid exchange matrix was highly asymmetric when comparing the high G+C rice genes with their Arabidopsis homologs. Finally, we investigated the possible causes of this biased pattern of sequence evolution. Our results indicate that the biased pattern of protein evolution is the consequence, rather than the cause, of the corresponding changes in nucleotide content. In fact, there is an even more marked asymmetry in the patterns of substitution at synonymous nucleotide sites. Surprisingly, there is a very strong negative correlation between the level of nucleotide bias and the length of the coding sequences within the rice genome. This difference in gene length may provide important clues about the underlying mechanisms.
Key words: comparative genomics, angiosperm, nucleotide, amino acid.

Mol. Biol. Evol. 21(1):90–96. 2004. DOI: 10.1093/molbev/msh003. Molecular Biology and Evolution vol. 21 no. 1 © Society for Molecular Biology and Evolution 2004; all rights reserved.

Introduction

Differences in G+C content among genomes have been intensively studied, and wide variations have been noted both among entire genomes and among genes within genomes (Li 1997; Karlin, Campbell, and Mrazek 1998; Gautier 2000). The differences in nucleotide content between genomes have been shown to cause concomitant changes in the amino acid compositions of the encoded proteins (Collins and Jukes 1993; Foster, Jermiin, and Hickey 1997; Lobry 1997; Wilquet and Van de Casteele 1999; Singer and Hickey 2000; Kreil and Ouzounis 2001). Most of these previous studies were based primarily on prokaryotic genomes because of the lack of large-scale genomic data for plants and animals. Such data are now becoming available. They not only allow us to extend previous studies to the genomes of multicellular eukaryotes but also enable us to trace the patterns of nucleotide and amino acid substitution between lineages that have well-defined evolutionary relationships. We can therefore not only see the end results of evolutionary changes between genomes but also trace the paths by which these changes took place.

In this study, we compared homologous gene pairs from two species of flowering plants, Oryza sativa (rice) and Arabidopsis thaliana. Because these two species diverged less than 200 MYA, many homologous sequences from the two genomes are unambiguously alignable. Moreover, the level of amino acid sequence divergence between homologous proteins is relatively low, allowing us to gauge the patterns of amino acid substitution. Finally, there is a wide variation in the nucleotide contents of the rice genes: some closely resemble their Arabidopsis homologs in G+C content, whereas others have significantly elevated levels of G+C relative to their homologs (Carels and Bernardi 2000). Because all of the genes diverged from their common ancestral sequences at the same point in evolutionary time, this provides us with a "controlled" evolutionary experiment, enabling us to do a comparative study of two sets of rice genes that are evolving under contrasting evolutionary constraints.

Materials and Methods

Sources of Sequence Data

Protein sequences from O. sativa were obtained from the Gramene database (Ware et al. 2002) (ftp://www.gramene.org/pub/gramene/protein/sequence/rice_sptrembl.fa). This database contained 8,985 sequences as of May 2002. From the protein sequence identifiers, we first obtained the corresponding EMBL accession numbers by searching SwissProt (Bairoch and Apweiler 2000), then extracted the corresponding EMBL sequence records (Stoesser et al. 2002). We wrote a program to extract coding sequences from the EMBL records, and 9,916 coding sequences were obtained. A total of 443 sequences were shorter than 75 codons and were excluded from the analysis. The remaining sequences were subjected to a codon integrity check using CodonW (http://www.molbiol.ox.ac.uk/cu/), and we further cleaned the data by removing redundant sequences. The final data set of O. sativa coding sequences contains 7,886 nonredundant sequences. Using EMBOSS/transeq (Rice, Longden, and Bleasby 2000) to translate the file, we generated a corresponding nonredundant protein sequence file.

A total of 26,178 protein-coding sequences from A. thaliana (from five chromosomes) were downloaded from the National Center for Biotechnology Information (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genbank/genomes/A_thaliana/). After passing the sequences to CodonW for a codon integrity check and removing genes shorter than 75 codons, a total of 25,625 Arabidopsis coding sequences remained for analysis.
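
The filtering steps described above (length cutoff, integrity check, redundancy removal) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program and not CodonW itself: the function names `passes_basic_checks` and `filter_cds` are hypothetical, the integrity rules (length divisible by three, ATG start, terminal stop codon, no internal stop) are a guess at what a codon integrity check typically covers, the stop codon is counted toward the 75-codon threshold, and "redundant" is taken to mean an exact duplicate sequence.

```python
from typing import Dict

MIN_CODONS = 75  # length threshold stated in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def passes_basic_checks(cds: str, min_codons: int = MIN_CODONS) -> bool:
    """Return True if a coding sequence is long enough and looks intact.

    The integrity rules below are illustrative assumptions, not the
    exact checks performed by CodonW.
    """
    seq = cds.upper()
    if len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Assumption: the terminal stop codon counts toward the length.
    if len(codons) < min_codons:
        return False
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with a stop codon before the final position.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def filter_cds(records: Dict[str, str]) -> Dict[str, str]:
    """Drop short or malformed sequences, then drop exact duplicates.

    Assumption: redundancy means identical nucleotide sequences; the
    first record carrying a given sequence is the one that is kept.
    """
    kept: Dict[str, str] = {}
    seen = set()
    for name, cds in records.items():
        seq = cds.upper()
        if not passes_basic_checks(seq) or seq in seen:
            continue
        seen.add(seq)
        kept[name] = seq
    return kept
```

Calling `filter_cds` on a dictionary that maps sequence identifiers to coding sequences returns only the records that survive all three steps. A real pipeline would probably use a less strict redundancy criterion than exact identity, which is why that choice is isolated in one place.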
Protein sequences of Arabidopsis

FIG. 1. Distribution of G+C contents among rice and Arabidopsis genes. Only homologous gene pairs were used for this analysis. This data set included 8,894 genes: 4,447 rice genes and 4,447 homologs from Arabidopsis.

Identification and Comparison of Homologous Sequences

Homologous protein pairs between O. sativa and A. thaliana were identified by performing BlastP searches (Altschul et al. 1990) of the rice protein sequences against Arabidopsis sequences with a cutoff expect score of 1e-20. When a rice protein had more than one Arabidopsis protein hit, the pair having the most significant expect score was retained. In all, 4,447 homologous pairs were identified. After the homologous protein sequences had been identified, the corresponding nucleotide sequences were scored for nucleotide content. In this study, we ranked the rice homologs by their G+C content. We then compared the group of 1,000 rice genes with the highest G+C content (the "high G+C" class) to their homologs in the Arabidopsis genome. We also performed a par (...truncated)