Extent of Gene Duplication in the Genomes of Drosophila, Nematode, and Yeast

https://mbe.oxfordjournals.org/content/19/3/256.full.pdf

Zhenglong Gu (1, 2), Andre Cavalcanti (1, 2), Feng-Chi Chen (1, 2), Peter Bouman (0, 1), Wen-Hsiung Li (1, 2)

0 Department of Statistics, University of Chicago
1 Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637
2 Department of Ecology and Evolution

We conducted a detailed analysis of duplicate genes in three complete genomes: yeast, Drosophila, and Caenorhabditis elegans. For two proteins belonging to the same family we used the criteria: (1) their similarity is $\geq I$ ($I = 30\%$ if $L \geq 150$ a.a. and $I = 0.01n + 4.8L^{-0.32(1 + \exp(-L/1000))}$ if $L < 150$ a.a., where $n = 6$ and $L$ is the length of the alignable region), and (2) the length of the alignable region between the two sequences is $\geq 80\%$ of the longer protein. We found it very important to delete isoforms (caused by alternative splicing), same genes with different names, and proteins derived from repetitive elements. We estimated that there were 530, 674, and 1,219 protein families in yeast, Drosophila, and C. elegans, respectively, so, as expected, yeast has the smallest number of duplicate genes. However, for the duplicate pairs with the number of substitutions per synonymous site (KS) < 0.01, Drosophila has only seven pairs, whereas yeast has 58 pairs and nematode has 153 pairs. After considering the possible effects of codon usage bias and gene conversion, these numbers became 6, 55, and 147, respectively. Thus, Drosophila appears to have much fewer young duplicate genes than do yeast and nematode. The larger numbers of duplicate pairs with KS < 0.01 in yeast and C. elegans were probably largely caused by block duplications. At any rate, it is clear that the genome of Drosophila melanogaster has undergone few gene duplications in the recent past and has much fewer gene families than C. elegans.

Introduction

It has been proposed that gene duplication is the most important step for the origin of genetic novelties (Ohno 1970, p. 72).
With the availability of complete genome sequences, it has become possible to study the extent of gene duplication on a genome-wide scale. Block duplications in Drosophila, yeast, and Caenorhabditis elegans have been studied in detail by using genomic data (Wolfe and Shields 1997; Seoighe and Wolfe 1999; Friedman and Hughes 2001).

Using the BLASTP E value as the sole criterion for identifying homologous proteins, Rubin et al. (2000) studied the extents of gene duplication in yeast, Drosophila, and C. elegans genomes. However, deciding whether two proteins are homologous requires a more rigorous analysis. For example, domain shuffling or sharing is known to be a common mode of protein evolution (Doolittle 1995) and can mislead the identification of duplicate genes because a low E value between nonhomologous genes can be caused by a shared domain alone. Another difficulty in identifying homologous genes is the detection of remote homology. Deciding whether two proteins are homologous becomes difficult when their sequence identity is within the twilight zone (Doolittle 1986). Improvement in methodology often leads to the discovery of new homologous relationships and new gene family members (Krogh et al. 1994; Sonnhammer, Eddy, and Durbin 1997).

The rate of gene duplication in a genome is also of great interest. This type of study is possible only when whole-genome data are available. Lynch and Conery (2000) estimated the gene duplication rates in the yeast, Drosophila, and C. elegans genomes using the synonymous site changes (KS) as the time scale. However, it is well known that codon usage is highly biased in some genes in these organisms (Ikemura 1982; Akashi, Kliman, and Eyre-Walker 1998).
A negative correlation between synonymous rate (KS) and strength of codon usage bias in Drosophila suggests that in some genes synonymous changes are not neutral (Sharp and Li 1989; Moriyama and Hartl 1993), though Dunn, Bielawski, and Yang (2001) argued against the existence of this correlation. Therefore, the KS value might not reflect the real age of a gene duplication. A combination of KS and the genetic distances in intron and flanking regions might be more informative.

The relatively good quality of genomic sequences and concomitant annotation for yeast, Drosophila, and C. elegans make it possible for us to investigate the above questions in these genomes. However, the presence of the same genes under different names and the existence of alternative splicing forms in the database make it difficult to study the extent of gene duplication in a genome. Moreover, retrotranscriptase (RT) and protein parts derived from repetitive elements (REs) might mislead the identification of homologous proteins. For these reasons, it is important to clean the database.

In this paper, after carrying out a detailed cleaning procedure for the protein databases of these three genomes, we asked two questions: How many gene families are there in each genome? How often has gene duplication occurred in the recent past in each organism? We defined two simple homology criteria by improving the criterion adopted by Rost (1999). Using the new criteria for identifying homologous genes and the single-linkage algorithm for clustering, we estimated the number of gene families in each of the three genomes. The frequency of recent gene duplication was investigated by using gene pairs with a small KS. We excluded the gene pairs with possible gene conversion and codon usage bias by comparing KS with the genetic distance in intron and flanking regions. The effect of codon usage on KS in yeast was further studied.
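
To make the two homology criteria and the single-linkage step concrete, the following Python sketch may help. It is only an illustration under stated assumptions, not the procedure actually used in the paper: the function names (identity_cutoff, same_family, single_linkage) and the layout of the hits tuples are hypothetical, pairwise alignments are assumed to have been computed beforehand, and "similarity" is read as the fraction of identical residues over the alignable region. The cutoff formula follows the abstract, with I = 0.01n + 4.8 L^(-0.32(1 + exp(-L/1000))) for L < 150 and n = 6.

```python
import math
from collections import defaultdict


def identity_cutoff(L, n=6):
    """Minimum fractional identity I for an alignable region of L residues."""
    if L >= 150:
        return 0.30
    # Length-dependent cutoff for short alignments (L < 150 a.a.).
    return 0.01 * n + 4.8 * L ** (-0.32 * (1.0 + math.exp(-L / 1000.0)))


def same_family(identity, aligned_len, len_a, len_b, n=6):
    """Apply both criteria to one pairwise alignment.

    identity    -- fractional identity over the alignable region (assumption)
    aligned_len -- length L of the alignable region
    len_a/len_b -- full lengths of the two proteins
    """
    covers_longer = aligned_len >= 0.8 * max(len_a, len_b)
    return covers_longer and identity >= identity_cutoff(aligned_len, n)


def single_linkage(proteins, hits):
    """Group proteins so that any pair passing same_family shares a cluster.

    hits is an iterable of hypothetical tuples
    (a, b, identity, aligned_len, len_a, len_b).
    Returns only clusters with at least two members.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len, len_a, len_b in hits:
        if same_family(identity, aligned_len, len_a, len_b):
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for p in proteins:
        clusters[find(p)].add(p)
    return [members for members in clusters.values() if len(members) > 1]
```

Single linkage means that two proteins end up in the same family as soon as a chain of qualifying pairs connects them, even if the two do not pass the criteria directly. That is why a union-find structure is a natural fit for this sketch.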
Materials and Methods

The protein data sets were obtained from the following websites:

Caenorhabditis elegans: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/
Wormpep release 40 was used. There were 19,730 protein sequences in the database, of which 48 did not have genomic position information and 22 did not have corresponding coding sequences (cds). We used the remaining 19,660 protein sequences in our analysis.

Yeast: ftp://ncbi.nlm.nih.gov/genbank/genomes/S_cerevisiae/
We used the NCBI October 2000 version, which was part of the Reference Sequence (RefSeq) project. The annotation for this version was based on the Saccharomyces Genome Database in the Stanford genomic resources (SGD, http://genome-www.stanford.edu/Saccharomyces/). A total of 6,297 protein sequences were in the database and used in our analysis. Information for block duplications and gene pairs within the blocks was downloaded from the website http://www.gen.tcd.ie/khwolfe/ (Wolfe and Shields 1997).

Drosophila: ftp://ncbi.nlm.nih.gov/genbank/genomes/D_melanogaster/
Release 2, October 2000 from NCBI was used. A total of 14,335 protein sequences were in the database.

The corresponding cds and genomic sequences for all three genomes were also downloaded from the above websites. The data for each genome were processed as described below.

First Round Grouping

In each of the three genomes studied, every protein (...truncated)