Extent of Gene Duplication in the Genomes of Drosophila, Nematode, and Yeast
Zhenglong Gu (1, 2), Andre Cavalcanti (1, 2), Feng-Chi Chen (1, 2), Peter Bouman (0, 1), and Wen-Hsiung Li (1, 2)
(0) Department of Statistics, University of Chicago
(1) Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637
(2) Department of Ecology and Evolution
We conducted a detailed analysis of duplicate genes in three complete genomes: yeast, Drosophila, and Caenorhabditis elegans. For two proteins belonging to the same family we used the criteria: (1) their similarity is $\geq I$ ($I = 30\%$ if $L \geq 150$ a.a. and $I = 0.01n + 4.8L^{-0.32(1 + \exp(-L/1000))}$ if $L < 150$ a.a., where $n = 6$ and $L$ is the length of the alignable region), and (2) the length of the alignable region between the two sequences is $\geq 80\%$ of the longer protein. We found it very important to delete isoforms (caused by alternative splicing), same genes with different names, and proteins derived from repetitive elements. We estimated that there were 530, 674, and 1,219 protein families in yeast, Drosophila, and C. elegans, respectively, so, as expected, yeast has the smallest number of duplicate genes. However, for the duplicate pairs with the number of substitutions per synonymous site (KS) < 0.01, Drosophila has only seven pairs, whereas yeast has 58 pairs and nematode has 153 pairs. After considering the possible effects of codon usage bias and gene conversion, these numbers became 6, 55, and 147, respectively. Thus, Drosophila appears to have much fewer young duplicate genes than do yeast and nematode. The larger numbers of duplicate pairs with KS < 0.01 in yeast and C. elegans were probably largely caused by block duplications. At any rate, it is clear that the genome of Drosophila melanogaster has undergone few gene duplications in the recent past and has much fewer gene families than C. elegans.
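The two criteria stated above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names and the example inputs are hypothetical and not taken from the authors, the identity threshold I is treated as a fraction (so 30% is 0.30), and the formula is used exactly as written in the abstract with n = 6.

```python
import math

def identity_threshold(L, n=6):
    """Minimum identity (as a fraction) for an alignable region of L residues.

    Assumption: I = 0.30 for L >= 150, otherwise
    I = 0.01*n + 4.8 * L**(-0.32 * (1 + exp(-L/1000))), as in the abstract.
    """
    if L >= 150:
        return 0.30
    return 0.01 * n + 4.8 * L ** (-0.32 * (1 + math.exp(-L / 1000)))

def same_family(identity, L, longer_len, n=6):
    """Apply both criteria to one pair (hypothetical helper).

    identity   -- fraction of identical residues in the alignable region
    L          -- length of the alignable region (a.a.)
    longer_len -- length of the longer of the two proteins (a.a.)
    """
    passes_identity = identity >= identity_threshold(L, n)
    passes_coverage = L >= 0.8 * longer_len
    return passes_identity and passes_coverage

# Shorter alignable regions need a higher identity to pass:
print(identity_threshold(200))   # 0.3
print(identity_threshold(100))   # about 0.35
```

Under these assumptions the threshold is close to 0.30 just below L = 150, so the two branches of the rule join almost continuously.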
Introduction
It has been proposed that gene duplication is the
most important step for the origin of genetic novelties
(Ohno 1970, p. 72). With the availability of complete
genome sequences, it has become possible to study the
extent of gene duplication on a genome-wide scale.
Block duplications in Drosophila, yeast, and
Caenorhabditis elegans have been studied in detail by using
genomic data (Wolfe and Shields 1997; Seoighe and
Wolfe 1999; Friedman and Hughes 2001). Using the
BLASTP E value as the sole criterion for identifying
homologous proteins, Rubin et al. (2000) studied the
extents of gene duplication in yeast, Drosophila, and C.
elegans genomes. However, deciding whether two
proteins are homologous requires a more rigorous analysis.
For example, domain shuffling or sharing is known to
be a common mode for protein evolution (Doolittle
1995) and can mislead the identification of duplicate
genes because a low E value between nonhomologous
genes can be caused by a shared domain alone. Another
difficulty in identifying homologous genes is the
detection of remote homology. Deciding whether two
proteins are homologous becomes difficult when their
sequence identity is within the twilight zone (Doolittle
1986). Improvement in methodology often leads to the
discovery of new homologous relationships and new
gene family members (Krogh et al. 1994; Sonnhammer,
Eddy, and Durbin 1997).
The rate of gene duplication in a genome is also of
great interest. This type of study is possible only when
whole-genome data are available. Lynch and Conery
(2000) estimated the gene duplication rates in the yeast,
Drosophila, and C. elegans genomes using the
synonymous site changes (KS) as the time scale. However, it
is well known that codon usage is highly biased in some
genes in these organisms (Ikemura 1982; Akashi,
Kliman, and Eyre-Walker 1998). A negative correlation
between synonymous rate (KS) and strength of codon
usage bias in Drosophila suggests that in some genes
synonymous changes are not neutral (Sharp and Li 1989;
Moriyama and Hartl 1993), though Dunn, Bielawski,
and Yang (2001) argued against the existence of this
correlation. Therefore, the KS value might not reflect the
real age of a gene duplication. A combination of KS and
the genetic distances in intron and flanking regions
might be more informative.
The relatively good quality of genomic sequences
and concomitant annotation for yeast, Drosophila, and
C. elegans makes it possible for us to investigate the
above questions in these genomes. However, the
presence of the same genes under different names and the
existence of alternative splicing forms in the database make
it difficult to study the extent of gene duplication in a
genome. Moreover, retrotranscriptase (RT) and protein
parts derived from repetitive elements (REs) might
mislead the identification of homologous proteins. For these
reasons, it is important to clean the database. In this
paper, after carrying out a detailed cleaning procedure
for the protein databases of these three genomes, we
asked two questions: How many gene families are there
in each genome? How often has gene duplication
occurred in the recent past in each organism? We defined
two simple homology criteria by improving the criterion
adopted by Rost (1999). Using the new criteria for
identifying homologous genes and the single-linkage
algorithm for clustering, we estimated the number of gene
families in each of the three genomes. The frequency of
recent gene duplication was investigated by using gene
pairs with a small KS. We excluded the gene pairs with
possible gene conversion and codon usage bias by
comparing KS with the genetic distance in intron and
flanking regions. The effect of codon usage on KS in yeast
was further studied.
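Single-linkage clustering, as mentioned above, places two proteins in the same family whenever a chain of pairwise homology links connects them, even if the two proteins themselves do not meet the criteria directly. The following Python sketch shows the idea with a union-find structure. It is not the authors' implementation: the function name, the protein identifiers, and the choice to report only clusters with at least two members as families are assumptions made for illustration.

```python
def single_linkage(proteins, homologous_pairs):
    """Group proteins into families by single linkage (hypothetical helper).

    proteins         -- iterable of protein identifiers
    homologous_pairs -- iterable of (a, b) pairs that passed the homology criteria
    Returns a list of families, each a sorted list with at least two members.
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)

    # Assumption: singletons are not counted as gene families.
    return [sorted(members) for members in clusters.values() if len(members) > 1]

# A and C are never compared directly but end up together through B.
families = single_linkage(["A", "B", "C", "D", "E", "F"],
                          [("A", "B"), ("B", "C"), ("D", "E")])
print(sorted(families))
```

One consequence of single linkage is that a single spurious link can merge two otherwise unrelated families, which is one reason the homology criteria and the database cleaning matter.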
Materials and Methods
The protein data sets were obtained from the
following websites:
Caenorhabditis elegans: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/
Wormpep release 40 was
used. There were 19,730 protein sequences in the
database, of which 48 did not have genomic position
information and 22 did not have corresponding coding
sequences (cds). We used the remaining 19,660 protein
sequences in our analysis.
Yeast: ftp://ncbi.nlm.nih.gov/genbank/genomes/S_cerevisiae/
We used the NCBI October 2000 version,
which was part of the Reference Sequence (RefSeq)
project. The annotation for this version was based on the
Saccharomyces Genome Database in the Stanford
genomic resources (SGD, http://genome-www.stanford.
edu/Saccharomyces/). A total of 6,297 protein sequences
were in the database and used in our analysis.
Information for block duplications and gene pairs within the
blocks was downloaded from the website
http://www.gen.tcd.ie/khwolfe/ (Wolfe and Shields 1997).
Drosophila: ftp://ncbi.nlm.nih.gov/genbank/genomes/D_melanogaster/
Release 2, October 2000 from NCBI
was used. A total of 14,335 protein sequences were in
the database.
The corresponding cds and genomic sequences for
all three genomes were also downloaded from the above
websites.
The data for each genome was processed as
described below.
First Round Grouping
In each of the three genomes studied, every protein (...truncated)