Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes
Genome Biology 2004, Volume 5, Issue 9, Article R64
Yang Liu 1 2
Paul M Harrison 2
Victor Kunin 0
Mark Gerstein 2
0 Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation , Cambridge CB10 1SD , UK
1 Current address: Department of Biomedical Informatics, Columbia University , 622 W 168th street, New York, NY 10032 , USA
2 Department of Molecular Biophysics and Biochemistry, Yale University , PO Box 208114, New Haven, CT 06520-8114 , USA
Background: Pseudogenes often manifest themselves as disabled copies of known genes. In prokaryotes, it was generally believed (with a few well-known exceptions) that they were rare.

Results: We have carried out a comprehensive analysis of the occurrence of pseudogenes in a diverse selection of 64 prokaryote genomes. Overall, we find a total of around 7,000 candidate pseudogenes. Moreover, in all the genomes surveyed, pseudogenes occur in at least 1 to 5% of all gene-like sequences, with some genomes having considerably higher occurrence. Although many large populations of pseudogenes arise from large, diverse protein families (for example, the ABC transporters), notable numbers of pseudogenes are associated with specific families that do not occur that widely. These include the cytochrome P450 and PPE families (PF00067 and PF00823) and others that have a direct role in DNA transposition.

Conclusions: We find suggestive evidence that a large fraction of prokaryote pseudogenes arose from failed horizontal transfer events. In particular, we find that pseudogenes are more than twice as likely as genes to have anomalous codon usage associated with horizontal transfer. Moreover, we found a significant difference in the number of horizontally transferred pseudogenes in pathogenic and non-pathogenic strains of Escherichia coli.
Genes that have recently fallen out of use for an organism are
often detectable in the genome as pseudogenes - disabled
copies of genes characterizable by disruptions of their reading
frames due to frameshifts and premature stop codons [1-3].
Surveys of the pseudogene populations of eukaryotes
(budding yeast, nematode worm, fruit fly and human) have
recently been completed. These pseudogene analyses
have yielded insights into eukaryotic proteome evolution,
showing that duplicated pseudogene formation tends to occur
in younger, more lineage-specific, protein families, and is in
many cases linked to the generation of functional diversity.
However, pseudogene formation in most prokaryotes has
not been analyzed as a matter of course, and has, historically,
been assumed to be minimal. Recently, some substantial
populations of pseudogenes have been discovered in
pathogenic bacteria, most notably in the leprosy bacillus
Mycobacterium leprae, where around 1,100 pseudogenes (compared
to around 1,600 genes) were found, with pseudogene
formation providing a 'fossil record' of recent wholesale loss of
pathways involved in lipid metabolism and anaerobic
respiration.
Figure 1. Pseudogenes in prokaryotes. (a) Procedure for assigning pseudogenes. The flow chart shows the steps in identifying pseudogenes in 64 prokaryote genomes. The steps include: separate intergenic regions from coding sequence (hypothetical ORFs were excluded); six-frame FastX search on intergenic regions for pseudogene candidates; quality control to reduce false-positive results introduced by artificial disablement or by different codon usage. (b) The occurrence of relative disablement positions in pseudogenes, which were normalized on a 100-residue scale based on ratios of the distances from starting residues to disablements to the length of pseudogenes. The yellow bars indicate the distribution of disablement positions before the last quality-control step and the green bars show the distribution after minimizing false-positive pseudogenes.
[Flow chart in (a): genome sequences (11 archaea, 53 bacteria); intergenic DNA sequences; six-frame FastX search and alignment against the prokaryotic protein dataset (from 64 prokaryotes and SWISSPROT); pseudogene candidates (22,197); quality control for (1) artificial disablements at the ends of aligned sequences and (2) different codon usage; pseudogene candidates (6,895). Horizontal axis in (b): position of disablements in pseudogene sequences, 10 to 100.]
Here we want to address the question of whether these large
populations are exceptional, or whether there are substantial
populations of pseudogenes in other prokaryotic genomes. If
so, from a holistic 'polygenomic' perspective, what sorts of
proteins tend to form prokaryotic pseudogenes? And are
there any themes in common with the occurrence of
pseudogenes in eukaryotes?
To address these broad questions, we have adapted a pipeline
developed for eukaryotic pseudogene identification to 64
prokaryotic genomes. The species analyzed include
archaea, pathogenic bacteria and non-pathogenic bacteria,
and many of the pathogenic bacteria are also important
organisms in current biodefense research. We have found
nearly 7,000 pseudogenes, with notable numbers of
pseudogenes for specific families that are linked to DNA
transposition or that have some role in environmental responses. Our
results, which we have derived consistently across all the
genomes, are available from our prokaryote pseudogene
information website.
Results and discussion
Pseudogenes are pervasive in prokaryotes
To identify pseudogenes in prokaryotic genomes, we
performed a conservative and comprehensive search, as outlined
in Figure 1 and Materials and methods. We used a proteome
set consisting of sequences from the 64 genomes and
Swiss-Prot with relatively high confidence in annotation (that
is, excluding those annotated as hypothetical proteins).
Intergenic regions in prokaryotic genomes were searched against
the proteome set using FastX for homology matches with
disablements as pseudogene candidates. We then applied
several checks to reduce false positives (see Materials and
methods). Overall, we found 6,895 candidate pseudogenes.
Previously, the pseudogene fraction was defined as the ratio
of the number of pseudogenes to the number of all gene-like
sequences (genes plus pseudogenes). By this measure, we
find that pseudogenes are pervasive in prokaryotes (Figure
2). Pseudogenes are detectable at a low 'background' level in
most prokaryotes, ranging from 1 to 5% of all gene-like
sequences (Figure 2). Application of a more restrictive cutoff
(E-value less than 0.001, instead of E-value less than 0.01) in
FastX alignment results in a slightly smaller percentage of
pseudogenes (0.1% less on average) in all the genomes, and
generates essentially the same results (data not shown). Our
census is in general agreement with previous assessments of
pseudogene content in the genomes of M. leprae, Escherichia
coli and Rickettsia. In these previous studies, however,
different criteria were used for pseudogene identification in
different genomes, leading to inconsistencies in comparing
results. This is avoided in our study by using a method applied
uniformly across all genomes. All these assessments suggest
that most prokaryotes have similar net genomic DNA deletion
rates, resulting in similar low-level 'background' pseudogene
fractions in their genomes.
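
As a minimal sketch of the pseudogene fraction defined above, the following Python function computes the ratio of pseudogenes to all gene-like sequences as a percentage. The function name and the example counts are hypothetical and are not taken from any genome in this survey.

```python
def pseudogene_fraction(n_pseudogenes: int, n_genes: int) -> float:
    """Percentage of gene-like sequences (genes plus pseudogenes) that are pseudogenes."""
    total = n_genes + n_pseudogenes
    if total <= 0:
        raise ValueError("need at least one gene-like sequence")
    return 100.0 * n_pseudogenes / total


# Hypothetical genome: 40 pseudogenes and 960 annotated genes.
print(pseudogene_fraction(40, 960))  # 4.0
```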
To check for a correlation with microbial 'lifestyle', we
classified the 64 species into three categories: archaea, pathogenic
bacteria and non-pathogenic bacteria. The pseudogene
fractions for these groupings were assessed. M. leprae has a very
large pseudogene fraction (36.5%) and is clearly a unique
outlier. When this genome is set aside, the three groups have
similar pseudogene fractions (3.6%, 3.9% and 3.3%). Note
that three other pathogenic species/strains have relatively
large pseudogene fractions: Neisseria meningitidis
MC58 (12.4%), N. meningitidis Z2491 (11.6%) and Rickettsia
conorii (9.7%). The higher pseudogene fractions of some
pathogenic species have previously been suggested to be a
result of a rapidly changing environmental niche, with loss of
metabolic and respiratory pathways.
Figure 2. Fractions of pseudogenes in the 64 prokaryote genomes. The genomes are divided into three categories: archaea (green), non-pathogenic bacteria (blue) and pathogenic bacteria (purple). The yellow bars represent the fractions of pseudogenes that overlap with hypothetical ORFs, and the green bars represent those that do not overlap. Genomes in each category are sorted by the green bars.
[Bar chart; axes: genomes versus pseudogene fraction (%).]
We found that about 2,300 of our 6,895 candidate
pseudogenes overlap with more than 2,600 annotated hypothetical
open reading frames (ORFs); the corresponding fractions are
indicated in Figure 2. The overlap could arise from erroneous gene
annotations or sequencing errors. In either case,
pseudogene annotation in prokaryotic genomes is evidently
an important part of decontaminating gene annotation.
We used the Pfam classification to analyze the families
and functions of candidate pseudogenes. The 20 top-ranking
domain families in terms of pseudogenes are shown in Figure
3a. Many large divergent gene families are among the top
pseudogene families, including 9 of the top 10 gene families,
such as the ABC transporter (PF00005), short-chain
dehydrogenases/reductases (PF00106), sugar transporter (major
facilitator superfamily) (PF00083), and histidine kinase-like
ATPase (PF02518). As the largest family of proteins in
prokaryotes, the ABC transporter functions to translocate a
variety of compounds across biological membranes. It
consists of two ATP-binding domains (PF00005) and
two transmembrane domains (PF00664). These domains are
present in large copy numbers across genomes (2,172 and 245
gene copies as well as 67 and 13 pseudogene copies,
respectively).
There are notable protein families that rank high in
pseudogene number, but low in terms of gene number. They include
the PPE family (PF00823), which is thought to be linked to
antigenic variation in mycobacteria and is highly
polymorphic; the cytochromes P450 (PF00067), which are
involved in processing diverse substrates; the GGDEF
domain (PF00990), which is of unknown function and is
associated with a wide diversity of other protein domains;
alpha/beta-hydrolase enzymes (PF00561), which have
diverse catalytic functions; and pseudo-U-synthase-2
enzymes (PF00849), which help synthesize pseudouridine
from uracil. Note that the first two families in this list have
sequence diversity that has some link to environmental
responses.
One might expect the relationship between the number of genes
and the number of pseudogenes in a family to be linear, with
bigger families having
more pseudogenes, but Figure 3b shows this is not the case.
Two large families that have a relatively high ratio of
pseudogenes to genes are the transposase DDE domain (PF01609)
and integrase core domain (PF00665). Transposase
facilitates DNA transposition and horizontal gene transfer, and its
DDE domain may be responsible for DNA cleavage at a
specific site followed by a strand-transfer reaction.
Transposons contain transposases for their transposition.
We found that two strains of N. meningitidis (MC58
and Z2491) carry 26 and 22 copies of transposase
pseudogenes, respectively, and have only 11 and 5 copies of
transposase genes. In the MC58 strain, transposase pseudogenes
have been found in most of the 29 remnant insertion
sequences. This suggests that N. meningitidis strains
probably undergo high selection pressure for transposases.
The integrase core domain family (PF00665) is the catalytic
domain of integrase, which mediates integration of a DNA
copy of a viral/bacteriophage genome into the host genome.
It catalyzes the DNA strand-transfer reaction by ligating
the 3' ends of the viral DNA to the 5' ends of the integration
site. The large number of transposase and integrase
pseudogenes might result from harmful foreign genes being
disabled in transposable elements. Several species contain
many integrase pseudogenes, including Streptococcus
pneumoniae, M. leprae, M. tuberculosis, and E. coli strain
O157:H7. The large number of pseudogenes relative to genes
for these two gene families may reflect an overall high
selective pressure for them - that is, a gene family that is rapidly
duplicating and evolving may generate many pseudogenes.
Origins of pseudogenes
Retrotransposition and genomic DNA duplication generate
pseudogenes in mammals and other eukaryotes [2,3]. In
contrast, in prokaryotes, based on the experience of annotating E.
coli and M. leprae, pseudogenes are suggested to arise
from three processes: the disablement of detectable native
duplications; the decay of native single-copy host genes; and
failed horizontal transfers.
However, the complete extent of the processes forming
prokaryotic pseudogenes is not yet well understood. We
realize that there are many methods of defining horizontal
transfer and an active debate on the best way of doing this,
so we applied two independent methods to predict
horizontal gene transfer events. The first method
(GC-content) is based on the GC content bias at particular codon
positions of recently acquired genes. The second method
(GeneTrace) is based on the analysis of the phylogenetic
distribution of protein families on a species tree. In the
GC-content method, the number of pseudogenes resulting from
horizontal transfer in each genome was estimated by applying
the same criteria to them as had been previously used to
identify horizontally transferred genes. Overall, we found that the
ratio (19.9%) of pseudogenes from potential horizontal
transfer to those derived from the host is significantly higher than
the corresponding ratio for genes in the host (8.6%). We dubbed the
ratio of these two quantities the 'failed horizontal transfer index', and
observed that it implies that pseudogenes are 2.3 times more
likely to arise from horizontal transfer than host genes are.
Figure 3. Gene-to-pseudogene ratios. (a) The top 20 pseudogene families and top 10 gene families based on Pfam classification. Ranking is based on the size of pseudogene families. The top 10 gene families are highlighted with the green background. (b) The number of genes plotted against the number of pseudogenes in a Pfam family. The line represents the overall ratio of the number of pseudogenes to the number of genes in the 64 genomes.
[Panel (a) title: Top ranking pseudogene families by Pfam classification. Labeled points in panel (b): PF01609 Transposase DDE domain; PF00665 Integrase core domain.]
Table 1. Putative horizontally transferred genes and pseudogenes. All genes and pseudogenes and the fraction having atypical codon-position-specific GC contents in the 64 genomes studied. The failed horizontal transfer index was computed as described in Materials and methods.
To confirm our findings based on a method relying on GC
content bias we applied the GeneTrace method (see Materials
and methods). We analyzed a subset of pseudogenes and
found that 18% result from failed horizontal transfer events,
consistent with the previous method. Note that GeneTrace
and the GC-content method are very different in the criteria
they use to assess horizontal transfer and thus make for good
independent verification of each other.
In summary, we report here for the first time an estimate of
how often horizontal transfer in prokaryotes introduces genes
that are redundant, useless or even detrimental. Firstly, ORFs
from dangerous genetic elements are under strong selection
pressure to be deleted from the host's genome. Secondly,
horizontally transferred genes have a higher chance than
non-transferred genes of becoming pseudogenes in most
prokaryotes, which may be a result of
deactivation/disablement of non-beneficial transferred genes.
By examining closely related strains of the same species, we
found that most close strains have a similar value for the
failed horizontal transfer index. In particular, M. tuberculosis
(strains H37Rv and CDC1551), N. meningitidis (strains Z2491
and MC58), and Helicobacter pylori (strains 26695 and J99)
share similar index values within species. However, E. coli
has different index values in the three strains studied. The
free-living E. coli K12 strain has an index value of 4.6,
comparable to values calculated from previous results, while the
two pathogenic E. coli strains O157:H7 and O157:H7 EDL933
have much lower values (1.8 and 0.8). This can be readily
explained in two ways: the intracellular pathogenic E. coli
strains could have moved into a different environment that
results in lower exposure to incoming DNA and thus to a
lower rate of horizontal gene transfer; or these strains
could have an increased rate of gene loss or pseudogene
formation of their host genes.
A polygenomic power-law-like trend in pseudogene
To characterize the overall rate of decay of pseudogene
populations, we plotted the fraction of disablements versus the
average number of matching residues (to their closest
homologs) per pseudogene for each species. This measure
shows how the overall level of decay of a pseudogene
population relates to age (which corresponds to the degree of overall
match to the closest homologs). There is a general
power-law-like behavior governing this measure, with recent
pseudogenes having few disablements and divergent pseudogenes
having many (Figure 4). Archaea and most non-pathogenic
bacteria cluster together at higher rates of disablement
(between 10 and 28 per 1,000 residues) and less significant
matches, indicating comparatively greater retention of
ancient gene remnants in those species and fewer young
pseudogenes. On the other hand, obligate pathogenic bacteria
tend to have younger pools of pseudogenes, even though they
exhibit high disablement rates. Interestingly, four species of
obligate bacterial pathogens clearly stand out from the
general tendency: these are M. leprae and three closely related
mycoplasma species: Mycoplasma pneumoniae,
Mycoplasma pulmonis and Ureaplasma urealyticum.
Pseudogenes in these four pathogenic bacteria carry several times
more disablements, suggesting that these bacteria have an
accelerated disabling mutation rate. It is known that M.
leprae has lost the dnaQ-mediated proofreading activities of
DNA polymerase III, which could contribute to a
higher mutation rate. The higher mutation rates in these
species might suggest that these pathogens are adapting
to their new environment, or have specific genome regions
that are hypermutable.
Figure 4. The fraction of disabled residues (per 1,000 residues) versus the number of average matching residues to the closest homologs per pseudogene in the 64 species categorized into four groups: archaea (blue diamonds), non-pathogenic bacteria (green squares), obligate pathogenic bacteria (purple circles) and non-obligate pathogenic bacteria (red triangles).
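
The two quantities plotted in Figure 4 for each species can be sketched as follows. This is an illustrative reading of the measure, assuming that the disablement rate is the total number of disablements divided by the total number of aligned residues, scaled to 1,000 residues. The record layout and function name are hypothetical.

```python
from typing import List, Tuple

# Each record: (matching_residues, aligned_residues, disablements) for one pseudogene.
Record = Tuple[int, int, int]


def species_decay_point(pseudogenes: List[Record]) -> Tuple[float, float]:
    """Return (average matching residues per pseudogene,
    disablements per 1,000 aligned residues) for one species."""
    if not pseudogenes:
        raise ValueError("no pseudogenes for this species")
    total_matching = sum(m for m, _, _ in pseudogenes)
    total_aligned = sum(a for _, a, _ in pseudogenes)
    total_disablements = sum(d for _, _, d in pseudogenes)
    avg_matching = total_matching / len(pseudogenes)
    rate_per_1000 = 1000.0 * total_disablements / total_aligned
    return avg_matching, rate_per_1000


# Hypothetical species with two pseudogenes.
print(species_decay_point([(60, 100, 2), (90, 150, 3)]))  # (75.0, 20.0)
```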
It is important to note here that the current sequence
databases are derived from an uneven sampling of genomes.
Therefore, genomes of organisms with more sequenced
relatives may appear to have, on average, a seemingly younger
population of pseudogenes, while others may appear to have
older and fewer identifiable pseudogenes. Using data from 64
genomes, our results indicate an overall trend for
pseudogenes observed in most of the genomes studied.
However, these results have to be viewed as preliminary until
more genome data is available.
We have shown that pseudogenes in prokaryotes are not
uncommon, occupying 1-5% of all gene-like sequences. We
find that specific gene families with clear links to DNA
transposition and environmental responses have higher
pseudogene occurrence.
The pseudogene data have many implications for the study of
genome reduction and expansion. A significant
proportion of the pseudogenes arose from putative failed
horizontal transfer - at more than two times the rate for genes.
Obligate pathogenic bacteria have high rates of disablement
in younger pseudogene populations, consistent with recent
accelerated genome reduction, while, in contrast,
archaea and non-pathogenic bacteria have relatively older
pseudogene populations, but similar rates of disablement.
In terms of methodological implications, it is evidently
necessary to include prokaryote pseudogenes as part of systematic
annotation pipelines in the future. In addition, pseudogene
annotation has been shown to be helpful in identifying potential
short ORFs.
Furthermore, our survey shows that trends can be observed
'polygenomically' for prokaryotes, where they are not obvious
or significant in individual genomes.
Materials and methods
Database releases used
We used the following datasets in our prokaryotic
pseudogene analysis: Swiss-Prot (release 40.19, updated to 27
May 2002), containing 43,094 prokaryotic protein
sequences; nucleotide sequences from 64 prokaryotic
genomes from EMBL database release 70 (March 2002),
including 11 genomes from archaea and 53 from bacteria
as listed in Figure 1; and Pfam release 7.3 of May 2002, containing
3,849 families and 498,152 protein domains.
Pseudogene identification pipeline
Figure 1a shows the basic procedure for identifying
prokaryotic pseudogenes. The general schema was adapted from
pipelines for pseudogene analysis in eukaryotes. We generated
a prokaryotic proteome set by collecting all the prokaryotic
protein sequences in the Swiss-Prot database and those
annotated in the 64 prokaryotic genomes. To be conservative, we
did not include hypothetical or putative proteins, a large
proportion of which might be overannotated. All the
protein sequences were masked by SEG using the default
low-complexity filter parameters (12 2.2 2.5). To maximize the
efficiency of the pseudogene search, we only considered the
intergenic DNA regions in the 64 prokaryote genomes
(including the regions encoding hypothetical proteins) as
query sequences, and searched their forward and reverse
complement sequences against the proteome set using FastX.
Significant homology matches (E-value less than 0.01)
that contained more than one disablement (either a
frameshift caused by insertion or deletion of nucleotides or a
premature stop codon) were considered as potential
pseudogenes. If an intergenic region had multiple matches, these
matches were sorted by E-value (increasing) and then, for
matches with the same E-value, by the number of matching
residues (decreasing). The match with the most significant E-value
and the maximum matching residues was selected and
redundant matches were removed.
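
The selection rule described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the `Match` record and its field names are hypothetical, and `min_disablements=2` is a literal reading of "more than one disablement" that is exposed as a parameter in case "at least one" was intended.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Match:
    region_id: str          # intergenic region used as the query
    subject_id: str         # matched protein in the proteome set
    e_value: float
    matching_residues: int
    disablements: int       # frameshifts plus premature stop codons


def select_best_matches(matches: Iterable[Match],
                        e_cutoff: float = 0.01,
                        min_disablements: int = 2) -> Dict[str, Match]:
    """Keep one match per intergenic region: lowest E-value first,
    then the largest number of matching residues."""
    by_region: Dict[str, List[Match]] = {}
    for m in matches:
        if m.e_value < e_cutoff and m.disablements >= min_disablements:
            by_region.setdefault(m.region_id, []).append(m)
    return {
        region: min(ms, key=lambda m: (m.e_value, -m.matching_residues))
        for region, ms in by_region.items()
    }


hits = [
    Match("region1", "P1", 1e-20, 120, 2),
    Match("region1", "P2", 1e-20, 150, 3),   # same E-value, more matching residues
    Match("region1", "P3", 1e-5, 200, 2),
    Match("region2", "P4", 0.5, 80, 2),      # fails the E-value cutoff
]
best = select_best_matches(hits)
print({region: m.subject_id for region, m in best.items()})  # {'region1': 'P2'}
```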
To ensure that spurious disablements were not introduced at
ends of sequences as an alignment artifact, we excluded
homology matches whose disablements occurred only within
a 'cutoff region' at either end. We used 16 residues for the
cutoff region for short sequences (160 amino acids or fewer) - a
parameter that has been applied previously. For longer
sequences (more than 160 amino acids), 10% of the sequence
length was applied as the cutoff region as FastX tends to
include more residues at the ends of alignments.
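
A minimal sketch of this end-region filter is given below. It assumes that disablement positions are 1-based residue positions along the aligned sequence and that "sequence length" refers to that aligned length; both are assumptions, and the function names are hypothetical.

```python
from typing import Iterable


def cutoff_region(length: int) -> float:
    """Width of the end region in residues: 16 for sequences of 160
    residues or fewer, otherwise 10% of the sequence length."""
    return 16.0 if length <= 160 else length / 10


def passes_end_filter(disablement_positions: Iterable[int], length: int) -> bool:
    """True if at least one disablement (1-based position) lies outside
    the cutoff regions at both ends of the aligned sequence."""
    cutoff = cutoff_region(length)
    return any(cutoff < p <= length - cutoff for p in disablement_positions)


print(passes_end_filter([5, 95], 100))    # False: both fall in the end regions
print(passes_end_filter([5, 50], 100))    # True: position 50 is internal
print(passes_end_filter([20, 280], 300))  # False: cutoff region is 30 residues
```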
We also assessed the potential pseudogenes by examining the
distribution of the disablements within pseudogene
sequences. Given that mutations within pseudogenes are
unconstrained, we would expect disablements in
pseudogenes to be evenly distributed. Figure 1b shows the position of
disablements within pseudogene fragments whose length is
normalized to 100 residues. After removing those potential
pseudogenes that only had disablements in the flanking
regions at their ends, the distribution is almost
even. We used this as a 'control filter' to minimize false-positive
pseudogenes. In the final pseudogene set, the length of
pseudogenes ranges from 33 to 4,969 amino acids, with a median
length of 130 amino acids, as compared with the proteome
set, where the length ranges from 7 to 10,920 amino acids
with a median length of 291 amino acids.
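
The normalization of disablement positions to a 100-residue scale can be illustrated with the sketch below. The bin width of 10 follows the axis ticks of Figure 1b; the function names and the example positions are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def position_bin(position: int, length: int, bin_width: int = 10) -> int:
    """Upper edge (10, 20, ..., 100) of the bin holding a disablement at
    1-based `position` in a pseudogene of `length` residues."""
    if not 1 <= position <= length:
        raise ValueError("position must lie within the pseudogene")
    relative = 100.0 * position / length
    return math.ceil(relative / bin_width) * bin_width


def disablement_histogram(disablements: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Count disablements per bin from (position, pseudogene_length) pairs."""
    return dict(Counter(position_bin(p, n) for p, n in disablements))


# Hypothetical disablements in pseudogenes of 130 and 200 residues.
print(disablement_histogram([(13, 130), (65, 130), (130, 130), (100, 200)]))
# {10: 1, 50: 2, 100: 1}
```

An approximately flat histogram over many pseudogenes would correspond to the even distribution expected for unconstrained sequence.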
We took account of non-standard codon usage in some bacteria,
such as TGA encoding tryptophan rather than a stop
codon in mycoplasma species, including Mycoplasma
pneumoniae, M. pulmonis and U. urealyticum. By manual
examination of E. coli genes with translational frameshifts in the
RECODE database, we found that those genes were
included in coding sequences (CDS) and therefore were
excluded from our pseudogene search.
Sequencing errors could also be a potential problem in the
detection of pseudogenes. However, this effect is expected to
be small, as comparison of independently sequenced isolates
of the same E. coli strains indicated that only about 7% of
candidate pseudogenes could be due to sequencing error. To
further consider the possibility of sequencing error, we
examined the stop codons in the pseudogenes detected in the S.
pneumoniae genome (frameshift positions were not
considered, as they are difficult to locate). This genome and eight
others found in the trace archive of the National Center for
Biotechnology Information (NCBI) and Ensembl
were all sequenced by TIGR. We selected S. pneumoniae as a
case study as it is a relatively big genome available in the
archive. By adapting a previous method, we examined
the overall quality values (Q) for each nucleic acid of stop
codons in the pseudogenes. Pseudogene sequences were
aligned to the archived sequences (≥ 95% identity), and the
quality values for nucleotides in stop codons were summed
up. We chose 10^-2 as a cutoff of the error rate (err =
10^(-SUM(0.1Q))) for all nucleic acids. The stop codons with all three
nucleic acids above the cutoff were validated. Out of 116
pseudogenes in this genome, 73 were found to contain 150 stop
codons in total. Using the available data in the trace archive,
we identified 54 pseudogenes with stop codons being aligned
with the original sequences, and validated 47 of these (87%).
In addition, a similar fraction of stop codons (101 out of 116)
was validated.
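
The quality check on stop codons can be sketched as below, under one possible reading of the text: each of the three bases of a stop codon must individually have an estimated error rate no greater than 10^-2, using the standard Phred relation between quality value and error probability. How the summed quality values entered the original calculation is not fully specified here, so this per-base version is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def error_probability(q: float) -> float:
    """Phred convention: a quality value Q corresponds to error 10^(-0.1 * Q)."""
    return 10.0 ** (-0.1 * q)


def stop_codon_validated(qualities: Sequence[float], max_error: float = 1e-2) -> bool:
    """True if each of the three bases has an estimated error rate
    no greater than `max_error`."""
    if len(qualities) != 3:
        raise ValueError("a stop codon has exactly three bases")
    return all(error_probability(q) <= max_error for q in qualities)


print(stop_codon_validated([35, 40, 28]))  # True
print(stop_codon_validated([35, 12, 28]))  # False: Q=12 gives an error of about 0.06
```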
Family classification of genes and pseudogenes
All genes in the 64 genomes were assigned to Pfam families by
cross-referencing their Swiss-Prot IDs. Pseudogenes were
assigned to Pfam families through the IDs of their closest
homologs. Only the homologs that cover more than 70% of
the Pfam domain were selected. A pseudogene could be
assigned to multiple Pfam families if it contains multiple
domains.
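
The 70% coverage rule can be illustrated with a small interval calculation. This sketch assumes that coverage is measured as the part of each Pfam domain on the closest homolog that falls within the region aligned to the pseudogene; the coordinates and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive coordinates on the homolog


def domain_coverage(aligned: Interval, domain: Interval) -> float:
    """Fraction of the domain interval that is covered by the aligned interval."""
    overlap = min(aligned[1], domain[1]) - max(aligned[0], domain[0]) + 1
    return max(0, overlap) / (domain[1] - domain[0] + 1)


def assign_families(aligned: Interval, domains: Dict[str, Interval],
                    min_coverage: float = 0.70) -> List[str]:
    """Pfam accessions whose domain is more than `min_coverage` covered."""
    return [acc for acc, dom in domains.items()
            if domain_coverage(aligned, dom) > min_coverage]


# Hypothetical homolog carrying two domains; the pseudogene aligns to residues 1-200.
domains = {"PF00005": (10, 109), "PF00664": (150, 349)}
print(assign_families((1, 200), domains))  # ['PF00005']
```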
Estimation of horizontally transferred genes and pseudogenes
Here we used a method (GC-content) to estimate horizontally
transferred genes on the basis of their base compositions.
We analyzed each of the 64 genomes individually,
and genes and pseudogenes were identified as atypical if the GC
content at first and third codon positions was two or more
standard deviations higher or lower than the mean values at
those positions in genes.
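
A minimal sketch of this GC-content test is shown below. It assumes in-frame coding sequences and requires both the first and the third codon position to deviate by at least two standard deviations; whether both positions or either one must deviate is not stated explicitly in the text, so that choice, the function names and the example statistics are all assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def gc_at_codon_position(cds: str, codon_position: int) -> float:
    """GC fraction at codon position 1, 2 or 3 of an in-frame sequence."""
    bases = cds.upper()[codon_position - 1::3]
    return sum(b in "GC" for b in bases) / len(bases)


def gene_statistics(genes: Iterable[str]) -> Dict[int, Tuple[float, float]]:
    """Mean and standard deviation of GC content at codon positions 1 and 3."""
    gene_list = list(genes)
    stats = {}
    for pos in (1, 3):
        values = [gc_at_codon_position(g, pos) for g in gene_list]
        stats[pos] = (mean(values), stdev(values))
    return stats


def is_atypical(cds: str, stats: Dict[int, Tuple[float, float]],
                n_sd: float = 2.0) -> bool:
    """True if GC content at both codon positions 1 and 3 lies at least
    `n_sd` standard deviations from the gene means."""
    return all(
        abs(gc_at_codon_position(cds, pos) - mu) >= n_sd * sd
        for pos, (mu, sd) in stats.items()
    )


# Hypothetical per-genome statistics: (mean, standard deviation) per codon position.
stats = {1: (0.55, 0.05), 3: (0.60, 0.05)}
typical = "GAG" * 11 + "AAG" + "AAA" * 8
print(is_atypical("GCG" * 20, stats))  # True
print(is_atypical(typical, stats))     # False
```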
To ensure that we had the codon positions accurately
assigned for the GC-content method, we only analyzed
codons for pseudogenes that aligned well with annotated
protein sequences, specifically excluding the regions of the
alignment around frameshifts. While it is true that the local
alignment in some regions of a pseudogene may be
ambiguous, causing some difference in the GC-content calculation in
that region, the impact on the overall GC-content estimation
is minimal, given how many positions we average over to
calculate the failed transfer index score.
The results for the 64 genomes are shown in Table 1. The
failed transfer index in the last column represents the
ratio of the fraction of putative horizontally transferred
pseudogenes to the fraction of horizontally transferred genes:
Failed transfer index = (Num_{HT,ψ} / Num_{ψ}) / (Num_{HT,Gene} / Num_{Gene}),
where Num_{HT,ψ} and Num_{HT,Gene} are the numbers of putative
horizontally transferred pseudogenes and genes, and Num_{ψ} and
Num_{Gene} are the total numbers of pseudogenes and genes. This is
similar to the measure previously used in E. coli. It
essentially gives a likelihood ratio for horizontal transfer for
pseudogenes relative to that of genes.
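
As a worked illustration of this ratio, the sketch below computes the index from counts. The function name is hypothetical, and the counts are invented so that they reproduce the overall fractions reported in the text (19.9% for pseudogenes and 8.6% for genes); they are not the actual counts from Table 1.

```python
def failed_transfer_index(ht_pseudogenes: int, pseudogenes: int,
                          ht_genes: int, genes: int) -> float:
    """Fraction of putatively transferred pseudogenes divided by the
    fraction of putatively transferred genes."""
    if pseudogenes <= 0 or genes <= 0 or ht_genes <= 0:
        raise ValueError("counts must be positive for the ratio to be defined")
    return (ht_pseudogenes / pseudogenes) / (ht_genes / genes)


# Hypothetical counts chosen to reproduce the overall fractions in the text
# (19.9% of pseudogenes and 8.6% of genes).
index = failed_transfer_index(199, 1000, 86, 1000)
print(f"{index:.1f}")  # 2.3
```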
Note that to minimize the effect of more divergent sequence
alignments, for the horizontal-transfer calculations we only
analyzed 1,748 'recent' pseudogenes, which have more than
50% sequence identity to their closest matches over an
aligned subsequence of more than 100 residues.
We have investigated the statistical robustness of the failed
transfer index using resampling approaches. For each of
the 64 genomes, we randomly picked 90% of its genes and
calculated their GC content. Using the new GC content, we
then identified the putative horizontally transferred genes
and pseudogenes and calculated the failed transfer index. We
applied the process 1,000 times, generating a distribution of
1,000 indexes, which has a mean value of 2.32 with standard
deviation of 0.01.
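
The resampling procedure can be sketched as follows. To keep the example short it uses a single GC value per sequence rather than separate first and third codon positions, and it applies the thresholds from each 90% subset to all genes and pseudogenes; both simplifications are assumptions, as are the function names and the synthetic data.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence


def atypical_fraction(values: Sequence[float], mu: float, sd: float,
                      n_sd: float = 2.0) -> float:
    """Fraction of values at least `n_sd` standard deviations from `mu`."""
    return sum(abs(v - mu) >= n_sd * sd for v in values) / len(values)


def resampled_indexes(gene_gc: Sequence[float], pseudogene_gc: Sequence[float],
                      rounds: int = 1000, keep: float = 0.9,
                      seed: int = 0) -> List[float]:
    """Recompute the failed transfer index from random 90% subsets of genes."""
    rng = random.Random(seed)
    k = max(2, int(keep * len(gene_gc)))
    indexes = []
    for _ in range(rounds):
        subset = rng.sample(list(gene_gc), k)
        mu, sd = mean(subset), stdev(subset)
        f_genes = atypical_fraction(gene_gc, mu, sd)
        f_pseudo = atypical_fraction(pseudogene_gc, mu, sd)
        if f_genes > 0:          # index is undefined without atypical genes
            indexes.append(f_pseudo / f_genes)
    return indexes


# Hypothetical GC values: 5% of genes and 20% of pseudogenes are outliers.
gene_gc = [0.50 + 0.001 * i for i in range(95)] + [0.90] * 5
pseudogene_gc = [0.50] * 8 + [0.90] * 2
indexes = resampled_indexes(gene_gc, pseudogene_gc, rounds=50)
print(len(indexes), round(mean(indexes), 2))
```

The spread of the resulting list (for example its standard deviation) then indicates how sensitive the index is to the particular set of genes used to define typical GC content.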
We also applied an alternative method (GeneTrace) to
estimate horizontally transferred pseudogenes. In this
method, potential horizontal transfer events are inferred
within a protein family when it is present only in distantly
related species and is absent from members of the same
phylogenetic clade. We analyzed a subset of pseudogenes - 225
pseudogenes across 62 genomes - whose closest Swiss-Prot
homologs share more than 70% sequence identity across at
least 100 amino acids, and identified 41 of them (18%) as from
failed horizontal transfer events.
M.G. thanks NIH/NIAID grant for Northeast Biodefense Center
(1U54AI057158-01) for financial support. He also acknowledges support
from the Ruth B. Williams Fund. Y.L. was partially supported by an NLM
postdoctoral fellowship (NIH Grant T15 LM07056). We thank Zhaolei
Zhang and Nick Carriero for helpful discussions and Duncan Milburn for technical help.
1. Vanin EF : Processed pseudogenes: characteristics and evolution . Annu Rev Genet 1985 , 19 : 253 - 272 .
2. Mighell AJ , Smith NR , Robinson PA , Markham AF : Vertebrate pseudogenes . FEBS Lett 2000 , 468 : 109 - 114 .
3. Harrison PM , Gerstein M : Studying genomes through the aeons: protein families, pseudogenes and proteome evolution . J Mol Biol 2002 , 318 : 1155 - 1174 .
4. Harrison PM , Echols N , Gerstein MB : Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome . Nucleic Acids Res 2001 , 29 : 818 - 830 .
5. Harrison P , Kumar A , Lan N , Echols N , Snyder M , Gerstein M : A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution . J Mol Biol 2002 , 316 : 409 - 419 .
6. Harrison PM , Hegyi H , Balasubramanian S , Luscombe NM , Bertone P , Echols N , Johnson T , Gerstein M : Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22 . Genome Res 2002 , 12 : 272 - 280 .
7. Zhang Z , Harrison P , Gerstein M : Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome . Genome Res 2002 , 12 : 1466 - 1482 .
8. Harrison PM , Milburn D , Zhang Z , Bertone P , Gerstein M : Identification of pseudogenes in the Drosophila melanogaster genome . Nucleic Acids Res 2003 , 31 : 1033 - 1037 .
9. Ohshima K , Hattori M , Yada T , Gojobori T , Sakaki Y , Okada N : Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates . Genome Biol 2003 , 4 : R74 .
10. Torrents D , Suyama M , Zdobnov E , Bork P : A genome-wide survey of human pseudogenes . Genome Res 2003 , 13 : 2559 - 2567 .
11. Lawrence JG , Hendrix RW , Casjens S : Where are the pseudogenes in bacterial genomes? Trends Microbiol 2001 , 9 : 535 - 540 .
12. Cole ST , Eiglmeier K , Parkhill J , James KD , Thomson NR , Wheeler PR , Honore N , Garnier T , Churcher C , Harris D , et al.: Massive gene decay in the leprosy bacillus . Nature 2001 , 409 : 1007 - 1011 .
13. Prokaryote Pseudogene Information Site [http://prokaryotes.pseudogene.org]
14. Bairoch A , Apweiler R : The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 . Nucleic Acids Res 2000 , 28 : 45 - 48 .
15. Pearson WR , Lipman DJ : Improved tools for biological sequence comparison . Proc Natl Acad Sci USA 1988 , 85 : 2444 - 2448 .
16. Homma K , Fukuchi S , Kawabata T , Ota M , Nishikawa K : A systematic investigation identifies a significant number of probable pseudogenes in the Escherichia coli genome . Gene 2002 , 294 : 25 - 33 .
17. Andersson SG , Zomorodipour A , Andersson JO , Sicheritz-Ponten T , Alsmark UC , Podowski RM , Naslund AK , Eriksson AS , Winkler HH , Kurland CG : The genome sequence of Rickettsia prowazekii and the origin of mitochondria . Nature 1998 , 396 : 133 - 140 .
18. Andersson JO , Andersson SG : Pseudogenes, junk DNA , and the dynamics of Rickettsia genomes . Mol Biol Evol 2001 , 18 : 829 - 839 .
19. Casjens S , Palmer N , van Vugt R , Huang WM , Stevenson B , Rosa P , Lathigra R , Sutton G , Peterson J , Dodson RJ , et al.: A bacterial genome in flux: the twelve linear and nine circular extrachromosomal DNAs in an infectious isolate of the Lyme disease spirochete Borrelia burgdorferi . Mol Microbiol 2000 , 35 : 490 - 516 .
20. Bateman A , Birney E , Durbin R , Eddy SR , Howe KL , Sonnhammer EL : The Pfam protein families database . Nucleic Acids Res 2000 , 28 : 263 - 266 .
21. Guidotti G : ATP transport and ABC proteins . Chem Biol 1996 , 3 : 703 - 706 .
22. Nikaido H , Hall JA : Overview of bacterial ABC transporters . Methods Enzymol 1998 , 292 : 3 - 20 .
23. Kerr ID : Structure and association of ATP-binding cassette transporter nucleotide-binding domains . Biochim Biophys Acta 2002 , 1561 : 47 - 64 .
24. Higgins CF , Hiles ID , Salmond GP , Gill DR , Downie JA , Evans IJ , Holland IB , Gray L , Buckel SD , Bell AW , et al.: A family of related ATP-binding subunits coupled to many distinct biological processes in bacteria . Nature 1986 , 323 : 448 - 450 .
25. Higgins CF , Hyde SC , Mimmack MM , Gileadi U , Gill DR , Gallagher MP : Binding protein-dependent transport systems . J Bioenerg Biomembr 1990 , 22 : 571 - 592 .
26. Fleischmann RD , Alland D , Eisen JA , Carpenter L , White O , Peterson J , DeBoy R , Dodson R , Gwinn M , Haft D , et al.: Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains . J Bacteriol 2002 , 184 : 5479 - 5490 .
27. Pei J , Grishin NV : GGDEF domain is homologous to adenylyl cyclase . Proteins 2001 , 42 : 210 - 216 .
28. DasSarma S : Identification and analysis of the gas vesicle gene cluster on an unstable plasmid of Halobacterium halobium . Experientia 1993 , 49 : 482 - 486 .
29. Brown NL , Evans LR : Transposition in prokaryotes: transposon Tn501 . Res Microbiol 1991 , 142 : 689 - 700 .
30. Reznikoff WS : The Tn5 transposon . Annu Rev Microbiol 1993 , 47 : 945 - 963 .
31. Tettelin H , Saunders NJ , Heidelberg J , Jeffries AC , Nelson KE , Eisen JA , Ketchum KA , Hood DW , Peden JF , Dodson RJ , et al.: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58 . Science 2000 , 287 : 1809 - 1815 .
32. Dyda F , Hickman AB , Jenkins TM , Engelman A , Craigie R , Davies DR : Crystal structure of the catalytic domain of HIV-1 integrase: similarity to other polynucleotidyl transferases . Science 1994 , 266 : 1981 - 1986 .
33. Lawrence JG , Ochman H : Amelioration of bacterial genomes: rates of change and exchange . J Mol Evol 1997 , 44 : 383 - 397 .
34. Karlin S : Global dinucleotide signatures and analysis of genomic heterogeneity . Curr Opin Microbiol 1998 , 1 : 598 - 610 .
35. Mrazek J , Karlin S : Detecting alien genes in bacterial genomes . Ann NY Acad Sci 1999 , 870 : 314 - 329 .
36. Hayes WS , Borodovsky M : How to interpret an anonymous bacterial genome: machine learning approach to gene identification . Genome Res 1998 , 8 : 1154 - 1171 .
37. Ragan MA : On surrogate methods for detecting lateral gene transfer . FEMS Microbiol Lett 2001 , 201 : 187 - 191 .
38. Lawrence JG , Ochman H : Reconciling the many faces of lateral gene transfer . Trends Microbiol 2002 , 10 : 1 - 4 .
39. Lawrence JG , Ochman H : Molecular archaeology of the Escherichia coli genome . Proc Natl Acad Sci USA 1998 , 95 : 9413 - 9417 .
40. Kunin V , Ouzounis CA : GeneTRACE-reconstruction of gene content of ancestral species . Bioinformatics 2003 , 19 : 1412 - 1416 .
41. Wernegreen JJ , Ochman H , Jones IB , Moran NA : Decoupling of genome size and sequence divergence in a symbiotic bacterium . J Bacteriol 2000 , 182 : 3867 - 3869 .
42. Mizrahi V , Dawes SS , Rubin H : In Molecular Genetics of Mycobacteria Edited by: Hatfull GF , Jacobs WR Jr. Washington, DC: American Society for Microbiology; 2000 : 159 - 172 .
43. Andersson SG , Alsmark C , Canback B , Davids W , Frank C , Karlberg O , Klasson L , Antoine-Legault B , Mira A , Tamas I : Comparative genomics of microbial pathogens and symbionts . Bioinformatics 2002 , 18 ( Suppl 2 ): S17 .
44. Moran NA : Microbial minimalism: genome reduction in bacterial pathogens . Cell 2002 , 108 : 583 - 586 .
45. Harrison PM , Carriero N , Liu Y , Gerstein M : A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs . J Mol Biol 2003 , 333 : 885 - 892 .
46. Stoesser G , Baker W , van den Broek A , Camon E , Garcia-Pastor M , Kanz C , Kulikova T , Leinonen R , Lin Q , Lombard V , et al.: The EMBL Nucleotide Sequence Database . Nucleic Acids Res 2002 , 30 : 21 - 26 .
47. Skovgaard M , Jensen LJ , Brunak S , Ussery D , Krogh A : On the total number of genes and their length distribution in complete microbial genomes . Trends Genet 2001 , 17 : 425 - 428 .
48. Ochman H : Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes . Trends Genet 2002 , 18 : 335 - 337 .
49. Wootton JC , Federhen S : Statistics of local complexity in amino acid sequences and sequence databases . Comput Chem 1993 , 17 : 149 - 163 .
50. Baranov PV , Gurvich OL , Fayet O , Prere MF , Miller WA , Gesteland RF , Atkins JF , Giddings MC : RECODE: a database of frameshifting, bypassing and codon redefinition utilized for gene expression . Nucleic Acids Res 2001 , 29 : 264 - 267 .
51. NCBI trace archive [http://www.ncbi.nlm.nih.gov/Traces]
52. Ensembl trace archive [http://trace.ensembl.org]
53. Read TD , Salzberg SL , Pop M , Shumway M , Umayam L , Jiang L , Holtzapple E , Busch JD , Smith KL , Schupp JM , et al.: Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis . Science 2002 , 296 : 2028 - 2033 .
54. Efron B , Tibshirani R : Statistical data analysis in the computer age . Science 1991 , 253 : 390 - 395 .