Genome-level homology and phylogeny of Shewanella (Gammaproteobacteria: Alteromonadales: Shewanellaceae)
Rebecca B Dikow 0,1
0 Committee on Evolutionary Biology, The University of Chicago, Chicago, IL, USA
1 Division of Fishes, The Field Museum of Natural History, Chicago, IL, USA
Background: The explosion in availability of whole genome data provides the opportunity to build phylogenetic hypotheses based on these data as well as the ability to learn more about the genomes themselves. The biological history of genes and genomes can be investigated based on the taxonomic history provided by the phylogeny. A phylogenetic hypothesis based on complete genome data is presented for the genus Shewanella (Gammaproteobacteria: Alteromonadales: Shewanellaceae). Nineteen taxa from Shewanella (16 species and 3 additional strains of one species) as well as three outgroup species representing the genera Aeromonas (Gammaproteobacteria: Aeromonadales: Aeromonadaceae), Alteromonas (Gammaproteobacteria: Alteromonadales: Alteromonadaceae) and Colwellia (Gammaproteobacteria: Alteromonadales: Colwelliaceae) are included for a total of 22 taxa.
Results: Putatively homologous regions were found across unannotated genomes and tested with a phylogenetic analysis. Two genome-wide data-sets are considered: one including only those genomic regions for which all taxa are represented, which comprised 3,361,015 aligned nucleotide base-pairs (bp), and a second that additionally includes those regions present in only subsets of taxa, which totaled 12,456,624 aligned bp. Alignment columns in these large data-sets were then randomly sampled to create smaller data-sets. After the phylogenetic hypothesis was generated, genome annotations were projected onto the DNA sequence alignment to compare the historical hypothesis generated by the phylogeny with the functional hypothesis posited by annotation.
Conclusions: Individual phylogenetic analyses of the 243 locally co-linear genome regions all failed to recover the genome topology, but the smaller data-sets that were random samplings of the large concatenated alignments all produced the genome topology. It is shown that there is not a single orthologous copy of 16S rRNA across the taxon sampling included in this study and that the relationships among the multiple copies are consistent with 16S rRNA undergoing concerted evolution. Unannotated whole genome data can provide excellent raw material for generating hypotheses of historical homology, which can be tested with phylogenetic analysis and compared with hypotheses of gene function.
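The random sampling of alignment columns mentioned in the Results can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name `sample_columns`, the toy alignment, and the choice to sample without replacement while keeping the original column order are all assumptions made for illustration.

```python
import random


def sample_columns(alignment, n_columns, seed=None):
    """Build a smaller alignment from randomly chosen columns.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    n_columns: number of columns to draw. Sampling is without replacement,
               which is an assumption of this sketch.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    total = lengths.pop()
    if n_columns > total:
        raise ValueError("cannot sample more columns than the alignment has")
    rng = random.Random(seed)
    # Sort the chosen indices so the sampled columns keep their original order.
    columns = sorted(rng.sample(range(total), n_columns))
    # The same column indices are applied to every taxon, so each sampled
    # column is still a column of the original alignment.
    return {
        taxon: "".join(seq[i] for i in columns)
        for taxon, seq in alignment.items()
    }


# Hypothetical toy alignment: three taxa, ten aligned positions.
toy = {
    "taxon_A": "ACGT-ACGTA",
    "taxon_B": "ACGTTACG-A",
    "taxon_C": "AC-TTACGTA",
}
subset = sample_columns(toy, 4, seed=1)
```

Each sampled data-set keeps every taxon and differs only in which columns are retained, so a tree inferred from it can be compared directly with the tree from the full alignment.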
Background
Shewanella is a genus of marine and freshwater
Gram-negative Gammaproteobacteria within the monogeneric
family Shewanellaceae Ivanova et al., 2004. While
members of Shewanella have been recognized since 1931
(e.g. Achromobacter putrefaciens Derby and Hammer 1931,
now Shewanella putrefaciens), the genus Shewanella has
only been recognized under its present name since 1985
[1], and 39 of the 52 currently recognized species have
been described since 2000 [2]. There are also multiple
strains that are commonly studied but have not been
given a proper name (some of these have been included
below and will be referred to by their strain number).
Members of Shewanella have been described from
diverse habitats, ranging from deep cold-water marine
environments and shallow Antarctic Ocean habitats to
hydrothermal vents and freshwater lakes (see Table 1 [1,3-21]).
Shewanella has been of great interest due to the ability
of its species to convert heavy metals and toxic
substances (e.g. iron, sulfur, uranium) into less toxic
products by using them as electron acceptors in certain
respiratory situations, making them of interest for
environmental clean-up (e.g. iron, sulfur: [22]; uranium: [23]).
To this end, 19 genomes have been fully sequenced and
deposited in GenBank as of 2009. Annotations suggest
that species possess approximately 5,000 genes and have
genomes of approximately 5 Mbp (details in Table 1).
Table 1 Taxon table and Mauve results
Shewanella baltica OS223
Shewanella baltica OS155
Shewanella baltica OS185
Shewanella baltica OS195
Shewanella pealeana ATCC 700345
Shewanella piezotolerans WP3
Shewanella putrefaciens CN-32
Shewanella sediminis HAW-EB3
Shewanella sp. ANA-3
LCBs present in all taxa
The goal of the study presented here is to investigate
how we can use whole genome data not only to build a
tree but also to inform us about gene and genome history,
by comparing the hypothesis of historical homology
supported by the phylogenetic hypothesis with what is known
about gene function. There is a computational interest
in the ability to build large trees, both in number of
taxa and number of characters, e.g. [24,25]. The
biological history of genes and genomes can be investigated
based on the taxonomic history of the bearers of these
characters. This goes beyond predicting the function
of uncharacterized genes: it also includes the
potential to track changing function over gene history
and to find up- or down-stream segments of
co-evolving DNA. Eisen and Fraser highlighted many of these
goals when they introduced the term phylogenomics
[26]. While these goals are broad and ambitious, it is
the hope that the present study represents a step in this
direction.
The presented approach also represents a shift for
phylogenetic systematics, in which historically one has
generally known all the characters of interest very well
and perhaps had a well-formed opinion about their
history based on a lifetime of knowledge about their
distribution and subtle variations. Even with molecular
characters in the form of one or a few genes, even with
many taxa, one gets to know the reliable parts of an
alignment and often memorizes the DNA sequence after
having sequenced and edited the same marker for
several years. The approach presented here proposes a new
perspective that is necessitated by the new kinds of data
being gathered, particularly those from next-generation
and shotgun sequencing, which generate millions of
nucleotide base-pairs (bp) as opposed to thousands.
Primary homology (sensu de Pinna, [27]) must be
determined in an automated fashion given the vast amount
of data and the few character states of nucleotide data.
The phylogenetic tree becomes an intermediate point:
it is built based on hypotheses of primary homology,
which it tests, and then is used as a framework for
optimizing the character states and looking back to
functional gene annotations to begin to answer questions
about gene and genome history. Polymerase chain
reaction (PCR) primers can provide hypotheses of primary
homology, as amplifications using primers target
conserved flanking regions, which provide a sufficient level
of confidence that the same regions are being
sequenced. With next-generation sequencing, we have
no such sense of location (particularly with bacteria), as
we expect rearrangement of genes or other genomic
segments over evolutionary history [28-31]. Annotations
can provide information about the function of genes
and the location of op (...truncated)