Identification of compositionally distinct regions in genomes using the centroid method
BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 20 2007, pages 2672–2677
doi:10.1093/bioinformatics/btm405
Sequence analysis
Issaac Rajan1, Sarang Aravamuthan2 and Sharmila S. Mande1,*
1 Life Sciences Research and 2 e-Security R&D, Advanced Technology Centre, Tata Consultancy Services,
Hyderabad 500 081, Andhra Pradesh, India
Received on January 23, 2007; revised on July 16, 2007; accepted on August 6, 2007
Associate Editor: Burkhard Rost
ABSTRACT
Motivation: It is known that most genomic regions of special interest,
e.g. horizontally acquired sequences, genomic islands, etc. have
distinct word (m-mer) compositions. Most earlier work in this
direction addressed di- and tri-nucleotide compositions. We
present a method, called the centroid approach, that can reveal
compositionally distinct regions in genomic sequences for any
given word size.
Results: We applied our method to 50 bacterial genomes and
demonstrated its ability to identify embedded sequences of varying
lengths from distantly related organisms. We also investigated the
genetic makeup of the regions identified as compositionally distinct
by our method, for four organisms from our dataset. Pathogenicity
island (PAI) components and genes encoding strain-specific proteins
are all frequently seen to be constituents of these regions.
Availability: The program is available on request from the authors.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Following their discovery in uropathogenic Escherichia coli,
pathogenicity islands (PAIs) (Hacker et al., 1990) have been
identified and intensely studied in other bacterial genomes.
Subsequently, other large segments, similar to PAIs, within
prokaryotic genomes were observed encoding various specialized functions. Examples of these functions include secondary
metabolism (metabolic islands), antibiotic resistance (resistance
islands) and secretion (secretion islands). These genomic
substructures are referred to as ‘genomic islands’. Islands often
possess transposons, phage sequences and clusters of genes
which perform related functions or participate in related
pathways. In general, islands are flanked by direct/inverted
repeats and have tRNA or tmRNA in their proximity. There is
substantial evidence to suggest their acquisition through
horizontal transfer (Blum et al., 1994; Sullivan and Ronson, 1998).
*To whom correspondence should be addressed.
Several groups have used annotation-based features to
identify genomic islands. For instance, tRNA and tmRNA
were used as initial leads to identify genomic islands
(Mantri and Williams, 2004). Similar feature-based approaches
include efforts by Ou et al. (2006) and Nag et al. (2006).
Although these approaches are sound, an obvious constraint is
set by the availability of well-annotated genomic sequences.
An incorrect or less rigorous annotation can severely hamper
the outcome of these methods. Another limitation is that since
these methods seek certain biological features, islands devoid
of these features are likely to be overlooked. Methods that use a
more intrinsic attribute (such as genome composition) are
devoid of these limitations. These methods are based on the
hypothesis that genomic islands possess distinct composition,
as compared to the rest of the genome. Karlin (2001) proposed
several strategies based on the compositional aspects of the
genome for the identification of anomalous gene clusters and
PAIs in diverse bacterial genomes. Zhang et al. (2001) proposed
a windowless method for GC content computation, termed
the cumulative GC profile, and applied it to the identification
of genomic islands in Corynebacterium glutamicum and Vibrio
vulnificus CMCP6 chromosome I (Zhang and Zhang, 2004).
Tu and Ding (2003) used iterative discriminant analysis to define
genomic regions that deviate the most from the rest of the
genome based on three compositional criteria, namely, G+C
content, dinucleotide frequency and codon usage. Besides these
successes in analyzing genomes using words of size 2 or 3, it is
generally acknowledged that larger word sizes (5–9) characterize
genomes better (Deschavanne et al., 2000; Sandberg et al., 2001).
In this article, we present an approach (called the centroid
method) that enables identification of compositionally distinct
regions in genomes for any word size. We also show, through
examples, that this method is able to identify embedded foreign
sequences in genomes. Finally, we analyze the DNA content
of the genome composition outlier bins for four of the
organisms from our dataset and comment on the biological
nature of the centroid-defined ‘alien DNA’.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email:
Advance Access publication August 27, 2007
2 METHODS
2.1 The centroid method
In the centroid method, we first partition the genome sequence into
non-overlapping bins of equal length and associate an n-tuple with each
bin. Here, n is the number of distinct m-words (words of length m). For
a given word, there are four possible symbols {A, G, C and T} for each
letter of the word. Hence for a given word of size m letters, the base
distribution frequency of a genomic fragment can be represented in an
n-dimensional space, where $n = 4^m$. For example, for a word of size 5
letters, $n = 4^5 = 1024$. These vectors are viewed as points in an
n-dimensional space. We then determine the centroid of these points.
The distance from the centroid is used as the criterion for determining
the outliers among these points. The outliers correspond to the
compositionally distinct bins.
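The idea described above can be illustrated with a minimal Python sketch. It is not the authors' program. The function names, the use of overlapping windows to count words, the Euclidean distance, and the cutoff of mean plus k standard deviations (with the hypothetical parameter `k`) are all assumptions made for illustration; the distance measure and outlier criterion actually used are defined by the authors separately.

```python
import itertools
import math
import statistics

ALPHABET = "ACGT"

def word_index(m):
    """Map each of the 4**m possible m-words to a position in the vector."""
    words = ("".join(p) for p in itertools.product(ALPHABET, repeat=m))
    return {w: i for i, w in enumerate(words)}

def bin_vector(seq, m, index):
    """Relative frequencies of all m-words in one bin (overlapping windows)."""
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - m + 1):
        pos = index.get(seq[i:i + m])  # words containing other symbols are skipped
        if pos is not None:
            counts[pos] += 1
            total += 1
    return [c / total if total else 0.0 for c in counts]

def centroid_outliers(genome, bin_size, m, k=2.0):
    """Return (start, distance) for bins lying far from the centroid.

    Assumes the genome holds at least one full bin. The Euclidean distance
    and the mean + k*sd cutoff are illustrative assumptions, not the
    criteria of the paper.
    """
    index = word_index(m)
    # Non-overlapping bins of equal size; a trailing partial bin is dropped.
    starts = range(0, len(genome) - bin_size + 1, bin_size)
    vectors = [bin_vector(genome[s:s + bin_size], m, index) for s in starts]
    # Centroid: average frequency of each word across all bins.
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    dists = [math.dist(v, centroid) for v in vectors]
    cutoff = statistics.mean(dists) + k * statistics.pstdev(dists)
    return [(s, d) for s, d in zip(starts, dists) if d > cutoff]

# Toy sequence: nine AT-repeat bins followed by one GC-repeat bin.
genome = "AT" * 450 + "GC" * 50
print([start for start, _ in centroid_outliers(genome, bin_size=100, m=2)])  # [900]
```

In this toy case only the last bin, which starts at position 900, is reported. On real genomes the bin size, word size and cutoff would need to be chosen with care, since the vector has $4^m$ entries and short bins give sparse counts for large m.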
The steps in the centroid method are given below:
(1) The genome of interest is partitioned into non-overlapping bins
of equal size.
(2) For each bin, the frequency of each m-word is computed, giving
the word frequency vector (n-tuple) of that bin.
(3) The average frequency of each word across all bins is computed.
The vector of these averages is the centroid.
(4) For each bin, the distance between its word frequency vector and
the centroid is computed (see below).
(5) Based on the distribution of distances of the bins from the
centroid, a suitable outlier selection criterion is defined in order
to identify outliers among the bins.
(6) Steps 1–5 are repeated for varying offsets from the start position
of the genome. While doing so, the regions identified as
compositionally distinct for all the different offsets should be
combined.
2.4
In order to test whether our method is able to identify an embedded
sequence coming from a closely related organism, we constructed
genomic chimeras, wherein we implanted a 20-kb insert from a closely
related genome (donor organism) between positions 50 001 and 70 000
in the recipient genome. The chimeras (Recipient–Donor) constructed
were: (E.coli CFT07 (...truncated)