Identification of compositionally distinct regions in genomes using the centroid method

BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 20 2007, pages 2672–2677
doi:10.1093/bioinformatics/btm405

Sequence analysis

Issaac Rajan1, Sarang Aravamuthan2 and Sharmila S. Mande1,*
1Life Sciences Research and 2e-Security R&D, Advanced Technology Centre, Tata Consultancy Services, Hyderabad 500 081, Andhra Pradesh, India

Received on January 23, 2007; revised on July 16, 2007; accepted on August 6, 2007
Associate Editor: Burkhard Rost

ABSTRACT

Motivation: Most genomic regions of special interest, e.g. horizontally acquired sequences and genomic islands, are known to have distinct word (m-mer) compositions. Most earlier work in this direction addressed di- and tri-nucleotide compositions. We present an approach, called the centroid approach, that can reveal compositionally distinct regions in genomic sequences for any given word size.

Results: We applied our method to 50 bacterial genomes and demonstrated its ability to identify embedded sequences of varying lengths from distantly related organisms. We also investigated the genetic makeup of the regions identified as compositionally distinct by our method for four organisms from our dataset. Pathogenicity island (PAI) components and genes encoding strain-specific proteins are frequently found in these regions.

Availability: Program is available on request from the authors.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Following their discovery in uropathogenic Escherichia coli, pathogenicity islands (PAIs) (Hacker et al., 1990) have been identified and intensely studied in other bacterial genomes. Subsequently, other large segments similar to PAIs were observed within prokaryotic genomes, encoding various specialized functions.
Examples of these functions include secondary metabolism (metabolic islands), antibiotic resistance (resistance islands) and secretion (secretion islands). These genomic substructures are referred to as ‘genomic islands’. Islands often possess transposons, phage sequences and clusters of genes that perform related functions or participate in related pathways. In general, islands are flanked by direct or inverted repeats and have a tRNA or tmRNA gene in their proximity. There is substantial evidence to suggest that they were acquired through horizontal transfer (Blum et al., 1994; Sullivan and Ronson, 1998).

Several groups have used annotation-based features to identify genomic islands. For instance, tRNA and tmRNA were used as initial leads to identify genomic islands (Mantri and Williams, 2004). Similar feature-based approaches include efforts by Ou et al. (2006) and Nag et al. (2006). Although these approaches are sound, they are constrained by the availability of well-annotated genomic sequences. An incorrect or less rigorous annotation can severely hamper the outcome of these methods. Another limitation is that, since these methods seek certain biological features, islands lacking these features are likely to be overlooked.

*To whom correspondence should be addressed.

Methods that use a more intrinsic attribute (such as genome composition) are free of these limitations. These methods are based on the hypothesis that genomic islands possess a distinct composition compared to the rest of the genome. Karlin (2001) proposed several strategies based on the compositional aspects of the genome for the identification of anomalous gene clusters and PAIs in diverse bacterial genomes. Zhang et al. (2001) proposed a windowless method for GC content computation, termed the cumulative GC profile, which was applied to the identification of genomic islands in Corynebacterium glutamicum and Vibrio vulnificus CMCP6 chromosome I (Zhang and Zhang, 2004).
Tu and Ding (2003) used iterative discriminant analysis to define genomic regions that deviate the most from the rest of the genome based on three compositional criteria, namely G+C content, dinucleotide frequency and codon usage. Beyond these successes in analyzing genomes using words of size 2 or 3, it is generally acknowledged that larger word sizes (5–9) characterize genomes better (Deschavanne et al., 2000; Sandberg et al., 2001).

In this article, we present an approach (called the centroid method) that enables identification of compositionally distinct regions in genomes for any word size. We also show, through examples, that this method is able to identify embedded foreign sequences in genomes. Finally, we analyze the DNA content of the genome composition outlier bins for four of the organisms from our dataset and comment on the biological nature of the centroid-defined ‘alien DNA’.

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email:
Advance Access publication August 27, 2007

2 METHODS

2.1 The centroid method

In the centroid method, we first partition the genome sequence into non-overlapping bins of equal length and associate an n-tuple with each bin. Here, n is the number of distinct m-words (words of length m). Each letter of a word can be one of four symbols {A, G, C and T}. Hence, for a word size of m letters, the base distribution frequency of a genomic fragment can be represented in an n-dimensional space, where n = 4^m. For example, for a word size of 5 letters, n = 4^5 = 1024. These vectors are viewed as points in an n-dimensional space. We then determine the centroid of these points. The distance from the centroid is used as the criterion for determining the outliers among these points. The outliers correspond to the compositionally distinct bins.
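The description above can be illustrated with a minimal Python sketch. It is not the authors' program: all function names are hypothetical, and three choices are assumptions made only for illustration, namely normalizing word counts to relative frequencies, using Euclidean distance, and flagging bins whose distance exceeds the mean by k standard deviations.

```python
from itertools import product
import math

def word_index(m):
    """Map each m-word over {A, C, G, T} to a position in a 4**m vector."""
    return {"".join(w): i for i, w in enumerate(product("ACGT", repeat=m))}

def bin_vectors(genome, bin_size, m):
    """Return one m-word relative-frequency vector per complete,
    non-overlapping bin (normalization is an assumption)."""
    index = word_index(m)
    vectors = []
    for start in range(0, len(genome) - bin_size + 1, bin_size):
        segment = genome[start:start + bin_size]
        counts = [0] * len(index)
        for i in range(len(segment) - m + 1):
            pos = index.get(segment[i:i + m])  # words with other symbols are skipped
            if pos is not None:
                counts[pos] += 1
        total = sum(counts) or 1  # avoid division by zero for empty counts
        vectors.append([c / total for c in counts])
    return vectors

def centroid(vectors):
    """Average each word frequency across all bins."""
    n = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(n)]

def distances(vectors, center):
    """Distance of each bin from the centroid (Euclidean is an assumption)."""
    return [math.dist(v, center) for v in vectors]

def outlier_bins(dists, k=2.0):
    """Indices of bins farther than mean + k * SD (assumed criterion)."""
    mean = sum(dists) / len(dists)
    sd = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if d > mean + k * sd]

# Toy example: ten 100-bp bins, one of which (index 4) has a different composition.
toy = "ACGT" * 100 + "GGCC" * 25 + "ACGT" * 125
vecs = bin_vectors(toy, bin_size=100, m=2)
flagged = outlier_bins(distances(vecs, centroid(vecs)))
print(flagged)
```

In this toy sequence the nine identical bins sit close to the centroid, so only the bin with the differing dinucleotide composition is expected to be flagged. Real genomes would need a bin size, word size and threshold chosen from the observed distribution of distances.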
The steps in the centroid method are given below:

(1) The genome of interest is partitioned into non-overlapping bins of equal size.

(3) The average frequency of each word across all bins is computed. The vector of these averages is the centroid.

(4) For each bin, the distance between its word frequency vector and the centroid is computed (see below).

(5) Based on the distribution of distances of the bins from the centroid, a suitable outlier selection criterion is defined in order to identify outliers among the bins.

(6) Steps 1–5 are repeated for varying offsets from the start position of the genome. While doing so, the regions identified as compositionally distinct for all the different offsets should be combined.

In order to test whether our method is able to identify an embedded sequence coming from a closely related organism, we constructed genomic chimeras, wherein we implanted a 20-kb insert from a closely related genome (donor organism) between positions 50 001 and 70 000 in the recipient genome. The chimeras (Recipient–Donor) constructed were: (E.coli CFT07 (...truncated)