A profile-based method for identifying functional divergence of orthologous genes in bacterial genomes

Bioinformatics, Dec 2016

Motivation: Next generation sequencing technologies have provided us with a wealth of information on genetic variation, but predicting the functional significance of this variation is a difficult task. While many comparative genomics studies have focused on gene flux and large scale changes, relatively little attention has been paid to quantifying the effects of single nucleotide polymorphisms and indels on protein function, particularly in bacterial genomics.

Results: We present a hidden Markov model based approach we call delta-bitscore (DBS) for identifying orthologous proteins that have diverged at the amino acid sequence level in a way that is likely to impact biological function. We benchmark this approach with several widely used datasets and apply it to a proof-of-concept study of orthologous proteomes in an investigation of host adaptation in Salmonella enterica. We highlight the value of the method in identifying functional divergence of genes, and suggest that this tool may be a better approach than the commonly used dN/dS metric for identifying functionally significant genetic changes occurring in recently diverged organisms.

Availability and Implementation: A program implementing DBS for pairwise genome comparisons is freely available at: https://github.com/UCanCompBio/deltaBS.

Contact: nicole.wheeler{at}pg.canterbury.ac.nz or lars.barquist{at}uni-wuerzburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.


Nicole E. Wheeler (1, 2), Lars Barquist (0), Robert A. Kingsley (4, 5), Paul P. Gardner (1, 2, 3)

0 Institute for Molecular Infection Biology, University of Wuerzburg, Wuerzburg, Germany
1 Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
2 School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
3 Bio-protection Research Centre, University of Canterbury, Christchurch, New Zealand
4 Wellcome Trust Sanger Institute, Hinxton, UK
5 Institute of Food Research, Norwich Research Park, Norwich, UK
1 Introduction

Genome sequencing technologies allow us to explore the wealth of genetic variation between and within species, and as these technologies advance this data is becoming progressively cheaper, faster and easier to produce (Koren and Phillippy, 2015; Loman and Pallen, 2015; Loman et al., 2012). However, analysis of the functional impact of genetic variation has lagged behind, and has largely focused on the presence or absence of macroscopic features such as particular genes, genomic islands or plasmids. Comparative sequence analyses have become common, and exploration of genetic variation between closely related organisms has provided key insights into bacterial evolution (Barquist and Vogel, 2015; Bryant et al., 2012; Croucher and Didelot, 2014). In particular, the analysis of single nucleotide polymorphisms (SNPs) has been a tremendous boon to the study of bacterial populations, allowing for the construction of phylogenetic trees which provide information on disease transmission and adaptation at scales ranging from global pandemics (Mutreja et al., 2011) to outbreaks within single hospital wards (Harris et al., 2013). Still, the functional analysis of these SNPs, insertions and deletions within protein sequences remains difficult, and often relies on inappropriate tools such as dN/dS.

How then can the significance of fine-scale genetic variation be quantified and prioritized for investigation? Recent studies have shown that even single SNPs can have dramatic effects on major phenotypes such as host tropism (Singletary et al., 2016; Viana et al., 2015; Yue et al., 2015).
Studies of pathogen adaptation, for example the adaptation of Pseudomonas aeruginosa to the cystic fibrosis lung (Jorth et al., 2015; Marvig et al., 2015) or of Salmonella enterica to immunocompromised populations (Feasey et al., 2012; Okoro et al., 2012, 2015), often result in findings of hundreds to thousands of SNPs and small indels in coding regions. Genome-wide association studies provide one method (Chewapreecha et al., 2014) for interpreting this variation; however, the clonal nature of many pathogens can make such study designs difficult if not impossible to pursue, and they require large sample sizes to be effective. The development of fast and accurate ways to assess the functional impact of variation between strains and prioritize coding variants for follow-up work is an important step in extracting meaning from comparative analyses.

Our strategy uses a profile HMM-based approach. Profile HMMs are probabilistic models of multiple sequence alignments. For each column in the alignment they capture information on the expected frequency of occurrence of different amino acids, insertions, and deletions. We can then use this information to compute a score, which we call delta-bit score (DBS) for reasons explained below, that quantifies the divergence of two protein sequences with (...truncated)
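The idea of scoring two sequences against per-column amino acid frequencies can be illustrated with a minimal Python sketch. It is not the authors' deltaBS implementation and it makes several simplifying assumptions: the alignment is gapless, so insertion and deletion states are ignored; the background distribution is uniform; and DBS is taken to be the difference between the two sequences' bit scores against the same profile, with a sign convention chosen here for illustration. The function names `build_profile`, `bit_score` and `delta_bit_score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from a gapless alignment.

    Simplification: a real profile HMM also models insertions and deletions.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # Pseudocounts keep unseen residues from getting zero probability.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def bit_score(profile, sequence):
    """Log-odds score in bits against a uniform background (an assumption)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    background = 1.0 / len(AMINO_ACIDS)
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


def delta_bit_score(profile, reference, variant):
    """Difference in bit score; positive means the variant fits the profile worse."""
    return bit_score(profile, reference) - bit_score(profile, variant)


# Toy alignment of a hypothetical four-residue protein family.
alignment = ["MKLV", "MKLI", "MRLV", "MKLV"]
profile = build_profile(alignment)

# A substitution at the fully conserved third column is penalised.
print(delta_bit_score(profile, "MKLV", "MKWV"))
```

In this toy case the two sequences differ only at a column where every aligned sequence carries leucine, so the whole score difference comes from that one position. Substitutions at variable columns would contribute much less, which is the intuition behind weighting variants by conservation rather than counting them.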



Nicole E. Wheeler, Lars Barquist, Robert A. Kingsley, Paul P. Gardner. A profile-based method for identifying functional divergence of orthologous genes in bacterial genomes, Bioinformatics, 2016, pp. 3566-3574, 32/23, DOI: 10.1093/bioinformatics/btw518