A profile-based method for identifying functional divergence of orthologous genes in bacterial genomes
Bioinformatics
Nicole E. Wheeler 1 2
Lars Barquist 0
Robert A. Kingsley 4 5
Paul P. Gardner 1 2 3
0 Institute for Molecular Infection Biology, University of Wuerzburg, Wuerzburg, Germany
1 Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
2 School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
3 Bio-protection Research Centre, University of Canterbury, Christchurch, New Zealand
4 Wellcome Trust Sanger Institute, Hinxton, UK
5 Institute of Food Research, Norwich Research Park, Norwich, UK
Motivation: Next generation sequencing technologies have provided us with a wealth of information on genetic variation, but predicting the functional significance of this variation is a difficult task. While many comparative genomics studies have focused on gene flux and large scale changes, relatively little attention has been paid to quantifying the effects of single nucleotide polymorphisms and indels on protein function, particularly in bacterial genomics.
Results: We present a hidden Markov model-based approach, which we call delta-bitscore (DBS), for identifying orthologous proteins that have diverged at the amino acid sequence level in a way that is likely to impact biological function. We benchmark this approach with several widely used datasets and apply it to a proof-of-concept study of orthologous proteomes in an investigation of host adaptation in Salmonella enterica. We highlight the value of the method in identifying functional divergence of genes, and suggest that this tool may be a better approach than the commonly used dN/dS metric for identifying functionally significant genetic changes occurring in recently diverged organisms.
Availability and Implementation: A program implementing DBS for pairwise genome comparisons is freely available at: https://github.com/UCanCompBio/deltaBS.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Genome sequencing technologies allow us to explore the wealth of
genetic variation between and within species, and as these
technologies advance, sequence data are becoming progressively cheaper, faster and
easier to produce (Koren and Phillippy, 2015; Loman and Pallen,
2015; Loman et al., 2012). However, analysis of the functional
impact of genetic variation has lagged behind, and has largely
focused on the presence or absence of macroscopic features such
as particular genes, genomic islands or plasmids. Comparative
sequence analyses have become common, and exploration of
genetic variation between closely related organisms has provided
key insights into bacterial evolution (Barquist and Vogel, 2015;
Bryant et al., 2012; Croucher and Didelot, 2014). In particular, the
analysis of single nucleotide polymorphisms (SNPs) has been a
tremendous boon to the study of bacterial populations, allowing for
the construction of phylogenetic trees which provide information on
disease transmission and adaptation at scales ranging from global
pandemics (Mutreja et al., 2011) to outbreaks within single hospital
wards (Harris et al., 2013). Still, the functional analysis of these
SNPs, insertions and deletions within protein sequences remains
difficult, and often relies on inappropriate tools such as dN/dS.
How then can the significance of fine-scale genetic variation be
quantified and prioritized for investigation? Recent studies have
shown that even single SNPs can have dramatic effects on major
phenotypes such as host tropism (Singletary et al., 2016; Viana
et al., 2015; Yue et al., 2015). Studies of pathogen adaptation, for
example the adaptation of Pseudomonas aeruginosa to the cystic
fibrosis lung (Jorth et al., 2015; Marvig et al., 2015) or of Salmonella
enterica to immunocompromised populations (Feasey et al., 2012;
Okoro et al., 2012, 2015), often result in findings of hundreds to
thousands of SNPs and small indels in coding regions. Genome-wide
association studies provide one method (Chewapreecha et al., 2014)
for interpreting this variation; however, the clonal nature of many
pathogens can make such study designs difficult if not impossible to
pursue, and they require large sample sizes to be effective. The
development of fast and accurate ways to assess the functional impact of
variation between strains and prioritize coding variants for
follow-up work is an important step in extracting meaning from
comparative analyses.
Our strategy uses a profile HMM-based approach. Profile
HMMs are probabilistic models of multiple sequence alignments.
For each column in the alignment they capture information on the
expected frequency of occurrence of different amino acids,
insertions, and deletions. We can then use this information to compute a
score, which we call the delta-bitscore (DBS) for reasons explained
below, that quantifies the divergence of two protein sequences with
(...truncated)