In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles
Frank PY Lin (2), Enrico Coiera (2), Ruiting Lan (1), Vitali Sintchenko (0, 2)
0 Centre for Infectious Diseases and Microbiology, Western Clinical School, University of Sydney, Sydney, Australia
1 School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia
2 Centre for Health Informatics, University of New South Wales, Sydney, Australia
Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task.
Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. In the rediscovery of the peptidoglycan metabolism genes, a median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms.
Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
Background
Identifying gene functions is an important task in
biology. The exponential growth of genome sequences
has placed greater importance on the use of
computational approaches for sequence analysis and annotation.
With the development of high-throughput technology,
methods of comparative genomics are increasingly used
to assist with the identification of gene functions [1], as
conventional methods of gene screening using transgenic
organisms are resource intensive and time consuming. In
practice, bench-side researchers frequently encounter
extensive lists of genes that require further pruning and
experimental validation. Accurate prioritisation of
candidate genes, therefore, constitutes a key step in
accelerating the discovery of gene functions.
In silico candidate gene prioritisation (CGP) ranks genes
based upon the features associated with genes and the
function of interest. A variety of gene features have been
suggested for the prioritisation of causal genes in human
diseases, including the co-occurrence of gene name and
disease terminology in biomedical texts [2-5], sharing of
terms in annotation or gene ontology databases [2, 4,
6-9], gene expression in different tissues [2, 4, 6],
protein-protein interactions [4], similarity of gene or
protein sequences [8, 9], presence of genes within a
phenotype or diseases database [10], phylogenetic
relationships [11], or a combination of the above [2,
4]. However, to construct a CGP system for prokaryotes,
different forms of gene features are needed, as current
CGP algorithms are skewed towards eukaryotic genomes
and the systematic curation of annotation or
genotype-phenotype databases is less complete than for
eukaryotes. Hundreds of whole genome sequences of bacteria
and thousands of partial genome sequences are available
in public databases, yet prokaryotic genomes display a
higher proportion of genes with unknown function than
eukaryotes [12]. In contrast, several methods for
computational protein function discovery have been
studied, including the chromosomal proximity method,
domain fusion analysis, analysis of gene expression
patterns, and phylogenetic profiles [13]. In particular,
the phylogenetic profile method exploits knowledge of
gene occurrences across a range of sequenced genomes
and postulates that genes involved in the same metabolic
pathway are frequently co-inherited. Phylogenetic
profiles have been applied to unsupervised clustering of
proteins to discover their functional linkages [14] and to
discover conserved gene clusters in microbes (with
probabilistic phylogenetic tree models) [15]. Supervised
approaches to phylogenetic profiles have also been
applied to infer protein networks (with canonical
correlation analysis [16]), to predict protein
functional class in Saccharomyces cerevisiae (with tree-based
kernels [17]), to discover protein localisation in
eukaryotes [18], and to annotate gene functions (by
correlation enrichments [19]). These studies suggested
that the concept of phylogenetic profiles provides a
valuable tool for predicting gene-function linkage. It was
thus hypothesised that such a concept can also be
exploited as gene features for prioritising genes
contributing to a particular phenotypic trait of interest, thus
providing a practical and generalisable tool to guide
microbiologists in gene selection.
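The co-inheritance idea behind phylogenetic profiles can be illustrated with a minimal sketch. It assumes each gene is described by a binary presence/absence vector over the same ordered set of genomes, and it uses the Jaccard index as a simple stand-in similarity measure; the cited studies use other techniques (clustering, tree models, kernels). The gene names and profile values below are hypothetical.

```python
from typing import Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity of two presence/absence profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for x, y in zip(a, b) if x and y)     # present in both genes
    either = sum(1 for x, y in zip(a, b) if x or y)    # present in at least one
    return both / either if either else 0.0


# Hypothetical profiles over six genomes (1 = homolog present, 0 = absent).
profiles = {
    "gene1": [1, 1, 1, 0, 1, 0],
    "gene2": [1, 1, 1, 0, 1, 0],  # same pattern as gene1: likely co-inherited
    "gene3": [0, 1, 0, 1, 0, 1],  # largely complementary pattern
}

sim_12 = jaccard(profiles["gene1"], profiles["gene2"])
sim_13 = jaccard(profiles["gene1"], profiles["gene3"])
print(f"gene1 vs gene2: {sim_12:.3f}")
print(f"gene1 vs gene3: {sim_13:.3f}")
```

Genes with near-identical profiles, such as gene1 and gene2 here, would be candidates for a shared pathway under the co-inheritance postulate.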
This paper examines the practical application of the
phylogenetic profile method for gene prioritisation to
investigate its generalisability and applicability to both
simple and complex traits in prokaryotes.
Phylogenetic profiles form an indirect connection between
gene and function in two conceptual steps. The first step
establishes the gene-genome relationship, by examining
the occurrence (presence or absence) of a candidate gene
(or its homolog) in a given genome. The second step
groups genomes according to their known phenotypes. We
investigate two scenarios in which CGP can be useful in
assisting with functional discovery of uncharacterised genes
in prokaryotes. The method of statistical CGP is used when
the occurrence profile can be directly inferred from the
study phenotype, whereas inductive CGP is used when the
profile is obscure but a small number of genes known to
contribute to the study phenotype are available. Candidate
genes are then prioritised by either statistical scoring
functions or supervised machine learning algorithms.
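The two conceptual steps and the statistical scenario can be sketched as follows. This is an illustrative sketch only: the scoring function (difference in gene occurrence frequency between phenotype-positive and phenotype-negative genomes) is an assumption chosen for simplicity and is not necessarily one of the scoring functions used in this study. All genome names, gene names, and values are hypothetical.

```python
from typing import Dict, List, Tuple

# Step 1 (gene-genome relationship): presence (1) or absence (0) of each
# candidate gene, or its homolog, in each genome. Hypothetical toy data.
GENOMES: List[str] = ["g1", "g2", "g3", "g4", "g5", "g6"]
PROFILES: Dict[str, List[int]] = {
    "geneA": [1, 1, 1, 0, 0, 0],
    "geneB": [1, 1, 0, 0, 1, 0],
    "geneC": [1, 1, 1, 1, 1, 1],
    "geneD": [0, 0, 0, 1, 1, 1],
}

# Step 2 (genome-phenotype relationship): genomes grouped by known phenotype.
PHENOTYPE: Dict[str, bool] = {
    "g1": True, "g2": True, "g3": True,
    "g4": False, "g5": False, "g6": False,
}


def frequency_difference(profile: List[int]) -> float:
    """Assumed score: occurrence frequency in phenotype-positive genomes
    minus occurrence frequency in phenotype-negative genomes."""
    pos = [p for p, g in zip(profile, GENOMES) if PHENOTYPE[g]]
    neg = [p for p, g in zip(profile, GENOMES) if not PHENOTYPE[g]]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def prioritise(profiles: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Rank candidate genes from highest to lowest score."""
    scores = {gene: frequency_difference(p) for gene, p in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = prioritise(PROFILES)
for gene, score in ranking:
    print(f"{gene}\t{score:+.3f}")
```

In the inductive scenario, the scoring function would be replaced by a supervised classifier trained on the profiles of genes already known to contribute to the phenotype.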
In addition, at present there are no clear benchmarks to
allow comparison between these different approaches to
gene prior (...truncated)