In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-10-86.pdf

In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

Frank PY Lin 2 Enrico Coiera 2 Ruiting Lan 1 Vitali Sintchenko 0 2 0 Centre for Infectious Diseases and Microbiology, Western Clinical School, University of Sydney , Sydney , Australia 1 School of Biotechnology and Biomolecular Sciences, University of New South Wales , Sydney , Australia 2 Centre for Health Informatics, University of New South Wales , Sydney , Australia Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence pattern across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under receiver operating characteristic curve (AUC) of 0.911 in Escherichia coli K-12 (EC-K12) and 0.978 Streptococcus agalactiae 2603 (SA-2603) genomes, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group of genome examples in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared. - Background Identifying gene functions is an important task in biology. The exponential growth of genome sequences has placed greater importance on the use of computational approaches for sequence analysis and annotation. With the development of high-throughput technology, methods of comparative genomics are increasingly used to assist with the identification of gene functions [1], as conventional methods of gene screening using transgenic organisms are resource intensive and time consuming. In practice, bench-side researchers frequently encounter extensive lists of genes that require further pruning and experimental validation. Accurate prioritisation of candidate genes, therefore, constitutes a key step in accelerating the discovery of gene functions. In silico candidate gene prioritisation (CGP) ranks genes based upon the features associated with genes and the function of interest. A variety of gene features have been suggested for the prioritisation of causal genes in human diseases, including the co-occurrence of gene name and disease terminology in biomedical texts [2-5], sharing of terms in annotation or gene ontology databases [2, 4, 6-9], gene expression in different tissues [2, 4, 6], protein-protein interactions [4], similarity of gene or protein sequences [8, 9], presence of genes within a phenotype or diseases database [10], phylogenetic relationships [11], or a combination of the above [2, 4]. However, to construct a CGP system for prokaryotes, different forms of gene features are needed, as current CGP algorithms are skewed towards eukaryotic genomes and the systematic curation of annotation or genotypephenotype databases are less complete than for eukaryotes. Hundreds of whole genome sequences of bacteria and thousands of partial genome sequences are available in public databases, yet prokaryotic genomes display a higher proportion of genes with unknown function than eukaryotes [12]. In contrast, several methods for computational protein function discovery have been studied, including chromosomal proximity method, domain fusion analysis, analysis of gene expression patterns, and phylogenetic profiles [13]. In particular, the phylogenetic profile method exploits knowledge of gene occurrences across a range of sequenced genomes and postulates that genes involved in the same metabolic pathway are frequently co-inherited. Phylogenetic profiles have been applied to unsupervised clustering of proteins to discover their functional linkages [14] and to discover conserved gene clusters in microbes (with probabilistic phylogenetic tree models) [15]. Supervised approaches of phylogenetic profiles have also been applied to infer protein networks (with canonical correlation analysis [16]) and predicting protein functional class in Saccharomyces cerevisiae (with tree-based kernels [17]), in the discovery of protein localisation in eukaryotes [18], in functional annotation of genes (by correlation enrichments [19]). These studies suggested that the concept of phylogenetic profiles provides a valuable tool for predicting gene-function linkage. It was thus hypothesised that such concept can also be exploited as gene features for prioritising genes contributing to a particular phenotypic trait of interest, thus providing a practical and generalisable tool to guide microbiologists in gene selection. This paper examines the practical application of the phylogenetic profile method for gene prioritisation to investigate its generalisability and applicability on both simple and complex traits in prokaryotes. Phylogenetic profiles form an indirect connection between gene and function in two conceptual steps. The first step establishes the gene-genome relationship, by examining the occurrence (presence or absence) of a candidate gene (or its homolog) in a given genome. The second step groups genomes according to their known phenotypes. We investigate two scenarios in which CGP can be useful in assisting with functional discovery of uncharacterised genes in prokaryotes. The method of statistical CGP is used when the occurrence profile can be directly inferred from the study phenotype, whereas inductive CGP is used when the profile is obscure but a small number of genes known to contribute to the study phenotype are available. Candidate genes are then prioritised by either statistical scoring functions or supervised machine learning algorithms. In addition, at present there are no clear benchmarks to allow comparison between these different approaches to gene prior (...truncated)