Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods

Notaro et al. BMC Bioinformatics (2017) 18:449
DOI 10.1186/s12859-017-1854-y

RESEARCH ARTICLE Open Access

Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods

Marco Notaro1, Max Schubach2,6, Peter N. Robinson2,3,4,5 and Giorgio Valentini1*

Abstract

Background: The prediction of human gene–abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of the prediction of gene–disease associations has been widely investigated, the related problem of gene–phenotypic feature (i.e., HPO term) associations has been largely overlooked, even though no HPO term associations are known for most human genes and despite the increasing application of the HPO to relevant medical problems. Moreover, most of the methods proposed in the literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions.

Results: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, which consists of a “flat” learning first step followed by a hierarchical combination of the predictions in a second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction in computational complexity.

Conclusions: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve HPO term predictions and make them consistent, starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.

Keywords: Human Phenotype Ontology, Hierarchical multi-label classification, Hierarchical ensemble methods, Gene-Abnormal phenotype association, Human Phenotype Ontology term prediction, Phenotype gene prioritization

*Correspondence: 1 Anacleto Lab - Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milan, Italy
Full list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

In contrast to its general meaning, which usually refers to the traits or characteristics of an organism, in medical contexts the word “phenotype” is defined as a deviation from normal morphology, physiology, or behavior [1]. The analysis of phenotype is essential for understanding the pathophysiology of cellular networks and plays a key role in medical research and in the mapping of disease genes [2, 3]. The Human Phenotype Ontology (HPO) project [4] provides a standard categorization of human abnormal phenotypes and of their semantic relationships. It is worth noting that an HPO term does not represent a disease, but rather denotes individual signs, symptoms or other clinical abnormalities that characterize a disease. The HPO is currently developed using the medical literature and the OMIM [5], Orphanet [6] and DECIPHER [7] databases, and contains approximately 11,000 terms and over 115,000 annotations to hereditary diseases.

The HPO is structured as a directed acyclic graph (DAG), where more general terms are found at the top levels of the hierarchy and term specificity increases moving towards the lower levels, i.e. from root to leaves. As a consequence, unlike tree-structured taxonomies such as FunCat [8], each HPO term may have more than one parent. The HPO is governed by the true path rule (also known as the annotation propagation rule) [2]: if a gene is annotated with a given term, then it is also annotated with all the “parent” terms of that term and, recursively, with all its ancestors. Conversely, if a gene is not annotated to a term, it cannot be annotated to any of its descendants.

While the problem of the prediction of gene–disease associations has been widely investigated [9], the related problem of gene–HPO term prediction has been considered in only a few studies [10], despite the fact that no HPO term associations are known for most human genes and despite the rapidly growing application of the HPO to relevant medical problems [11, 12].

“Flat” classification methods have been applied to the prediction of gene–HPO term associations [13]. Unfortunately these methods can introduce major inconsistencies in the classification, because labels are predicted independently, without taking into account the hierarchical relationships within the ontology [14].
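The true path rule and the kind of inconsistency that independent flat predictions can introduce, both described above, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four-term DAG, the term names (`root`, `A`, `B`, `C`), the scores, and the helper names `ancestors`, `propagate` and `is_consistent` are hypothetical and are not taken from the HPO or from the authors' R package. The check "no term scores higher than any of its parents" is one common way to express hierarchical consistency for real-valued scores, assumed here rather than quoted from the text.

```python
from typing import Dict, Set

# Toy DAG (hypothetical terms): each term maps to the set of its parents.
# "C" has two parents, which a tree-structured taxonomy would not allow.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
}


def ancestors(term: str) -> Set[str]:
    """Return every ancestor of `term` by following parent links upward."""
    found: Set[str] = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


def propagate(annotations: Set[str]) -> Set[str]:
    """Apply the true path rule: add all ancestors of each annotated term."""
    closed = set(annotations)
    for term in annotations:
        closed |= ancestors(term)
    return closed


def is_consistent(scores: Dict[str, float]) -> bool:
    """Check that no term receives a higher score than any of its parents."""
    return all(
        scores[term] <= scores[parent]
        for term, parents in PARENTS.items()
        for parent in parents
    )


# A gene annotated only with "C" is implicitly annotated with A, B and root.
closed_annotations = propagate({"C"})

# Hypothetical flat scores for one gene: "C" (0.7) outscores its parent "A" (0.3).
flat_scores = {"root": 1.0, "A": 0.3, "B": 0.9, "C": 0.7}
flat_ok = is_consistent(flat_scores)
```

Here `closed_annotations` contains all four terms, while `flat_ok` is `False` because the score of `C` exceeds that of its parent `A`; this is the situation a hierarchical correction step is meant to remove.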
For example, if we use the HPO to predict gene–phenotype relations, a flat learner can associate the HPO term “Hyperplasia of metatarsal bones” with a gene but fail to associate the parent term “Abnormality of the metatarsal bones” with it, thus leading to an inconsistent prediction. In addition, flat methods do not exploit a priori knowledge about the topology of the ontology, which may reduce prediction accuracy.

To properly handle the hierarchical relationships between terms that characterize the HPO, we can apply two main classes of structured output methods, i.e. methods able to exploit the hierarchical structure of terms in the learning process [15]. The first class exploits joint input and output kernelization techniques based on large margin methods for structured and interdependent output variables [16, 17]. The second class is based on ensembles of learning machines able to exploit the hierarchical relationships between classes; theoretical studies [18], as well as applications in several domains [19], have shown the effectiveness of this approach. Both classes of methods have been applied to several bioinformatics problems, ranging from enzyme function prediction [17, 20] to the hierarchical prediction of Gene O (...truncated)