Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods
Notaro et al. BMC Bioinformatics (2017) 18:449
DOI 10.1186/s12859-017-1854-y
RESEARCH ARTICLE
Open Access
Marco Notaro1, Max Schubach2,6, Peter N. Robinson2,3,4,5 and Giorgio Valentini1*
Abstract
Background: The prediction of human gene–abnormal phenotype associations is a fundamental step toward the
discovery of novel genes associated with human disorders, especially when no genes are known to be associated
with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of
the abnormalities associated with human diseases. While the prediction of gene–disease associations
has been widely investigated, the related problem of predicting gene–phenotypic feature (i.e., HPO term) associations has been
largely overlooked, even though no HPO term associations are known for most human genes and despite the increasing
application of the HPO to relevant medical problems. Moreover, most of the methods proposed in the literature
cannot capture the hierarchical relationships between HPO terms, which results in inconsistent and relatively
inaccurate predictions.
Results: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent
predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, which
consists of a “flat” learning step followed by a hierarchical combination of the predictions, allows the
predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical
ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that
are competitive with state-of-the-art algorithms and with a significant reduction in computational complexity.
Conclusions: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful
predictions that obey the true path rule, and can be used to improve HPO term predictions and make them consistent,
starting from virtually any flat learning method. The implementation of the proposed methods is available
as an R package from the CRAN repository.
Keywords: Human Phenotype Ontology, Hierarchical multi-label classification, Hierarchical ensemble methods,
Gene-Abnormal phenotype association, Human Phenotype Ontology term prediction, Phenotype gene prioritization
Background
In contrast to its general meaning, which usually refers to
the traits or characteristics of an organism, the word
“phenotype” in medical contexts is defined as a deviation
from normal morphology, physiology, or behavior [1]. The
analysis of phenotypes is essential for understanding the
pathophysiology of cellular networks and plays a key role
in medical research and in the mapping of disease genes
[2, 3]. The Human Phenotype Ontology (HPO) project [4]
provides a standard categorization of the human abnormal phenotypes and of their semantic relationships. It is
worth noting that each HPO term does not represent a
disease, but rather denotes individual signs or symptoms
or other clinical abnormalities that characterize a disease.
The HPO is currently developed using the medical literature and the OMIM [5], Orphanet [6] and DECIPHER
[7] databases, and contains approximately 11,000 terms
and over 115,000 annotations to hereditary diseases. The
HPO is structured as a directed acyclic graph (DAG), where
more general terms are found at the top levels of the hierarchy and term specificity increases moving towards
the lower levels, i.e. from the root to the leaves.
As a consequence, unlike tree-structured taxonomies such as FunCat [8], each HPO term may have
more than one parent. The HPO is governed by the true path rule (also known as the annotation propagation rule) [2]: if
a gene is annotated with a given term, then it
is also annotated with all the “parent” terms of that term and, recursively, with all its
ancestors. Conversely, if a gene
is not annotated with a term, it cannot be annotated with any of its
descendants.
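As a minimal sketch of the true path rule, assume a toy DAG stored as a child-to-parents dictionary. The term names and the helpers `PARENTS`, `ancestors` and `propagate` are hypothetical illustrations; they are neither real HPO identifiers nor part of the authors' software. Under these assumptions, the upward propagation of annotations could look like this:

```python
from collections import deque

# Hypothetical toy DAG: each term maps to the list of its parents.
# The names are placeholders, not real HPO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # a term with two parents, which a DAG allows
    "D": ["C"],
}


def ancestors(term, parents):
    """Return all ancestors of `term`, excluding the term itself."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents[node])
    return seen


def propagate(direct_annotations, parents):
    """Close a set of direct annotations under the true path rule."""
    closed = set(direct_annotations)
    for term in direct_annotations:
        closed |= ancestors(term, parents)
    return closed


# A gene directly annotated with "D" is implicitly annotated
# with every term on every path from "D" up to the root.
annotated = propagate({"D"}, PARENTS)
```

Because "C" has two parents, the propagation follows both paths to the root, which is the main difference from a tree-structured taxonomy.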
While the prediction of gene–disease
associations has been widely investigated [9], the related
problem of gene–HPO term prediction has been considered in only a few studies [10], despite the fact that no HPO
term associations are known for most human genes, and despite
the rapidly growing application of the HPO to relevant
medical problems [11, 12].
“Flat” classification methods have been applied to
the prediction of gene–HPO term associations [13].
Unfortunately, these methods can introduce major
inconsistencies in the classification, because labels are
predicted independently, without taking into account the
hierarchical relationships within the ontology [14]. For
example, if we use the HPO to predict gene–phenotype
relations, a flat learner can associate the HPO term
“Hyperplasia of metatarsal bones” with a gene without
also associating the parent term “Abnormality of the
metatarsal bones”, thus leading to an inconsistent prediction. In addition, flat methods do not exploit a priori
knowledge about the topology of the ontology, which may
reduce the prediction accuracy.
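The inconsistency described above can be made concrete with a small check. The sketch below assumes per-term scores produced independently by some flat learner; the score values, the 0.5 threshold and the function name `violations` are illustrative assumptions, and the two-term hierarchy is a toy fragment, not the full HPO. It only detects violations of the true path rule and does not reproduce the authors' hierarchical ensemble methods.

```python
# Toy fragment of the hierarchy: each term maps to its parents.
PARENTS = {
    "Abnormality of the metatarsal bones": [],
    "Hyperplasia of metatarsal bones": ["Abnormality of the metatarsal bones"],
}

# Hypothetical flat scores for a single gene (made-up values).
flat_scores = {
    "Abnormality of the metatarsal bones": 0.30,
    "Hyperplasia of metatarsal bones": 0.80,
}


def violations(scores, parents, threshold=0.5):
    """Return (child, parent) pairs where the child is predicted
    positive but its parent is predicted negative."""
    found = []
    for child, parent_list in parents.items():
        for parent in parent_list:
            if scores[child] >= threshold and scores[parent] < threshold:
                found.append((child, parent))
    return found


inconsistent = violations(flat_scores, PARENTS)
```

With these made-up scores, the child term is predicted while its parent is not, so the pair is reported as inconsistent; raising the parent's score above the threshold makes the check pass.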
To properly handle the hierarchical relationships
between terms that characterize the HPO, we can apply
two main classes of structured output methods, i.e. methods able to exploit the hierarchical
structure of terms in the learning process [15]. The first class
exploits joint input and output kernelization techniques
based on large margin methods for structured and interdependent output variables [16, 17]. The second
class is based on ensembles of learning machines able to exploit the hierarchical
relationships between classes; theoretical studies [18], as
well as applications in several domains [19], have shown the
effectiveness of this approach. Both classes of methods have been applied to several bioinformatics problems,
ranging from enzyme function prediction [17, 20] to the
hierarchical prediction of Gene O (...truncated)