Machine learning for discovering missing or wrong protein function annotations
(2019) 20:485
Nakano et al. BMC Bioinformatics
https://doi.org/10.1186/s12859-019-3060-6
RESEARCH ARTICLE
Open Access
Machine learning for discovering
missing or wrong protein function
annotations
A comparison using updated benchmark datasets
Felipe Kenji Nakano1,2*, Mathias Lietaert3 and Celine Vens1,2
Abstract
Background: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all
sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to
automatically annotate new protein functions. More specifically, many studies have investigated hierarchical
multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene
Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade
ago, and thus train their models on outdated information. In this work, we provide an updated version of these
datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We
compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether
the predictive models are able to discover new or wrong annotations, by training them on the old data and
evaluating their results against the most recent information.
Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in
2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery
of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy,
whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic
algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble
once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting
removed annotations. However, in this evaluation, there were fewer significant differences among the methods.
Conclusions: The experiments have shown that protein function prediction is a very challenging task which should
be further investigated. We believe that the baseline results associated with the updated datasets provided in this
work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be
disregarded since other tasks in machine learning could benefit from them.
Keywords: Hierarchical multi-label classification, Protein function prediction, Benchmark datasets
*Correspondence:
1 KU Leuven, Campus KULAK - Department of Public Health and Primary Care,
Etienne Sabbelaan 53, 8500, Kortrijk, Belgium
2 ITEC - imec, Etienne Sabbelaan 51, 8500, Kortrijk, Belgium
Full list of author information is available at the end of the article
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
Due to technological advancements, the generation of
proteomic data has increased substantially. However,
annotating all sequences is costly and time-consuming,
often making it unfeasible [1]. As a countermeasure,
recent studies have employed machine learning methods
for their capacity to predict protein functions
automatically.
More specifically, protein function prediction is generally modeled as a hierarchical multi-label classification
(HMC) task. HMC is a classification task whose objective is to fit a predictive model f which maps a set of
instances X to a set of hierarchically organized labels Y,
while respecting hierarchy constraints among Y [2, 3].
The hierarchy constraint states that whenever a particular
label y_i is predicted, all ancestor labels of y_i up to the root
node of the hierarchy must be predicted as well.
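The hierarchy constraint can be illustrated with a minimal Python sketch. It assumes a tree-shaped hierarchy stored as a child-to-parent map; the label codes and the helper names `close_under_ancestors` and `satisfies_constraint` are hypothetical and are not taken from any of the compared methods.

```python
from typing import Dict, Optional, Set

# Hypothetical toy tree: each label maps to its single parent.
# None marks a label directly below the (implicit) root.
PARENT: Dict[str, Optional[str]] = {
    "01": None,
    "01.01": "01",
    "01.01.03": "01.01",
    "02": None,
    "02.07": "02",
}


def close_under_ancestors(predicted: Set[str],
                          parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the predicted labels plus all of their ancestors."""
    closed: Set[str] = set()
    for label in predicted:
        node: Optional[str] = label
        # Walk upwards until the root, or until a label whose
        # ancestors have already been added.
        while node is not None and node not in closed:
            closed.add(node)
            node = parent[node]
    return closed


def satisfies_constraint(predicted: Set[str],
                         parent: Dict[str, Optional[str]]) -> bool:
    """True if every predicted label comes with all of its ancestors."""
    return close_under_ancestors(predicted, parent) == set(predicted)
```

In this sketch, predicting only `"02.07"` violates the constraint, whereas predicting `{"02", "02.07"}` satisfies it.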
In the machine learning literature, a newly proposed
method is typically compared to a set of
competitor methods on benchmark datasets. For HMC,
many studies [2–22] utilized the benchmark datasets
proposed in [2]. These datasets are available at
https://dtai.cs.kuleuven.be/clus/hmcdatasets/ and contain protein sequences from the species Saccharomyces cerevisiae
(yeast) whose functions are mapped to either the Functional Catalogue (FunCat) [24] or Gene Ontology (GO)
[23]. The task associated with these datasets is to predict
the functions of a protein, given a set of descriptive features (e.g., sequence, homology or structural information).
FunCat and GO are different types of hierarchies. In
FunCat (Fig. 1), labels are structured as a tree, meaning
that they can have only a single parent label [24]. The
GO (Fig. 2), however, allows labels to have multiple parent
labels, forming a directed acyclic graph [23]. This complicates the fulfillment of the hierarchy constraint, since
multiple classification paths are allowed throughout the
graph.
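The difference between the two hierarchy types can be made concrete with a small Python sketch. It assumes a directed acyclic graph stored as a map from each label to its list of parents; the label names and the helpers `ancestors` and `close_under_ancestors` are hypothetical. Because a label may be reached through several paths, all parents have to be traversed to fulfil the hierarchy constraint.

```python
from typing import Dict, List, Set

# Hypothetical toy DAG: each label maps to the list of its parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents: root -> A -> C and root -> B -> C
    "D": ["C"],
}


def ancestors(label: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of a label, following all parent links."""
    seen: Set[str] = set()
    stack = list(parents[label])
    while stack:
        node = stack.pop()
        if node not in seen:  # a node can be reached via several paths
            seen.add(node)
            stack.extend(parents[node])
    return seen


def close_under_ancestors(predicted: Set[str],
                          parents: Dict[str, List[str]]) -> Set[str]:
    """Return the predicted labels plus all ancestors along every path."""
    closed = set(predicted)
    for label in predicted:
        closed |= ancestors(label, parents)
    return closed
```

In a tree, each label has exactly one path to the root; here, predicting `"C"` requires both `"A"` and `"B"` to be predicted as well.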
These benchmark datasets were introduced to the HMC
community in 2007, and, thus, the functional labels associated with each protein can be considered outdated.
There are two reasons for this. First, functional annotations are updated on a regular basis. Second, as can be
seen in Fig. 3a, there was a drastic increase in the number
of terms throughout the Gene Ontology since the creation
of these datasets (January 2007). A similar observation can
be made for the number of obsolete terms as shown in
Fig. 3b. Accordingly, one of the main goals of this article
is to provide updated versions of these widely used HMC
benchmark datasets to the research community.
Using these new datasets, we present a comparison
among four recent and open-source HMC methods that
can be considered state-of-the-art, thus providing baseline performances as guidelines for future research on this
topic. Finally, having two different versions of the same
datasets provides us with a unique opportunity to
evaluate whether these HMC methods are able
to generalize when learning from data with mislabeled
instances. In particular, we evaluate whether they were
able to predict the correct label in cases where the label
has been altered since 2007. In order to do so, we propose an evaluation procedure where a predictive model
is trained using the data from 2007, but tested with data
from 2018.
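The idea behind this procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the evaluation measure used in this article: annotations are represented as a map from protein identifier to a set of labels, predictions are assumed to be already thresholded into label sets, and the names `annotation_changes` and `discovery_rates` are hypothetical.

```python
from typing import Dict, Set, Tuple

Annotations = Dict[str, Set[str]]  # protein id -> set of labels
Pair = Tuple[str, str]             # (protein id, label)


def annotation_changes(old: Annotations,
                       new: Annotations) -> Tuple[Set[Pair], Set[Pair]]:
    """Return the (protein, label) pairs added to and removed from `old`."""
    added: Set[Pair] = set()
    removed: Set[Pair] = set()
    for protein in old.keys() & new.keys():  # proteins in both versions
        added |= {(protein, lab) for lab in new[protein] - old[protein]}
        removed |= {(protein, lab) for lab in old[protein] - new[protein]}
    return added, removed


def discovery_rates(predicted: Annotations, old: Annotations,
                    new: Annotations) -> Tuple[float, float]:
    """Fraction of added labels that were predicted, and fraction of
    removed labels that were correctly left out of the prediction."""
    added, removed = annotation_changes(old, new)
    hit_added = sum(1 for p, lab in added if lab in predicted.get(p, set()))
    hit_removed = sum(1 for p, lab in removed
                      if lab not in predicted.get(p, set()))
    rate_added = hit_added / len(added) if added else float("nan")
    rate_removed = hit_removed / len(removed) if removed else float("nan")
    return rate_added, rate_removed


# Hypothetical toy data: the old version plays the role of the training
# labels, the new version the role of the test labels.
old = {"P1": {"01", "01.01"}, "P2": {"02"}}
new = {"P1": {"01"}, "P2": {"02", "02.07"}}
predicted = {"P1": {"01", "01.01"}, "P2": {"02", "02.07"}}
rates = discovery_rates(predicted, old, new)
```

In this toy case, the model recovers the annotation added to `P2` but still predicts the annotation that was removed from `P1`.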
The major contributions of this work are the following: i) We provide new (...truncated)