Machine learning for discovering missing or wrong protein function annotations (pdf)

Nakano et al. BMC Bioinformatics (2019) 20:485
https://doi.org/10.1186/s12859-019-3060-6

RESEARCH ARTICLE Open Access

Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets

Felipe Kenji Nakano1,2*, Mathias Lietaert3 and Celine Vens1,2

Abstract

Background: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information.

Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed.
In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, the differences among the methods were less significant.

Conclusions: The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.

Keywords: Hierarchical multi-label classification, Protein function prediction, Benchmark datasets

*Correspondence:
1 KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, 8500, Kortrijk, Belgium
2 ITEC - imec, Etienne Sabbelaan 51, 8500, Kortrijk, Belgium
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Due to technological advancements, the generation of proteomic data has increased substantially. However, annotating all sequences is costly and time-consuming, which often makes it unfeasible [1].
As a countermeasure, recent studies have employed machine learning methods for their capacity to predict protein functions automatically. More specifically, protein function prediction is generally modeled as a hierarchical multi-label classification (HMC) task. HMC is a classification task whose objective is to fit a predictive model f which maps a set of instances X to a set of hierarchically organized labels Y, while respecting the hierarchy constraints among Y [2, 3]. The hierarchy constraint states that whenever a particular label yi is predicted, all ancestor labels of yi up to the root node of the hierarchy must be predicted as well.

In the machine learning literature, a newly proposed method is typically compared to a set of competitor methods on benchmark datasets. For HMC, many studies [2–22] utilized the benchmark datasets proposed in [2]. These datasets are available at https://dtai.cs.kuleuven.be/clus/hmcdatasets/ and contain protein sequences from the species Saccharomyces cerevisiae (yeast) whose functions are mapped to either the Functional Catalogue (FunCat) [24] or the Gene Ontology (GO) [23]. The task associated with these datasets is to predict the functions of a protein, given a set of descriptive features (e.g., sequence, homology or structural information).

FunCat and GO are different types of hierarchies. In FunCat (Fig. 1), labels are structured as a tree, meaning that each label can have only a single parent label [24]. The GO (Fig. 2), however, allows labels to have multiple parent labels, forming a directed acyclic graph [23]. This complicates the fulfillment of the hierarchy constraint, since multiple classification paths are allowed throughout the graph.

These benchmark datasets were introduced to the HMC community in 2007, and thus the functional labels associated with each protein can be considered outdated. There are two reasons for this. First, functional annotations are updated on a regular basis.
Second, as can be seen in Fig. 3a, there has been a drastic increase in the number of terms throughout the Gene Ontology since the creation of these datasets (January 2007). A similar observation can be made for the number of obsolete terms, as shown in Fig. 3b.

Accordingly, one of the main goals of this article is to provide updated versions of these widely used HMC benchmark datasets to the research community. Using these new datasets, we present a comparison among four recent and open-source HMC methods that can be considered state-of-the-art, thus providing baseline performances as guidelines for future research on this topic.

Finally, having two different versions of the same datasets gives us the unique opportunity to evaluate whether these HMC methods are able to generalize when learning from data with mislabeled instances. In particular, we evaluate whether they are able to predict the correct label in cases where the label has been altered since 2007. In order to do so, we propose an evaluation procedure where a predictive model is trained using the data from 2007, but tested with data from 2018.

The major contributions of this work are the following: i) We provide new (...truncated)