Unsupervised encoding selection through ensemble pruning for biomedical classification
(2023) 16:10
Spänig et al. BioData Mining
https://doi.org/10.1186/s13040-022-00317-7
BioData Mining
Open Access
METHODOLOGY
Sebastian Spänig, Alexander Michel and Dominik Heider*
*Correspondence:
Data Science in Biomedicine,
Department of Mathematics
and Computer Science,
University of Marburg, Marburg,
Germany
Abstract
Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial
peptides, an alternative strategy to classic antibiotics, have attracted increasing attention. A crucial
yet costly step is their identification and validation. With the ever-growing amount
of annotated peptides, researchers leverage artificial intelligence to circumvent the
cumbersome, wet-lab-based identification and automate the detection of promising
candidates. However, the prediction of a peptide’s function is not limited to antimicrobial efficiency. To date, multiple studies have successfully classified additional properties,
e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed
to further improve the prediction. Although we recently presented a workflow
that significantly narrows down the initial encoding choice, an entirely unsupervised encoding
selection that considers various machine learning models is still lacking.
Results: We developed a workflow that automatically selects encodings and generates
classifier ensembles by employing sophisticated pruning methods. We observed that
Pareto frontier pruning is a good method to create encoding ensembles for the
datasets at hand. In addition, encodings combined with the Decision Tree classifier as
the base model are often superior. However, our results also demonstrate that none of
the ensemble building techniques is outstanding for all datasets.
Conclusion: The workflow applies multiple pruning methods to evaluate ensemble
classifiers composed of a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and
ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the
PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
Keywords: Biomedical classification, Antimicrobial peptides, Encodings, Machine
learning, Ensemble learning
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Background
Multi-resistant pathogens are a major threat to modern society [1]. In the last decades, a rising number of bacterial species have developed mechanisms to evade
widely used antibiotics [1]. The importance of developing and implementing alternative
strategies is further underpinned by a recent study, which detected a certain baseline
resistance in European freshwater lakes [2]. The study confirmed resistance specifically
against four critical drug classes in human and veterinary health in freshwater, which is
typically considered a pathogen-free environment [2]. Moreover, the already concerning
levels of antibiotic resistance in Indian and Chinese lakes emphasize the need
for alternative biocides [3, 4]. One promising approach to replace or even support common antibiotics is the deployment of peptides with antimicrobial efficiency [5].
However, identifying and validating active peptides requires intensive and, hence, costly and
time-consuming wet-lab work. Thus, in the pre-artificial intelligence (AI) era, researchers had to classify and verify antimicrobial peptides (AMPs) manually. Although the in vitro confirmation of activity is still necessary, the application of AI,
in particular machine learning (ML) algorithms, simplifies the identification process
drastically and has pushed several AMPs to the second or third phase of clinical trials [6]. In
addition, online databases provide access to thousands of annotated sequences and pave
the way for automatic peptide design and classification [7]. For instance, Chung et al. (2019)
developed a two-step method that demonstrated good performance in classifying AMPs
by first predicting efficiency and afterward the precise
target activity [8]. Another study employed a variational autoencoder to encode AMPs,
mapped the probability of being active to a latent space, and predicted novel AMPs [9].
Fingerhut et al. (2020) introduced an algorithm to detect AMPs from genomic data [10].
For more information on computational approaches for AMP classification, we refer to
the recent review of Aronica et al. (2021) [11].
However, the prediction of amino acid sequence features is not limited to AMPs. In
the literature, one can find various applications, e.g., in oncology for predicting anticancer peptides [12], in pharmacology for the discovery and application of cell-penetrating
peptides as transporters for molecules [13], or in immunotherapy, for classifying pro- or anti-inflammatory peptides [14, 15]. Other applications include antiviral peptides [16],
or peptides with hemolytic [17] or neuro transmitting activity [18].
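All of these sequence-based predictions presuppose that a peptide of arbitrary length can be turned into a numerical vector of fixed length. As a minimal sketch of this idea, the following Python snippet computes the amino acid composition, i.e., the relative frequency of each of the 20 standard residues. The function name and the two example sequences are chosen for illustration only and are not taken from the workflow described in this article.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed order = fixed vector layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Map a peptide of any length to a 20-dimensional frequency vector.

    Hypothetical helper for illustration. Non-standard residues are not
    represented in the vector but still count towards the sequence length.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


# Two illustrative sequences of different lengths (assumed example input).
peptides = ["GIGKFLHSAKKFGKAFVGEIMNS", "KWKLFKKI"]
vectors = [amino_acid_composition(p) for p in peptides]

# Regardless of the peptide length, every vector has the same dimension.
print([len(v) for v in vectors])
```

Because both vectors have the same length, they can be stacked into a feature matrix and passed to a standard ML classifier, whereas the raw sequences cannot.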
Unequivocally, the success of ML methods for the prediction of AMPs was enabled by
the development and advances of peptide encodings. Encodings are algorithms mapping
the amino acid sequences of different lengths to numerical vectors of an equal length,
hence, fulfilling the requirement of many ML algorithms [19]. Moreover, peptides or
proteins can be described by their primary structure, i.e., the amino acid sequence, and
the aggregation in higher dimensions, denoted as the secondary or tertiary structure.
Encodings derived from the primary structure are known as sequence-based encodings, whereas encodings
describing higher-order folding are known as structure-based encodings. To date, a large number of sequence- and structure-based encodings have been introduced and employed in
various studies [19]. A significant amount of encodings has been recently (...truncated)