PDAUG: a Galaxy based toolset for peptide library analysis, visualization, and machine learning modeling
Joshi and Blankenberg BMC Bioinformatics (2022) 23:197
https://doi.org/10.1186/s12859-022-04727-6
BMC Bioinformatics
Open Access
SOFTWARE
Jayadev Joshi1 and Daniel Blankenberg1,2*
*Correspondence:
1 Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
Full list of author information is available at the end of the article
Abstract
Background: Computational methods for the initial screening and prediction of
peptides with desired functions have proven to be effective alternatives to the lengthy and
expensive biochemical experiments traditionally used in peptide research,
saving time and effort. However, for many researchers, limited expertise with programming
libraries and limited access to computational resources and flexible pipelines
are major hurdles to adopting these advanced methods.
Results: To address these barriers, we have implemented the peptide
design and analysis under Galaxy (PDAUG) package, a Galaxy-based, Python-powered
collection of tools, workflows, and datasets for rapid in-silico peptide library analysis.
In contrast to existing methods such as standard programming libraries or rigid
single-function web-based tools, PDAUG offers an integrated GUI-based toolset that provides
the flexibility to build and distribute reproducible pipelines and workflows without
programming expertise. Finally, we demonstrate the usability of PDAUG by predicting
anticancer properties of peptides using four different feature sets and assessing the
suitability of various machine learning (ML) algorithms.
Conclusion: PDAUG offers tools for peptide library generation, data visualization,
built-in and public database peptide sequence retrieval, peptide feature calculation,
and machine learning (ML) modeling. Additionally, this toolset enables researchers
to combine PDAUG with hundreds of compatible existing Galaxy tools, supporting a wide
range of analytic strategies.
Introduction
Interest in peptide-related research has grown over the last several decades [35].
A large number of naturally occurring peptides (over 7000) with
potentially important roles in human physiology have been identified. Currently,
more than 140 peptide therapeutics are in different stages of clinical trials [17].
Given their integral importance in a number of signal transduction pathways, peptides
are ideal candidates for functioning as drugs, especially as anticancer or antimicrobial
agents [1]. Usually, peptides are naturally occurring molecules that are synthesized by
cellular processes and adopt alternative conformations according to their biological
functions [35]. Peptides can either act as natural ligands in the form of cofactors,
coenzymes, and hormones, or directly interact with macromolecules including proteins,
RNA, or DNA [15]. The research underlying the design of therapeutic peptides,
such as peptide-based drugs and vaccines, demands intense effort and resources to
establish pharmacokinetic and pharmacodynamic properties such as serum
stability, bioavailability, and toxicity [7, 44]. Peptide-based vaccines have emerged as a
powerful approach to counter infectious diseases and cancer [37]. Characterization of
peptides that bind to specific major histocompatibility complex (MHC) molecules is
therefore of great importance for peptide-based vaccines. Because biochemical experiments
are expensive and lengthy, bioinformatics methods for predicting MHC-binding peptides
have become very popular in recent years [24, 28, 45]. Various computational approaches
have been shown to offer the best cost–benefit ratio
across translational research areas [50, 59, 60]. Leveraging in-silico approaches to
uncover peptides with desired pharmacological action can be expected to significantly
lower the cost and time required to establish a drug or a vaccine candidate [34]. In
fact, computational prediction of peptides with desired functions has provided an
effective alternative to traditional methods in peptide research, thus saving time
and effort [5, 22, 25, 33, 39, 52]. The concept of modeling the properties of a protein
sequence as a function of sequence-derived features is not new [29].
Over the past decade, approaches based on physicochemical properties, compositional
properties, k-mer counting, etc. have been proposed [10, 51, 62]. With the rise of
computational power, feature-based methods have evolved substantially, expanding into the
analysis of biomolecules at the 3D-structure level [23]. However, the necessary programming
and mathematics expertise, as well as limitations in hardware resources, are among
the core challenges associated with utilizing programming-based resources [30, 49].
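To make the notion of sequence-derived features concrete, the following minimal Python sketch computes two of the feature types named above: amino acid composition and k-mer counts. It is an illustration only and is not taken from PDAUG; the function names `aa_composition` and `kmer_counts` and the example peptide are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the fraction of each standard amino acid in seq.

    Hypothetical helper for illustration; not part of PDAUG.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def kmer_counts(seq, k=2):
    """Count overlapping k-mers (substrings of length k) in seq.

    Hypothetical helper for illustration; not part of PDAUG.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up example peptide, used only to exercise the functions.
peptide = "GLFDIVKKVV"
composition = aa_composition(peptide)
dipeptides = kmer_counts(peptide, k=2)

print(f"Fraction of valine: {composition['V']:.2f}")
print(f"Distinct 2-mers: {len(dipeptides)}")
```

Each peptide is thereby mapped to a fixed-length numeric vector (composition) or a sparse count vector (k-mers), which is the form most ML algorithms expect as input.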
Web-based data analysis platforms, such as Galaxy [2, 19, 26], provide a user-friendly
way for researchers to include advanced data analysis
methods in their work. Galaxy is an open-source, web-based platform for accessible,
reproducible, and transparent computational research. It provides a wealth of
computational tools, workflows, and training materials for advanced data visualization and
analysis.
In this paper, we present PDAUG, a Galaxy tool suite that includes 24 different tools
for the analysis of peptide libraries. The main objective of this work is to provide
user-friendly tools for peptide library generation, visualization, machine learning (ML)
modeling, and analysis. The tools span several categories, including peptide library
generation, feature analysis, data visualization and plotting, ML modeling, and dataset
retrieval. These modular command-line tools leverage the Galaxy
platform to provide an interactive graphical interface for each tool as well as a (...truncated)