HemoFuse: multi-feature fusion based on multi-head cross-attention for identification of hemolytic peptides
www.nature.com/scientificreports
Ya Zhao1, Shengli Zhang1 & Yunyun Liang2
Hemolytic peptides are therapeutic peptides that damage red blood cells. However, therapeutic
peptides used in medical treatment must exhibit low toxicity to red blood cells to achieve the desired
therapeutic effect. Therefore, accurate prediction of the hemolytic activity of therapeutic peptides is
essential for the development of peptide therapies. In this study, a multi-feature cross-fusion model,
HemoFuse, for hemolytic peptide identification is proposed. Peptide sequences are converted into feature
vectors by a word embedding technique and four hand-crafted feature extraction methods. We
apply the multi-head cross-attention mechanism to hemolytic peptide identification for the first time. It
captures the interaction between word embedding features and hand-crafted features by calculating
the attention between all positions in them, so that multiple features can be deeply fused. Moreover, we
visualize the features obtained by this module to enhance its interpretability. On the comprehensive
integrated dataset, HemoFuse achieves competitive results, with ACC, SP, SN, MCC, F1, AUC, and AP of
0.7575, 0.8814, 0.5793, 0.4909, 0.6620, 0.8387, and 0.7118, respectively. Compared with HemoDL
proposed by Yang et al., these values are 3.32%, 3.89%, 5.93%, 10.6%, 8.17%, 5.88%, and 2.72% higher,
respectively. Ablation experiments further indicate that our model is reasonable and efficient. The code and
datasets are accessible at https://github.com/z11code/Hemo.
Keywords: Hemolytic peptides, Transformer, Feature fusion, Multi-head cross-attention mechanism
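The cross-attention fusion summarized in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the choice of word embedding features as queries and hand-crafted features as keys and values, the dimensions, the random weights, and all function names (`cross_attention`, `random_matrix`) are assumptions made only for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head cross-attention: queries attend to every position of context."""
    d_model = len(w_q[0])
    d_head = d_model // num_heads
    q, k, v = matmul(queries, w_q), matmul(context, w_k), matmul(context, w_v)
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q_h = [row[cols] for row in q]
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        # Scaled dot product of each query position with each context position.
        k_h_t = [list(col) for col in zip(*k_h)]
        scores = matmul(q_h, k_h_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_weights.append(weights)
        head_outputs.append(matmul(weights, v_h))
    # Concatenate the heads along the feature axis, then apply the output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(queries))]
    return matmul(concat, w_o), head_weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, num_heads = 8, 2                            # assumed sizes
embedding_feats = random_matrix(5, d_model, rng)     # assumed: 5 positions of embedding features
handcrafted_feats = random_matrix(3, d_model, rng)   # assumed: 3 positions of hand-crafted features
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model, rng) for _ in range(4))
fused, attn = cross_attention(embedding_feats, handcrafted_feats,
                              w_q, w_k, w_v, w_o, num_heads)
```

Each row of `attn[h]` is a probability distribution over the hand-crafted feature positions, which is the kind of quantity that can be visualized to inspect how the two feature sets interact.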
Therapeutic peptides are widely favored in the medical and pharmaceutical fields because of their
high permeability and mild side effects. However, some of their characteristics are double-edged, and their
therapeutic effect will be affected if these characteristics cannot be controlled within a certain range1,2. For
example, the hemolytic activity of therapeutic peptides refers to their ability to bind to red blood cells, allowing
water and other solute molecules to enter red blood cells, thereby increasing the osmotic pressure gradient
inside red blood cells and causing them to swell or even rupture3. Therapeutic peptides with
high hemolytic activity can therefore be useful for destroying cancerous cells and for targeted drug delivery. But for
other therapeutic tasks, the hemolytic activity of therapeutic peptides must be reduced so that they can stably
perform a specific function without damaging normal cells4. Therefore, accurate prediction of the hemolytic
activity of therapeutic peptides helps to promote the development of peptide drugs. Hemolytik
(http://crdd.osdd.net/raghava/hemolytik/) is a complete database of 3000 experimentally verified hemolytic peptides and
non-hemolytic peptides5. The authors evaluated the hemolytic activity of peptide sequences on 17 different red
blood cells and provided information related to each peptide sequence and its hemolytic activity. DBAASP is
a constantly updated antimicrobial peptide database containing information on the bioactivity and toxicity of
peptide sequences, which can be used for the study of hemolytic activity6. The establishment of these databases
helps us to develop deep learning-based sequencing technologies that cost less and take less time
than traditional biological experiments.
1School of Mathematics and Statistics, Xidian University, Xi’an 710071, P. R. China. 2School of Science, Xi’an
Polytechnic University, Xi’an 710048, P. R. China.
Scientific Reports | (2024) 14:22518 | https://doi.org/10.1038/s41598-024-74326-3
At present, the feature extraction methods used in identification models based on biological sequences can
be broadly divided into two types: traditional hand-crafted feature extraction methods and feature extraction
methods based on natural language processing technology. Hand-crafted feature extraction methods, such as
binary encoding7, k-mer8, and quasi-sequence-order (QSO)9, are designed by humans and offer low computational
complexity and strong interpretability. However, their effectiveness depends on the characteristics of the data
itself, requiring the selection of appropriate descriptors based on the dataset’s properties. Feature extraction
methods based on natural language processing can uncover structures and patterns that are difficult for humans
to detect and are less influenced by the dataset’s inherent characteristics. Nevertheless, their drawback is that
they struggle to learn effective features when the data quality is poor. Word embedding10 is one of the most basic
feature extraction methods in natural language processing, and many models are built on it, such as BERT11,12.
In addition, there are pre-trained language models such as ProtTrans13 and evolutionary scale modeling
(ESM)14. Many models also combine both types of methods in order to extract more comprehensive information15.
Machine learning and deep learning algorithms are used for further feature mining and classification. Common
machine learning methods include AdaBoost16, random forest (RF)17, and hidden Markov model (HMM)18.
While they are simple and easy to understand, they are not well-suited for handling large, high-dimensional
data, resulting in lower model accuracy. In contrast, deep learning models, such as the convolutional neural
network (CNN)19, capsule network20, recurrent neural network (RNN)21, and transformer22,23, learn features
automatically and have demonstrated exceptional performance, far surpassing traditional machine learning
algorithms. Machine learning and deep learning each have their own unique advantages, and sometimes work well when
combined24.
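Two of the hand-crafted descriptors named above, binary encoding and k-mer composition, can be sketched in a few lines of Python. This is a generic illustration of the usual definitions, not necessarily the exact variants used in the cited works or in HemoFuse; the function names, the alphabet ordering, and the example peptide are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical by letter

def binary_encode(peptide):
    """Binary (one-hot) encoding: one 20-dimensional 0/1 vector per residue.

    Raises KeyError for non-standard residues; handling them is left out here.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in peptide:
        vec = [0] * len(AMINO_ACIDS)
        vec[index[aa]] = 1
        encoded.append(vec)
    return encoded

def kmer_frequencies(peptide, k=2):
    """Normalized k-mer composition: a fixed-length vector of 20**k frequencies."""
    n_windows = len(peptide) - k + 1
    counts = Counter(peptide[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for peptides shorter than k
    return [counts["".join(kmer)] / total for kmer in product(AMINO_ACIDS, repeat=k)]

peptide = "ACDKLLA"                     # arbitrary example sequence
one_hot = binary_encode(peptide)        # 7 vectors of length 20
dipeptide = kmer_frequencies(peptide)   # 400 values
```

Binary encoding keeps positional information but its size grows with peptide length, whereas the k-mer vector has a fixed length regardless of the sequence, which is one reason the choice of descriptor depends on the dataset.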
Most of the existing identification models for hemolytic peptides use traditional hand-crafted features and
machine learning algorithms, such as HemoPI25, HemoPred26, HLPpred-Fuse27, HAPPENN28, and HemoPImod29.
The involved feature extraction methods collect the information of hemolytic peptides from various aspects such
as amino acid composition, peptide composition, physicochemical properties, and atomic descriptors. Classifiers
cover almost all common machine learning algorithms. However, the methods and datasets used by these models
are outdated. Moreover, HemoPImod is unable to identify hemolytic peptide sequences longer than 25 amino
acids. Language model-based methods did not start to appear until 2021. HemoNet30 was the first to employ the
SeqVec language model to capture the contextual features of amino acids, but it struggles to generalize well to
unseen data. With the development of stronger transformer models, AMPDeep31 was the first, in 2022, to use a
transformer-based pretrained model (Prot-BERT-BFD) to represent the features of peptide sequences. However, its (...truncated)