HemoFuse: multi-feature fusion based on multi-head cross-attention for identification of hemolytic peptides

www.nature.com/scientificreports OPEN

Ya Zhao1, Shengli Zhang1 & Yunyun Liang2

Hemolytic peptides are therapeutic peptides that damage red blood cells. However, therapeutic peptides used in medical treatment must exhibit low toxicity to red blood cells to achieve the desired therapeutic effect. Therefore, accurate prediction of the hemolytic activity of therapeutic peptides is essential for the development of peptide therapies. In this study, we propose HemoFuse, a multi-feature cross-fusion model for hemolytic peptide identification. Peptide sequences are converted into feature vectors by a word embedding technique and four hand-crafted feature extraction methods. We apply the multi-head cross-attention mechanism to hemolytic peptide identification for the first time. It captures the interaction between word embedding features and hand-crafted features by calculating the attention between all positions in them, so that multiple features can be deeply fused. Moreover, we visualize the features obtained by this module to enhance its interpretability. On the comprehensive integrated dataset, HemoFuse achieves ACC, SP, SN, MCC, F1, AUC, and AP of 0.7575, 0.8814, 0.5793, 0.4909, 0.6620, 0.8387, and 0.7118, respectively. Compared with HemoDL proposed by Yang et al., these values are 3.32%, 3.89%, 5.93%, 10.6%, 8.17%, 5.88%, and 2.72% higher, respectively. Ablation experiments also show that our model is reasonable and efficient. The codes and datasets are accessible at https://github.com/z11code/Hemo.

Keywords: Hemolytic peptides, Transformer, Feature fusion, Multi-head cross-attention mechanism

Therapeutic peptides are widely favored by the medical and pharmaceutical fields due to their high permeability and few side effects.
However, some of their characteristics are double-edged, and their therapeutic effect suffers if these characteristics cannot be controlled within a certain range [1,2]. For example, the hemolytic activity of therapeutic peptides refers to their ability to bind to red blood cells, allowing water and other solute molecules to enter the cells, thereby increasing the osmotic pressure gradient inside them and causing them to swell or even rupture [3]. Therapeutic peptides with high hemolytic activity can therefore serve functions such as destroying cancerous cells and targeted delivery of drugs. For other therapeutic tasks, however, the hemolytic activity of therapeutic peptides must be reduced so that they can stably express a specific function without damaging normal cells [4]. Therefore, accurate prediction of the hemolytic activity of therapeutic peptides helps promote the development of peptide drugs.

Hemolytik (http://crdd.osdd.net/raghava/hemolytik/) is a complete database of 3000 experimentally verified hemolytic peptides and non-hemolytic peptides [5]. The authors evaluated the hemolytic activity of peptide sequences on 17 different red blood cells and provided information related to each peptide sequence and its hemolytic activity. DBAASP is a constantly updated antimicrobial peptide database containing information on the bioactivity and toxicity of peptide sequences, which can be used for the study of hemolytic activity [6]. The establishment of these databases helps us to develop deep learning-based sequencing technologies that cost less and take less time than traditional biological experiments.

At present, the feature extraction methods used in identification models based on biological sequences can be broadly divided into two types: traditional hand-crafted feature extraction methods and feature extraction methods based on natural language processing technology.
Hand-crafted feature extraction methods, such as binary encoding [7], kmer [8], and quasi-sequence-order (QSO) [9], are designed by humans and offer low computational complexity and strong interpretability. However, their effectiveness depends on the characteristics of the data itself, requiring the selection of appropriate descriptors based on the dataset’s properties. Feature extraction methods based on natural language can uncover structures and patterns that are difficult for humans to detect and are less influenced by the dataset’s inherent characteristics. Nevertheless, their drawback is that they struggle to learn effective features when the data quality is poor. Word embedding [10] is one of the most basic feature extraction methods in natural language processing, and many models are built on it, such as BERT [11,12]. In addition, there are pre-trained language models such as ProtTrans [13] and evolutionary scale modeling (ESM) [14]. There are also many models that combine both types of methods in order to extract more comprehensive information [15].

1 School of Mathematics and Statistics, Xidian University, Xi’an 710071, P. R. China. 2 School of Science, Xi’an Polytechnic University, Xi’an 710048, P. R. China.

Scientific Reports | (2024) 14:22518 | https://doi.org/10.1038/s41598-024-74326-3

Machine learning and deep learning algorithms are used for further feature mining and classification. Common machine learning methods include AdaBoost [16], random forest (RF) [17], and the hidden Markov model (HMM) [18]. While they are simple and easy to understand, they are not well suited to handling large, high-dimensional data, resulting in lower model accuracy. In contrast, deep learning models such as the convolutional neural network (CNN) [19], capsule network [20], recurrent neural network (RNN) [21], and transformer [22,23] learn features automatically and have demonstrated exceptional performance, far surpassing traditional machine learning algorithms.
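To make the hand-crafted descriptors named above more concrete, the following is a minimal sketch of two of them, binary (one-hot) encoding and k-mer composition, for a single peptide sequence. It is an illustration only and is not taken from the HemoFuse code; the function names `binary_encoding` and `kmer_composition`, the residue ordering in `AMINO_ACIDS`, the normalization by the number of k-mers, and the toy peptide are all assumptions, and the sketch assumes sequences contain only the 20 standard amino acids.

```python
from itertools import product

# Assumed residue ordering: the 20 standard amino acids, alphabetical by letter.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encoding(seq):
    """Encode each residue as a 20-dimensional 0/1 vector (one row per residue)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


def kmer_composition(seq, k=2):
    """Return the normalized frequency of every possible k-mer (20**k values)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        return [0.0] * len(kmers)  # sequence shorter than k: all-zero vector
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    peptide = "GIGAVLKV"  # arbitrary toy sequence, not from any dataset
    onehot = binary_encoding(peptide)
    dipeptides = kmer_composition(peptide, k=2)
    print(len(onehot), len(onehot[0]))  # 8 20
    print(len(dipeptides))              # 400
```

Binary encoding keeps positional information but produces a matrix whose size depends on the peptide length, whereas k-mer composition yields a fixed-length vector and discards position. This trade-off is one reason why descriptor choice depends on the dataset.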
Machine learning and deep learning each have their own unique advantages, and sometimes work well when combined [24].

Most of the existing identification models for hemolytic peptides use traditional hand-crafted features and machine learning algorithms, such as HemoPI [25], HemoPred [26], HLPpred-Fuse [27], HAPPENN [28], and HemoPImod [29]. The feature extraction methods involved collect information on hemolytic peptides from various aspects such as amino acid composition, peptide composition, physicochemical properties, and atomic descriptors. The classifiers cover almost all common machine learning algorithms. However, the methods and datasets used by these models are outdated. Moreover, HemoPImod is unable to identify hemolytic peptide sequences longer than 25 amino acids. Language model-based methods did not start to appear until 2021. HemoNet [30] was the first to employ the SeqVec language model to capture the contextual features of amino acids, but it struggles to generalize well to unseen data. With the development of stronger transformer models, AMPDeep was the first, in 2022, to use a transformer-based pretrained model (Prot-BERT-BFD) to represent the features of peptide sequences [31]. However, its (...truncated)