SSDLog: a semi-supervised dual branch model for log anomaly detection (pdf)

World Wide Web
https://doi.org/10.1007/s11280-023-01174-y

Siyang Lu1 · Ningning Han1 · Mingquan Wang2 · Xiang Wei2 · Zaichao Lin1 · Dongdong Wang3

Received: 22 February 2023 / Revised: 31 March 2023 / Accepted: 8 April 2023
© The Author(s) 2023

Abstract
As computer systems grow more versatile and complex, warnings and errors are inevitable. System logs are critical for effectively monitoring a system's status, and deep learning is a promising way to detect anomalies in them. However, abnormal system logs are often difficult to collect in the real world, and categorizing logs effectively and accurately is even more time-consuming. This data incompleteness hinders the use of deep learning in this practical application. In this paper, we put forward a novel semi-supervised dual branch model that alleviates the need for large-scale labeled logs when training a deep system log anomaly detector. Specifically, our model consists of two homogeneous networks that share the same parameters: a weakly augmented teacher model and a strongly augmented student model. In the teacher model, the log features are augmented with small Gaussian noise, while in the student model, strong augmentation is injected to force the model to learn a more robust feature representation under the guidance of the soft labels provided by the teacher model. Furthermore, to utilize unlabeled samples more effectively, we propose a flexible label screening strategy that takes into account the confidence and stability of pseudo-labels. Experimental results show that our model performs favorably on the prevalent HDFS and Hadoop Application datasets. Specifically, with only 30% of the training data labeled, our model achieves results comparable to those of the fully supervised version.
Keywords Log anomaly detection · Semi-supervised learning · Distributed system · Dual branch

Corresponding author: Siyang Lu. Extended author information available on the last page of the article.

1 Introduction
With the exponential growth of computing tasks and data, distributed parallel computing systems (DPCS) are increasingly adopted to make full use of hardware resources and to deploy tasks rapidly and effectively. While DPCS are effective in many ways, they are not easy to maintain and manage, which can cause serious system problems. To effectively monitor the system's health, system logs are usually collected for diagnosis. Generally, log-based anomaly detection is realized by mining a large amount of system log data and classifying it effectively, which can be treated simply as a binary classification task. At present, there are two main families of approaches to this task: shallow machine learning algorithms [1–5] and deep models [6–10]. Traditional machine learning methods do not need a large amount of well-annotated data to achieve satisfactory classification performance. Deep models are much more accurate than traditional machine learning, but their dependence on massive data often makes them inadequate in real applications. In log anomaly detection, the diversity of anomalies and the large number of logs make effective data annotation labor-intensive, and in some real distributed systems abnormal data is often difficult to collect, which makes effective annotations even scarcer. In this paper, to tackle the above-mentioned issue, we propose an innovative semi-supervised dual branch log anomaly detection model, dubbed SSDLog, that alleviates the need for large-scale labeled logs when training a deep system log anomaly detector.
Specifically, training our model does not require a large amount of labeled data as conventional deep models do; it achieves a comparable training effect with only 30% of the data labeled and the remaining 70% unlabeled. More explicitly, our model consists of two homogeneous networks that share the same parameters: a weakly augmented teacher model and a strongly augmented student model. In the teacher model, the log features are augmented with small Gaussian noise, while in the student model, strong augmentation is injected to force the model to learn a more robust feature representation under the guidance of the teacher model, which provides more stable soft pseudo-labels for supervised training. In the meantime, to utilize unlabeled samples more effectively, we propose a label screening strategy that takes into account the confidence and stability of pseudo-labels, which yields reliable and learnable training samples.

To summarize, this paper makes the following three contributions:

1. A novel semi-supervised dual branch model for efficient log anomaly detection is proposed that requires only a few labels in the training set. To the best of our knowledge, this is one of the latest attempts at the log anomaly detection task in a real scenario.
2. A novel teacher-student semi-supervised model integrated with a flexible label screening strategy is established to efficiently detect system log anomalies.
3. Our model achieves superior performance on prevalent system log anomaly detection datasets. In particular, the performance of SSDLog is comparable to that of the fully labeled methods [7, 11] when only 30% of the data is labeled.

The remainder of the paper is organized as follows. In Section 2, we survey recent related work. Section 3 details our SSDLog approach. Sections 4 and 5 present the experimental results on multiple datasets.
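
The dual branch idea described above (a weakly augmented teacher, a strongly augmented student, and pseudo-label screening by confidence and stability) can be illustrated with a minimal sketch. This is not the authors' implementation. The small Gaussian noise for the teacher follows the text; everything else is an assumption made for illustration: the logistic scorer `predict` standing in for the shared network, the strong augmentation (larger noise plus random feature dropping), the stability measure (standard deviation of the teacher output over several noisy views), and all thresholds and function names.

```python
import math
import random


def predict(weights, bias, features):
    """Shared-parameter scorer used by both branches: P(anomaly | features).
    ASSUMPTION: a logistic model stands in for the deep network."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def weak_augment(features, rng, sigma=0.01):
    """Teacher branch: small Gaussian noise, as described in the text."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def strong_augment(features, rng, sigma=0.2, drop_prob=0.2):
    """Student branch. ASSUMPTION: larger noise plus random feature dropping;
    the text does not specify the strong augmentation."""
    return [0.0 if rng.random() < drop_prob else x + rng.gauss(0.0, sigma)
            for x in features]


def screen_pseudo_label(weights, bias, features, rng,
                        n_views=5, conf_thresh=0.9, max_std=0.05):
    """Return (soft_label, keep) for one unlabeled sample.
    ASSUMPTION: confidence is max(p, 1 - p) of the mean teacher output, and
    stability is the standard deviation across several weakly augmented views."""
    probs = [predict(weights, bias, weak_augment(features, rng))
             for _ in range(n_views)]
    mean = sum(probs) / n_views
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n_views)
    confidence = max(mean, 1.0 - mean)
    keep = confidence >= conf_thresh and std <= max_std
    return mean, keep


def consistency_loss(weights, bias, features, soft_label, rng, eps=1e-12):
    """Binary cross-entropy between the teacher's soft label and the
    student's prediction on a strongly augmented view of the same sample."""
    p = predict(weights, bias, strong_augment(features, rng))
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


# Usage on two hypothetical unlabeled feature vectors.
rng = random.Random(0)
weights, bias = [4.0, -4.0], 0.0
for sample in ([2.0, -2.0], [0.0, 0.0]):
    soft_label, keep = screen_pseudo_label(weights, bias, sample, rng)
    if keep:
        loss = consistency_loss(weights, bias, sample, soft_label, rng)
        print("kept", sample, "loss", loss)
    else:
        print("screened out", sample)
```

In this sketch, both branches call the same `predict` with the same parameters, which mirrors the shared-parameter design; only samples that pass screening contribute a student loss, so unreliable pseudo-labels do not drive training.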
Finally, we conclude the paper in Section 6 with a summary and an outlook on future work.

2 Related work
This section introduces related work from two aspects: log anomaly detection and semi-supervised deep learning.

2.1 Log anomaly detection approaches
Before deep learning took over computer vision and NLP, statistical approaches and non-deep machine learning approaches were mainly used in the log analysis field. On the one hand, classic statistical methods compute specific features that are manually extracted from log data. Traditional statistical approaches include the PCA-based approach [12]. In addition, Safyallah et al. [13] analyze frequent and common log sequence execution paths to detect anomalies. Fu et al. [14] use a rule-based method to identify log templates and detect anomalies in distributed system logs. To prevent certain features extracted by statistical approaches from degrading log anomaly detection, many studies have emerged that use non-deep machine learning methods to detect anomalies. The related approaches include SVM-based approaches [1, 2], a Bayesian learning-based model [3], a decision tree-based model [4], HMM(Hidden Markov Model)-based approach [ (...truncated)