SSDLog: a semi-supervised dual branch model for log anomaly detection
World Wide Web
https://doi.org/10.1007/s11280-023-01174-y
Siyang Lu1 · Ningning Han1 · Mingquan Wang2 · Xiang Wei2 · Zaichao Lin1 ·
Dongdong Wang3
Received: 22 February 2023 / Revised: 31 March 2023 / Accepted: 8 April 2023
© The Author(s) 2023
Abstract
As computer systems grow more versatile and complex, warnings and errors are inevitable.
System logs are critical for effectively monitoring a system's status, and deep learning is a
promising way to detect anomalies in them. However, abnormal system logs in the real world
are often difficult to collect, and categorizing logs effectively and accurately is an even more
time-consuming task. This data incompleteness hinders the use of deep learning in this
practical application. In this paper, we put forward a novel semi-supervised dual branch
model that alleviates the need for large-scale labeled logs when training a deep system log
anomaly detector. Specifically, our model consists of two homogeneous networks that share
the same parameters: one is called the weak augmented teacher model and the other the
strong augmented student model. In the teacher model, the log features are augmented
with small Gaussian noise, while in the student model, strong augmentation is injected
to force the model to learn a more robust feature representation under the guidance of the
soft labels provided by the teacher model. Furthermore, to utilize unlabeled samples effectively, we
propose a flexible label screening strategy that takes into account the confidence and stability
of pseudo-labels. Experimental results show that our model performs favorably on the prevalent HDFS
and Hadoop Application datasets. Specifically, with only 30% of the training data labeled, our model
achieves results comparable to those of the fully supervised version.
Keywords Log anomaly detection · Semi-supervised learning · Distributed system ·
Dual branch
1 Introduction
With the exponential growth of computing tasks and data, distributed parallel computing
systems (DPCS) are increasingly widely adopted to make full use of hardware resources to
achieve rapid and effective task deployment. While DPCS are effective in many ways, they
are not easy to maintain and manage, which can cause serious system problems. To
effectively monitor the system's health, system logs are usually collected for diagnosis.
Generally, log-based anomaly detection is realized by mining a large amount
of system log data and classifying it effectively, which can be treated simply as
a binary classification task. At present, this task is mainly addressed by shallow machine
learning algorithms [1–5] and by deep models [6–10].
Traditional machine learning methods do not need a large amount of well-annotated
data to achieve a satisfactory classification effect. Deep models achieve much higher
accuracy than traditional machine learning, but their dependence on massive data often
makes them inadequate in real applications. In log anomaly
detection, the diversity of anomalies and the large number of logs make effective annotation
of data time- and energy-consuming, and in some real distributed systems abnormal data is often
difficult to collect, which makes effective annotations even scarcer.
In this paper, to tackle the above-mentioned issue in log anomaly detection, we propose an
innovative semi-supervised dual branch log anomaly detection model, dubbed SSDLog, that
alleviates the need for large-scale labeled logs when training a deep system log anomaly detector.
Specifically, training our model does not require a large amount of labeled
data as conventional deep models do; it achieves a comparable training effect
with only 30% labeled data and the remaining 70% unlabeled. More
explicitly, our model consists of two homogeneous networks that share the same parameters:
one is called the weak augmented teacher model and the other the strong augmented
student model. In the teacher model, the log features are augmented with small Gaussian
noise, while in the student model, strong augmentation is injected to force the model to
learn a more robust feature representation under the guidance of the teacher model, which
provides more stable soft pseudo-labels for supervised training. Meanwhile, to further
utilize unlabeled samples effectively, we propose a label screening strategy that takes into
account the confidence and stability of pseudo-labels and thereby obtains reliable and
learnable training samples.
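The unlabeled branch described above can be illustrated with a minimal sketch. It is not the SSDLog implementation: the logistic classifier, the noise levels, the use of feature dropout as the strong augmentation, and the concrete screening rule (a confidence threshold plus a maximum spread over several teacher passes) are all assumptions chosen only to make the idea concrete, and every function name is hypothetical.

```python
import math
import random

def predict(weights, bias, features):
    """Shared-parameter classifier: probability that a log sequence is anomalous."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def weak_augment(features, rng, sigma=0.01):
    """Teacher branch: small Gaussian noise (sigma is an assumed value)."""
    return [x + rng.gauss(0.0, sigma) for x in features]

def strong_augment(features, rng, sigma=0.3, drop_prob=0.2):
    """Student branch: heavier perturbation. Larger noise plus random feature
    dropout is an assumption; the text above does not specify the operation."""
    return [0.0 if rng.random() < drop_prob else x + rng.gauss(0.0, sigma)
            for x in features]

def screen_pseudo_label(weights, bias, features, rng,
                        passes=5, conf_threshold=0.9, max_spread=0.05):
    """Return a soft pseudo-label if it is confident and stable, else None.
    Both criteria and their thresholds are assumed for illustration."""
    probs = [predict(weights, bias, weak_augment(features, rng))
             for _ in range(passes)]
    mean_prob = sum(probs) / passes
    confidence = max(mean_prob, 1.0 - mean_prob)   # distance from 0.5
    spread = max(probs) - min(probs)               # stability across passes
    if confidence >= conf_threshold and spread <= max_spread:
        return mean_prob
    return None

def consistency_loss(soft_label, student_prob, eps=1e-12):
    """Cross-entropy between the teacher's soft label and the student output."""
    return -(soft_label * math.log(student_prob + eps)
             + (1.0 - soft_label) * math.log(1.0 - student_prob + eps))

def unlabeled_loss(weights, bias, unlabeled, rng):
    """Average student loss over the unlabeled samples that pass screening.
    Returns (loss, number_of_samples_kept)."""
    losses = []
    for features in unlabeled:
        soft_label = screen_pseudo_label(weights, bias, features, rng)
        if soft_label is None:
            continue  # unreliable pseudo-label: skip this sample
        student_prob = predict(weights, bias, strong_augment(features, rng))
        losses.append(consistency_loss(soft_label, student_prob))
    kept = len(losses)
    return (sum(losses) / kept if kept else 0.0), kept
```

In a full training loop, this unlabeled term would be added to an ordinary supervised loss computed on the labeled portion of the data (30% in the setting described above), and both branches would update the same shared parameters.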
To summarize, the following three contributions are made in this paper:
1. A novel semi-supervised dual branch model for efficient log anomaly detection is proposed
for training sets in which only a few labels are available. To the best of our knowledge,
this is one of the latest attempts at the log anomaly detection task in a real scenario.
2. A novel teacher-student semi-supervised model integrated with a flexible label screening
strategy is established to efficiently detect system log anomalies.
3. Our model achieves superior performance on prevalent system log anomaly detection
datasets. In particular, the performance of SSDLog is comparable to that of the
fully labeled methods [7, 11] when only 30% of the data is labeled.
The remainder of the paper is organized as follows. In Section 2, we survey recent related
work. Section 3 details our SSDLog approach. Sections 4 and 5 present the experimental
results on multiple datasets. Finally, we conclude the paper in Section 6 with a summary and
an outlook on future work.
2 Related work
This section introduces related work from two aspects: log anomaly detection and
semi-supervised deep learning.
2.1 Log anomaly detection approaches
Before deep learning took over computer vision and NLP, statistical approaches and non-deep
machine learning approaches were mainly leveraged in the log analysis field.
Classic statistical methods compute specific features manually extracted from log data.
Traditional statistical approaches include the PCA-based approach [12]. In addition,
Safyallah et al. [13] analyze frequent and common log sequence execution paths to detect
anomalies. Fu et al. [14] use a rule-based method to identify log templates and detect anomalies
in distributed system logs.
To prevent certain features extracted by statistical approaches from degrading
log anomaly detection, many studies emerged that use non-deep machine learning
methods to detect anomalies. The related approaches include SVM-based approaches [1,
2], Bayesian learning-based model [3], Decision Tree-based model [4], HMM (Hidden
Markov Model)-based approach [ (...truncated)