Edge computing empowered anomaly detection framework with dynamic insertion and deletion schemes on data streams
World Wide Web
https://doi.org/10.1007/s11280-022-01052-z
Haolong Xiang1 · Xuyun Zhang1
Received: 2 November 2021 / Revised: 17 March 2022 / Accepted: 7 April 2022
© The Author(s) 2022
Abstract
Anomaly detection plays a crucial role in many Internet of Things (IoT) applications
such as traffic anomaly detection for smart transportation and medical diagnosis for smart
healthcare. With the explosion of IoT data, anomaly detection on data streams must respond
in real time and remain robust when large volumes of data arrive simultaneously from
diverse application fields. However, existing methods are either slow or
application-specific. Inspired by edge computing and generic anomaly detection techniques, we propose an isolation forest based framework with dynamic Insertion and Deletion schemes (IDForest), which incrementally updates the forest to detect anomalies
on data streams. In addition, IDForest is deployed on edge servers in parallel by packing each tree into a subtask, which enables fast anomaly detection on data streams.
Extensive experiments on both synthetic and real-life datasets demonstrate the efficiency
and robustness of our framework for anomaly detection.
Keywords Anomaly detection · Data streams · Large-scale data · Edge computing ·
Efficiency and robustness
This article belongs to the Topical Collection: Special Issue on Resource Management at the Edge for
Future Web, Mobile, and IoT Applications
Guest Editors: Qiang He, Fang Dong, Chenshu Wu, and Yun Yang
* Xuyun Zhang
Haolong Xiang
1 Faculty of Science and Engineering, Macquarie University, Sydney 2122, Australia
1 Introduction
The application of Internet of Things (IoT) technologies to the smart world has improved
quality of life and attracted significant attention in academia [4, 28]. With the fast development
and wide deployment of IoT technologies, the volume of data generated by intelligent
applications such as smart cities, smart homes, smart hospitals and smart farms has
exploded [13, 22]. Large-scale IoT data make it harder to detect, quantify and understand
the surrounding environment, which gives criminals more opportunities to intrude [29].
For instance, identifying hacker intrusions in massive network data
or detecting anomalous trends in industrial data that indicate a pending system failure
requires accurate and fast anomaly detection. In real-life applications, these data are sampled
over very short time intervals and flow in indefinitely, forming data streams that
require real-time response to abnormal events. Therefore,
developing effective, real-time anomaly detection techniques for large-scale data
streams should be a research priority [12, 23].
A data stream can be a time series or multidimensional and, unlike static data, has no
fixed length [10]. For an infinite data stream, anomaly
detection is performed over a sliding window, which confines the data instances to a
fixed-size context. As the window slides, expired data points are removed from the
window while an equal number of new data points are added, and anomalies are
detected within each window. For example, monitoring the click rate of
shopping websites to find anomalous click counts is a typical time-series anomaly
detection task on data streams. Real-time cardiac monitoring, in contrast, produces
multidimensional data streams: medical information is collected from implanted
or wearable sensors and transmitted to a server for diagnosis [17]. However,
these methods are all application-specific, performing anomaly detection in a single
application field or on a single type of data stream. As the number of application types
grows, designing a separate anomaly detection method for each becomes burdensome.
It is therefore worthwhile to design a generic framework for anomaly detection on data
streams that is robust across application fields.
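The sliding-window mechanism described above can be illustrated with a minimal Python sketch. It is not part of the proposed framework; the function name `slide` and the parameters `window_size` and `step` are hypothetical, and the sketch assumes that the window advances by a fixed number of points so that the number of expired points always equals the number of newly added points.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def slide(
    stream: Iterable[float], window_size: int, step: int
) -> Iterator[Tuple[List[float], List[float], List[float]]]:
    """Yield (expired, added, window) each time the window is filled or slides.

    `window_size` and `step` are illustrative parameters, not values from the paper.
    """
    if not 0 < step <= window_size:
        raise ValueError("step must be in the range 1..window_size")
    window: Deque[float] = deque()
    batch: List[float] = []
    for x in stream:
        if len(window) < window_size:
            # Initial fill: nothing expires until the window is full.
            window.append(x)
            if len(window) == window_size:
                yield [], list(window), list(window)
            continue
        batch.append(x)
        if len(batch) == step:
            # Remove the oldest `step` points and add the same number of new ones.
            expired = [window.popleft() for _ in range(step)]
            window.extend(batch)
            yield expired, batch, list(window)
            batch = []


if __name__ == "__main__":
    for expired, added, window in slide(range(1, 8), window_size=4, step=2):
        print("expired:", expired, "added:", added, "window:", window)
```

A detector would be run on each yielded `window`; the `expired` and `added` lists are the points that an incremental model would delete and insert, respectively.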
Monitoring data streams often requires real-time response to anomalous events,
which makes efficient anomaly detection on large-scale data streams difficult.
Because sensor-equipped devices have limited storage and computation resources,
these intensive data are offloaded to cloud or edge servers
for storage and processing. Since edge servers are geographically closer to devices than
cloud servers and provide sufficient computing
and storage power for data streams, deploying models on edge servers is regarded as a
practical way to shorten the processing time for massive data streams.
To illustrate the efficiency of edge computing, Mehnaz et al. recently conducted an experiment
and found that processing a data stream of 100,000 data points takes around 5 times
longer on smart devices than on edge servers [21]. In that experiment, the
windowed Gaussian (W-Gaussian) anomaly detection method was used to detect anomalies.
As a statistics-based method, W-Gaussian is accurate on data that follow
a normal distribution but may perform poorly on data that do not.
This work gives us the idea of combining anomaly detection methods with edge
computing. However, it remains a major challenge to build an accurate anomaly detection
method for data streams with complex data and deploy it on edge servers to monitor
data in real time.
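To make the distributional assumption behind a windowed Gaussian detector concrete, the following Python sketch scores each point against the mean and standard deviation of the preceding window. It is a simplified illustration under our own assumptions, not the implementation used in [21]: the function name `windowed_gaussian_scores`, the default `window_size`, and the choice of score (one minus the two-sided Gaussian tail probability) are all hypothetical.

```python
import math
from collections import deque
from typing import Deque, Iterable, List


def windowed_gaussian_scores(
    stream: Iterable[float], window_size: int = 50, eps: float = 1e-9
) -> List[float]:
    """Return a score in [0, 1] per point; values near 1 suggest an anomaly.

    Assumes the values inside the window are roughly normally distributed.
    """
    window: Deque[float] = deque(maxlen=window_size)  # oldest points drop out automatically
    scores: List[float] = []
    for x in stream:
        if len(window) < 2:
            # Too little history to estimate a distribution.
            scores.append(0.0)
        else:
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            std = max(math.sqrt(var), eps)  # guard against a constant window
            z = abs(x - mean) / std
            # erfc(z / sqrt(2)) is the two-sided tail probability of a standard normal.
            scores.append(1.0 - math.erfc(z / math.sqrt(2.0)))
        window.append(x)
    return scores


if __name__ == "__main__":
    data = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8] * 5 + [25.0]
    print(windowed_gaussian_scores(data)[-1])
```

Because the score depends only on the window mean and variance, data with several modes or heavy tails can yield misleading scores, which is the weakness noted above for data that are not normally distributed.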
Considering the distributed characteristics of edge computing, a distributed processing
method can be deployed on edge servers to speed up detection. Among all types of
anomaly detection methods, ensemble-based methods can be broken
down into multiple concurrent tasks that can be handled independently. We therefore consider
integrating an ensemble-based anomaly detection method with edge computing
to solve the above problem. In previous research, the ensemble-based isolation forest
(iForest) was proposed to provide fast anomaly detection on big data. Benefiting from
its sampling-based ensemble design, iForest achieves good detection accuracy with
short processing time on large datasets [18]. Building on this scheme,
Guha et al. proposed the robust random cut forest (RRCF) to detect anomalies in dynamic
data streams [9]. RRCF improves the original data partitioning of iForest and
updates the tree structure by inserting and deleting leaves. Although their experiments
showed that RRCF can capture the beginning and end of an anomalous event on a single
data stream, it fails to provide satisfactory detection accuracy and real-time response on the
data stream with multidimensional data instances. Therefore, we aim to desi (...truncated)