Data Mining-based DNS Log Analysis
Ann. Data. Sci. (2014) 1(3–4):311–323
DOI 10.1007/s40745-014-0023-7
Hongyuan Cui · Jiajun Yang · Ying Liu ·
Zheng Zheng · Kaichao Wu
Received: 15 October 2014 / Revised: 20 November 2014 / Accepted: 20 December 2014 /
Published online: 17 January 2015
© Springer-Verlag Berlin Heidelberg 2015
H. Cui · J. Yang · Y. Liu (B)
School of Computer and Control, University of Chinese Academy of Sciences, Beijing, China
Y. Liu
Fictitious Economy and Data Science Research Center,
Chinese Academy of Sciences, Beijing, China
Z. Zheng · K. Wu
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Abstract The domain name system (DNS) provides a critical function in directing Internet
traffic. Defending DNS servers from bandwidth attacks is a significant task for DNS
service providers. Traditional rule-based anomaly or intrusion detection methods are
not able to update their rules dynamically. Data mining-based approaches are able to
find various patterns in massive, dynamic query traffic data, and these patterns may
help DNS service providers detect anomalies in real time. In this paper, a novel
frequent episode mining algorithm is proposed, together with a volume trend prediction
method that allows anomalies to be detected in real time. A density-based clustering
approach is adopted to partition numerous domain names into different groups based
on the characteristics of their query volume time series. A consistent episode mining
method is proposed to find how query traffic ‘propagates’ between different domain
names over time. Experiments are performed on a real-world DNS log data
set. Interesting patterns are presented, indicating that data mining-based approaches are
suitable and promising in the domain of DNS service.
Keywords Data mining · Clustering · Frequent pattern mining · DNS ·
Anomaly detection
1 Introduction
The domain name system (DNS) is a hierarchical distributed naming system for computers,
services, or any resource connected to the Internet. DNS resolves domain names
into IP addresses for the purpose of locating computer services and devices worldwide.
By providing a worldwide, distributed keyword-based redirection service, DNS is an
essential component of the functionality of the Internet.
With ever-increasing network traffic and growing complexity of network topology, problems
often occur in DNS service. For example, a large-scale network breakdown
happened in 6 provinces in China in 2009 due to an attack on a DNS server by a
hacker. China has been one of the countries suffering from a large number of network
attacks. Security has become a crucial problem in DNS service. Thus, it is an
important task for DNS service providers to detect and report anomalies or exceptions
as early as possible, and to reduce the losses resulting from unexpected events. Another
important task is to provide high quality service to Web users.
Traditional methods in DNS security are rule-based. DNS experts have
to identify the characteristics or features of any abnormal behavior from historical
data offline, and then explicitly provide them to the monitoring system in the form
of rules. However, such rule-based methods have two serious weaknesses: (1) the
rules are not easy to update, since the effort of domain experts is required, while the
patterns of abnormal behavior in the network evolve rapidly, so
the effectiveness of the detection system is significantly reduced over time; (2) the
historical data set collected by a DNS system is so large that it is beyond the ability of
human analysts to examine. For example, the number of query records captured by the DNS log at
a top level domain server is over 40 billion in a single month. Automatic quantitative
analysis techniques for massive DNS data are in real demand.
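To give a sense of the scale quoted above, the monthly figure can be converted into an average per-second rate. This is only a back-of-the-envelope estimate: the 30-day month and the uniform arrival rate are assumptions made here for illustration and are not stated in the source.

```latex
\[
\frac{4 \times 10^{10}\ \text{queries}}{30 \times 86\,400\ \text{s}}
  = \frac{4 \times 10^{10}}{2.592 \times 10^{6}}\ \text{queries/s}
  \approx 1.5 \times 10^{4}\ \text{queries/s}
\]
```

Under these assumptions the server records roughly 15,000 queries per second on average, and peak rates would be higher still.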
Data mining is a family of techniques that can discover interesting, meaningful and
understandable patterns hidden in massive datasets. The patterns discovered by data
mining can be utilized in decision-making in many domains. To the best of our knowledge,
data mining has not yet been widely used in DNS query traffic analysis. Therefore,
in this paper, we explore solving problems in DNS service by applying various data
mining methods. Our contributions are listed as follows:
(1) In order to predict the traffic volume at a domain name and prevent attacks by
hackers, we propose a volume prediction method. It discovers frequently
occurring query volume trend patterns from the most recent DNS log. If the current
query volume at a domain name does not match the predicted trend, an
anomaly alarm is delivered to the system instantly.
(2) In order to gain a deep understanding of the features of the query traffic of different
domain names, we partition the query traffic time series of all the domain
names into distinct clusters by adopting a density-based clustering algorithm.
The representative query traffic series of each cluster is referred to as the query
traffic pattern. Such results give us a chance to further investigate the browsing
patterns of Web users or to identify the common features of various anomalous
queries.
(3) A consistent pattern-based traffic volume monitoring and anomaly prediction
method is proposed. If a frequent episode fe happens on a large portion of days at
a given DNS server at a certain time, it is called a consistent pattern. All the DNS
servers that have a common fe are clustered into the same group. Once an abnormal
query volume is observed at a DNS server, a warning message is sent out to
the other members of the cluster. This method gives us a chance to predict an
abnormal volume before it actually takes place.
(4) The effectiveness of our proposed methods is evaluated on a real-world DNS log
dataset and the experimental results are presented.
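The grouping and warning logic described in contribution (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that frequent episodes have already been mined for each day, and the episode encoding (a tuple of trend symbols paired with an hour slot), the server names, the threshold `min_day_ratio` and all function names are hypothetical choices made for this example.

```python
from collections import defaultdict


def consistent_patterns(daily_episodes, min_day_ratio):
    """Return the (episode, time_slot) pairs seen on a large enough share of days.

    daily_episodes: list with one set per day; each set holds the
    (episode, time_slot) pairs observed at one server on that day.
    The encoding is an assumption of this sketch.
    """
    n_days = len(daily_episodes)
    if n_days == 0:
        return set()
    counts = defaultdict(int)
    for day in daily_episodes:
        for key in day:
            counts[key] += 1
    return {key for key, c in counts.items() if c / n_days >= min_day_ratio}


def group_servers(server_logs, min_day_ratio):
    """Group servers that share a common consistent pattern."""
    groups = defaultdict(set)
    for server, daily in server_logs.items():
        for pattern in consistent_patterns(daily, min_day_ratio):
            groups[pattern].add(server)
    # A group is only useful for warnings if it has at least two members.
    return {p: members for p, members in groups.items() if len(members) > 1}


def servers_to_warn(groups, abnormal_server):
    """Return the other members of every group containing abnormal_server."""
    warned = set()
    for members in groups.values():
        if abnormal_server in members:
            warned |= members - {abnormal_server}
    return warned


# Hypothetical four-day history for three servers.
RISE_AT_9 = (("up", "up"), 9)
DROP_AT_20 = (("down",), 20)
server_logs = {
    "ns1": [{RISE_AT_9}, {RISE_AT_9}, {RISE_AT_9, DROP_AT_20}, {DROP_AT_20}],
    "ns2": [{RISE_AT_9}, {RISE_AT_9}, {RISE_AT_9}, {RISE_AT_9}],
    "ns3": [{DROP_AT_20}, set(), {RISE_AT_9}, set()],
}

groups = group_servers(server_logs, min_day_ratio=0.75)
warned = servers_to_warn(groups, "ns1")
```

With this toy data, only the morning rise is consistent at both `ns1` and `ns2`, so an abnormal volume at `ns1` would trigger a warning to `ns2` alone. Keying the groups by pattern keeps the lookup simple, at the cost of assuming that episodes are compared by exact equality.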
The rest of this paper is organized as follows. Section 2 overviews some related
work. In Sect. 3, we present our query volume prediction method. In Sect. 4, we
briefly introduce the DBSCAN clustering algorithm and present the clustering results.
Section 5 presents our consistent pattern-based volume monitoring and anomaly prediction method. Section 6 summarizes our current work.
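As a concrete picture of the density-based clustering named in contribution (2) and Sect. 4, the following is a minimal pure-Python sketch of DBSCAN applied to short query volume series. The normalization by total volume, the Euclidean distance, the parameter values `eps` and `min_pts`, and the sample domain names and volumes are all assumptions made for illustration; the settings actually used in the paper are not given in this part of the text.

```python
import math

NOISE = -1


def normalize(series):
    """Scale a series to sum to 1 so that only its shape is compared."""
    total = sum(series)
    return [v / total for v in series] if total else list(series)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical query volumes over six time slots per domain name.
volumes = {
    "a.example": [10, 12, 50, 55, 12, 10],
    "b.example": [20, 22, 100, 110, 25, 20],
    "c.example": [5, 6, 26, 27, 6, 5],
    "d.example": [50, 55, 10, 12, 11, 52],
    "e.example": [100, 108, 22, 25, 20, 105],
    "f.example": [40, 44, 9, 10, 9, 42],
    "g.example": [1, 1, 1, 200, 1, 1],
}

points = [normalize(s) for s in volumes.values()]
labels = dbscan(points, eps=0.05, min_pts=3)
clusters = dict(zip(volumes, labels))
```

In this toy data the first three domains share a mid-period peak, the next three share the opposite shape, and the single-spike series is left as noise. One reason a density-based method suits this task is that the number of clusters does not have to be fixed in advance and unusual series are not forced into a cluster.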
2 Related Work
Since query traffic flow is an accurate reflection of DNS service, anomaly detection
in query traffic has received increasing attention. For example, Jung et al. [1]
proposed a novel method to detect anomalies in SMTP clients using DNS query traffic.
Ishibashi et al. [2] proposed a method to discover junk mail senders by studying ISP
DNS. In some circumstances, however, DNS itself can be part of an Internet attack, such as
DDoS [3] and DNS cache poisoning [4].
Ji et al. [5] proposed a K-means clustering based algorithm to cluster the temporal
behaviors of IP addresses and domai (...truncated)