Data Mining-based DNS Log Analysis

Annals of Data Science, Dec 2014

The Domain Name System (DNS) provides a critical function in directing Internet traffic. Defending DNS servers from bandwidth attacks is a significant task for DNS service providers. Traditional rule-based anomaly or intrusion detection methods cannot update their rules dynamically. Data mining-based approaches can find various patterns in massive, dynamic query traffic data, and these patterns may help DNS service providers detect anomalies in real time. In this paper, a novel frequent episode mining algorithm is proposed, together with a volume trend prediction method that allows anomalies to be detected in real time. A density-based clustering approach is adopted to partition numerous domain names into different groups based on the characteristics of their query volume time series. A consistent episode mining method is proposed to find how query traffic 'propagates' at different times between different domain names. Experiments are performed on a real-world DNS log data set. Interesting patterns are presented, indicating that data mining-based approaches are suitable and promising in the domain of DNS service.


Ann. Data. Sci. (2014) 1(3–4):311–323
DOI 10.1007/s40745-014-0023-7

Hongyuan Cui · Jiajun Yang · Ying Liu · Zheng Zheng · Kaichao Wu

Received: 15 October 2014 / Revised: 20 November 2014 / Accepted: 20 December 2014 / Published online: 17 January 2015
© Springer-Verlag Berlin Heidelberg 2015

H. Cui · J. Yang · Y. Liu (corresponding author): School of Computer and Control, University of Chinese Academy of Sciences, Beijing, China
Y. Liu: Fictitious Economy and Data Science Research Center, Chinese Academy of Sciences, Beijing, China
Z. Zheng · K. Wu: Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Keywords Data mining · Clustering · Frequent pattern mining · DNS · Anomaly detection

1 Introduction

The Domain Name System (DNS) is a hierarchical distributed naming system for computers, services, or any resource connected to the Internet. DNS resolves queries for domain names into IP addresses for the purpose of locating computer services and devices worldwide. By providing a worldwide, distributed, keyword-based redirection service, DNS is an essential component of the functionality of the Internet. With ever-increasing network traffic and growing complexity of network topology, problems often occur in DNS service. For example, a large-scale network outage affected six provinces in China in 2009 after a hacker attacked a DNS server. China has been one of the countries suffering from an enormous number of network attacks. Security has become a crucial problem in DNS service. Thus, it is an important task for DNS service providers to detect and report anomalies or exceptions as early as possible, and to reduce the losses resulting from such unexpected events. Another important task is to provide high-quality service to Web users.

Traditional methods in DNS security are rule-based. DNS experts have to identify the characteristics or features of abnormal behavior from historical data offline, and then explicitly provide them to the monitoring system in the form of rules. However, such rule-based methods have two serious weaknesses: (1) the rules are not easy to update because the effort of domain experts is required, whereas the patterns of abnormal behavior in the network evolve dramatically, so the effectiveness of the detection system is significantly reduced; (2) the historical data set collected by the DNS system is so large that it is beyond human ability to analyze. For example, the number of query records captured by the DNS log at a top-level domain server is over 40 billion in a single month.
Automatic quantitative analysis techniques for massive DNS data are in real demand. Data mining is a family of techniques that discover interesting, meaningful, and understandable patterns hidden in massive datasets. The discovered patterns can support decision-making in many domains. To the best of our knowledge, data mining has not yet been widely used in DNS query traffic analysis. Therefore, in this paper, we explore solving problems in DNS service by applying various data mining methods. Our contributions are as follows:

(1) To predict the traffic volume at a domain name and prevent attacks by hackers, we propose a volume prediction method. It discovers frequently occurring query volume trend patterns from the most recent DNS log. If the current query volume at a domain name does not match the predicted trend, an anomaly alarm is delivered to the system instantly.
(2) To gain a deeper understanding of the query traffic features of different domain names, we partition the query traffic time series of all domain names into distinct clusters by adopting a density-based clustering algorithm. The representative query traffic series of each cluster is referred to as the query traffic pattern. Such results give us a chance to further investigate the browsing patterns of Web users or to identify the common features of various anomalous queries.
(3) A consistent-pattern-based traffic volume monitoring and anomaly prediction method is proposed. If a frequent episode fe happens on a large portion of days at a given DNS server at a certain time, it is called a consistent pattern. All DNS servers that share a common fe are clustered into the same group. Once an abnormal query volume is observed at a DNS server, a warning message is sent to the other members of the cluster.
This method gives us a chance to predict an abnormal volume before it actually takes place.
(4) The effectiveness of our proposed methods is examined on a real-world DNS log dataset, and the experimental results are presented.

The rest of this paper is organized as follows. Section 2 overviews related work. In Sect. 3, we present our query volume prediction method. In Sect. 4, we briefly introduce the DBSCAN clustering algorithm and present the clustering results. Section 5 presents our consistent-pattern-based volume monitoring and anomaly prediction method. Section 6 summarizes our current work.

2 Related Work

Since query traffic flow is an accurate reflection of DNS service, anomaly detection in query traffic has received increasing attention. For example, Jung et al. [1] proposed a novel method to detect anomalies in SMTP clients using DNS query traffic. Ishibashi et al. [2] proposed a method to discover junk mail senders by studying ISP DNS. In some circumstances, however, DNS itself can be part of an Internet attack, such as DDoS [3] and DNS cache poisoning [4]. Ji et al. [5] proposed a K-means clustering based algorithm to cluster the temporal behaviors of IP addresses and domai (...truncated)
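
As a rough illustration of the consistent pattern idea stated in contribution (3) of the Introduction, the following Python sketch marks an episode as consistent at a server when it occurs in the same time slot on a large enough share of days, groups servers that share such a pattern, and lists which group members would be warned. It is a minimal sketch under stated assumptions, not the authors' algorithm: the input layout, the threshold `min_day_ratio`, the server names, and the function names are all hypothetical, and the frequent episodes are assumed to have been mined already.

```python
from collections import defaultdict

def consistent_patterns(daily_episodes, min_day_ratio):
    """Return, per server, the (episode, time_slot) pairs seen on at least
    min_day_ratio of the observed days (hypothetical input layout)."""
    result = {}
    for server, days in daily_episodes.items():
        counts = defaultdict(int)
        for day in days:              # each day is a set of (episode, time_slot)
            for key in day:
                counts[key] += 1
        result[server] = {
            key for key, c in counts.items()
            if days and c / len(days) >= min_day_ratio
        }
    return result

def group_by_pattern(patterns):
    """Group servers that share a consistent pattern; keep groups of 2+."""
    groups = defaultdict(set)
    for server, keys in patterns.items():
        for key in keys:
            groups[key].add(server)
    return {key: members for key, members in groups.items() if len(members) > 1}

def servers_to_warn(groups, abnormal_server):
    """Other members of every group that contains the abnormal server."""
    warned = set()
    for members in groups.values():
        if abnormal_server in members:
            warned |= members
    warned.discard(abnormal_server)
    return warned

# Made-up data: five days of mined episodes at three hypothetical servers.
# An episode is written as a tuple of domain names; 9 and 14 are hour slots.
EP_AB = (("a.example", "b.example"), 9)
EP_CD = (("c.example", "d.example"), 14)
logs = {
    "ns1": [{EP_AB}, {EP_AB}, {EP_AB, EP_CD}, {EP_AB}, set()],
    "ns2": [{EP_AB}, {EP_AB}, {EP_AB}, {EP_AB}, {EP_AB}],
    "ns3": [{EP_CD}, set(), set(), set(), set()],
}

patterns = consistent_patterns(logs, min_day_ratio=0.6)
groups = group_by_pattern(patterns)
print(servers_to_warn(groups, "ns1"))
```

With this made-up data, only the episode in hour slot 9 is consistent (at `ns1` and `ns2`), so an abnormal volume at `ns1` would warn `ns2`, while `ns3` belongs to no group. What counts as "a large portion of days" is left open in the text above, which is why the threshold is a parameter rather than a fixed value.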


Article home page: https://link.springer.com/article/10.1007/s40745-014-0023-7

Hongyuan Cui, Jiajun Yang, Ying Liu, Zheng Zheng, Kaichao Wu. Data Mining-based DNS Log Analysis, Annals of Data Science, 2014, pp. 311-323, Volume 1, Issue 3-4, DOI: 10.1007/s40745-014-0023-7