Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network

ITM Web of Conferences, Jan 2016

With the development of the Internet, malwares have also been expanded on the network systems rapidly. In order to deal with the diversity and amount of the variants, a number of automated behavior analysis tools have emerged as the time requires. Yet these tools produce detailed behavior reports of the malwares, it still needs to specify its category and judge its criticality manually. In this paper, we propose an automated malware classification approach based on the behavior analysis. We firstly perform dynamic analyses to obtain the detailed behavior profiles of the malwares, which are then used to abstract the main features of the malwares and serve as the inputs of the Back Propagation (BP) Neural Network model.The experimental results demonstrate that our classification technique is able to classify the malware variants effectively and detect malware accurately.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:

https://www.itm-conferences.org/articles/itmconf/pdf/2016/02/itmconf_ita2016_02001.pdf

Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network

ITM Web of Conferences Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network Zhi-Peng PAN 0 Chao FENG 0 Chao-Jing TANG 0 0 College of Electronic Science and Engineering, National University of Defense Technology , 410073 Changsha , China With the development of the Internet, malwares have also been expanded on the network systems rapidly. In order to deal with the diversity and amount of the variants, a number of automated behavior analysis tools have emerged as the time requires. Yet these tools produce detailed behavior reports of the malwares, it still needs to specify its category and judge its criticality manually. In this paper, we propose an automated malware classification approach based on the behavior analysis. We firstly perform dynamic analyses to obtain the detailed behavior profiles of the malwares, which are then used to abstract the main features of the malwares and serve as the inputs of the Back Propagation (BP) Neural Network model.The experimental results demonstrate that our classification technique is able to classify the malware variants effectively and detect malware accurately. - Malicious software (Malware), usually in forms of virus, Trojans, worms, botnets, rootkits, and some other potentially unwanted applications, has been the major threat to the internet security. Malware developers use the hiding techniques such as polymorphism and obfuscation [ 1 ] to against signature-based on detection and static malware analysis methods easily and effectively [ 2 ]. In contrast to static analysis, dynamic analysis of malware based on monitoring its behavior during the run-time, which renders the malware more difficult to conceal [ 3, 4 ], and it does become the mainstream method of malicious behavior mining. Yet these dynamic technologies for the malware detection are not sufficient just by forming detailed behavior profiles [ 5 ]. What we need is the ability to automatically categorize of the malware and detect the malware by its behavior. The main contributions of this paper are as follows: 1)Unlike many previous algorithms that monitor the malware behaviors directly on low-level data such as API call monitoring [ 6 ], we implement an automatic dynamic analysis framework by taking the advantages of the present behavior analysis systems. We get the detailed behaviors of the malware including process behaviors, registry behaviors, file behaviors, net behaviors, and other behaviors. 2)We extract the major features of the malware behavior profiles into the behavior vectors by counting the quantity of the every behavior. 3)We proposed a Back Propagation (BP) Neural Network model [ 7 ] for learning the behavior patterns of the same categories of the malwares and classifying the malwares. The experimental section verified the correctness and precision of our algorithm finally. 2 Related Work There are three major methods for classification of the malicious softwares, traditional pattern matching, static analysis, and dynamic analysis. Although static analysis can improve accuracy than the methods of the traditional pattern matching, it can also have the difficulty to handle obfuscated and self-modifying codes. In dynamic analysis, Konrad Rieck [ 8 ] et al. proposed a method using the CWSandbox to analysis the behaviors of the malwares and then using the Support Vector Machines (SVM) for learning and classification. Forrest [ 6 ] proposed fixed-length sequence of N-gram recognition model based on system call. Syed Zainudeen Mohd Shaid [ 9 ] proposed a behavior-based technique to visualize malware behavior in the form of images. This method uses the different color to indicate the different API calls. By using the behavior images, it can be possible to visually identify malware variants of the same family. Guanghui Liang [ 10 ] et al. capture malware behaviors based on the Temu platform and proposed a weighted Jaccard similarity matching algorithm to classify the malware variants. In summary, when dealing with the malware variants classification, behavior analysis is the most effective method. In this paper, we use the present effective behavior analysis system to help analyse of the malware in contrast to just monitoring the API calls or the traces of the APIs as above. 3 Methodology Classification of malware variants has been concerned by analysts in a long period [ 11-14 ]. Evolving malware generates a lot of variants and brings great challenges to analytical work. Although these variants change in the file format and appearance, there are still the same behavior patterns. For example, all variants of the Allaple worm acquire and lock of particular mutexes on infected systems [8]. Aiming to exploit these behavior patterns using machine learning techniques and propose a method which can classify the malware variants automatically based on their behaviors. An outline of our approach is given by the following basic steps: 1) Malware Data Acquisition. A corpus of malware binaries are obtained by collecting the upload suspicious files on the Kafan Forum. A Multi-Engine Online Virus Scan system VirSCAN is applied to identify the known malware instances. 2)Behavior Monitoring. Malware binaries are executed and monitored by the HABO behavior analysis system, which can generate detailed behavior reports. 3)Feature Extraction. Features reflect the behavior patterns, such as process created, foreign memory regions read, mutexes created, or registry key modified, are extracted from the analysis reports and used to map the malware behavior into a high-dimensional vector space. 4)Learning and classification. Back Propagation neural network model is applied to learning and training for the classification of the malwares. 3.1 Malware data acquisition We have obtained up to 13600 unique samples, which are uploaded by the extensive users of the Kafan Forum, using for learning and subsequent classification. After obtaining the samples, we applied the online virus scan system VirSCAN to partition the malwares into common families, such as Adware, Potential Unwanted Application (PUA), Trojan/Downloader. Note that we chose the VirSCAN instead of one Unti-virus product, like Avira, Karpasky, to label the malware as the VirSCAN is multi-engine and we can chose the most of the result to label our sample. We selected 9 most common malware categories and one Non-Malware category on our samples. These families listed in Table 1 represent a broad range of malware categories such as Adware, PUA, and Trojans, and the Non-Malware category can be extend for malware detection directly in the future. Table 1 Malware Families Labeled by The Virscan System 3.2 Behavior Monitoring and Feature Extraction In this section, we use the online behavior analysis system, which called HABO, to monitor the samples’ behavior. Like most of the other online behavior analysis systems, it can analyse the upload binaries and give you a detail behavior report about the malware. Note that our methodology is not bound to the HABO system; it can also be adapted to other behavior analysis systems. Figure 1 shows a part of behavior report of one malware sample. It contains five main aspects, process behaviours, file behaviours, register behaviours, net behaviours, and other behaviours. Furthermore, these 5 main aspects contain 73 sub-behaviors which describe the behavior of the malware in detail. Although the reports show the detail information of the behavior of the malware, it can’t be used for the BP neutral network model directly, which needs the vectorial data as the input. Hence we should extract the main features of the malwares’ behavior from the reports firstly. The method used here is called frequency statistics method. The main steps are showed as follows: 1) Given all the sub-behaviors, we represent them at a particular sequence, such as, create local thread, enumerate process, create a new file process, and so on. 2) We made a count on the all sub-behaviors of the malware respectively. For example, the behavior showed in Figure 1, we can use [2, 1, 1, and 15,] to represent. The figure 2 shows a detail behavior vector of one malware, and the number zero means that this malware doesn’t have the behavior correspondingly. 3.3 Learning and classification 3.3.1 Establish the Model According to the previous description, the norm of our behavior vector is 73. And we can use the vector, r = [r1, r2, r3,…, r10] ri = 0 or 1, to represent the output result. Each element value of the output vector is 0 or 1, 0 represents the malware does not belong to the corresponding category while 1 represents the malware belong to the corresponding category. Given the input vector and the output vector, we established a Back Propagation (BP) Neural Network, which includes one input layer, one hidden layer and one output layer. The input layer and the hidden layer both have 73 neurons and the output layer has 10 neurons. Additionally, the hidden and output neurons include adjustment factor a and b respectively. The connection weight between input layer and hidden layer, hidden layer and output layer is noted by Wij 1  i  73,1  j  73 and Vjl 1  j  73,1  l  10 . The network is showed in activation function of the hidden layer, and for convenience, the input of the network, hidden neuron hj and output neuron Ol is denoted by xi 1  i  73 , net hj 1  j  73 and net Ol 1  l  10 respectively. The output of the hidden neuron hj and output neuron Ol is out hj  and out Ol  . We define the error signals of the output layer and the hidden layer as  oh and  hi 1 l 2 respectively. Additional, E   (target(Ol )  out(Ol )) is 2 k 1 defined as output error. So according to the Back Propagation Neural Networks algorithm, we can derive the expression of the connection weight: Vjl  Vjl  oh \* MERGEFORMAT (1) Wij  Wij  hi \* MERGEFORMAT (2) net(Ol ) Vjl net(hj ) Wij E out(hj ) E out(Ol ) where  oh    out, (Ol ) ,  hi    out, (hj ) and the symbol  means the learning rate of the network. In this paper, the transform factor between the output layer and the hidden layer is denoted by a , and b means the transform factor between the hidden layer and the output layer. 1) The identification of the output function Due to there is no experienced output function, we can only get the function by experiments. For the same samples, when trained 10000 times, we can get the results as shown in Table 2. And from the table, we finally choose the linear function f  x  1.0  x / 5000 . Due to the value of the adjustment factors a and b has the important influence on the speed of the convergence. Figure 4 and Figure 5 show us the relationships between the adjustment factors and the convergence time, when the total error is set to be 0.64. 5 a/10 1 2 3 4 6 7 8 9 10 In order to evaluate the performance of our methodology, we firstly divided the malware corpus randomly into training and testing two partitions, and the samples sizes are 10000 and 3600 respectively. We used the training partition to train the BP neutral network, and used the testing partition to measure the overall performance of our methodology. Besides, the procedure showed above is repeated over five independent experimental runs and we use the average values as our final results. The per-category accuracy for this experiment is shown in Figure 6, and the error bars indicate the variance measured during the experiment runs. From the figure, we can find our average accuracy is up to 86%. And in particular, we can find the last category, which we defined it as the non-malware, whose predict accuracy is up to 99%. In other words, if we used this model to detect the binary in our corpus is whether the malware or not, we have the correct probability approximate to 99%. This result shows that our methodology can be easily extend for malware detection. And more deeply, due to the boundaries of categories 3,4,5,6 and 9 are less obvious, which labelled as Trojan, Tr/downloader, Tr/Crypt, Tr/dropper amd Win32, we find that the variance of these categories are higher than other categories. Figure 7 shows the confusion matrix for classification. If the color of the category is deeper, it means that this category is less error probability be classified into other categories. From the figure, we can find that the categories between 3 and 7 are easily be confused each other, and the category Win32 is most likely to be considered into other categories, which are consistent with our actual situation. In this paper, the behavior of the malware is captured by the online behavior analyze system. After that, we extracted the main feature of the malware in forms of vector and serve it as the input of out Back Propagation Neural Network model. Finally, by training the BP Neural Network, we can use it to classify the malware and detect the malware. Experimental results show that our methodology can classify the malware variants effectively and detect the malware accurately. In the future work, we will focus on how to extracted malware feature that can represent the malware more accurately from the behavior analysis reports. 1. Rad B B , Masrom M , Ibrahim S. Camouflage in malware: from encryption to metamorphism [J]. International Journal of Computer Science and Network Security , 2012 , 12 ( 8 ): 74 - 83 . 2. Moser A , Kruegel C , Kirda E. Limits of static analysis for malware detection [C]//Computer security applications conference, 2007 . ACSAC 2007. Twenty-third annual . IEEE , 2007 : 421 - 430 . 3. Willems C , Holz T , Freiling F . Toward automated dynamic malware analysis using cwsandbox[J] . IEEE Security & Privacy , 2007 (2): 32 - 39 . 4. Egele M , Scholte T , Kirda E , et al. A survey on automated dynamic malware-analysis techniques and tools[J] . ACM Computing Surveys (CSUR) , 2012 , 44 ( 2 ): 6 . 5. Bayer U , Comparetti P M , Hlauschek C , et al. Scalable, Behavior-Based Malware Clustering [C]//NDSS. 2009 , 9 : 8 - 11 . 6. Forrest S , Hofmeyr S , Somayaji A . The evolution of system-call monitoring [C]//Computer Security Applications Conference, 2008 . ACSAC 2008 . Annual . IEEE, 2008 : 418 - 430 . 7. Irwin G W , Warwick K , Hunt K J. Neural network applications in control[M]. Iet , 1995 . 8. Rieck K , Holz T , Willems C , et al. Learning and classification of malware behavior[M]//Detection of Intrusions and Malware , and Vulnerability Assessment. Springer Berlin Heidelberg, 2008 : 108 - 125 . 9. Shaid M , Zainudeen S , Maarof M A . Malware behavior image for malware variant identification [C]//Biometrics and Security Technologies (ISBAST), 2014 International Symposium on. IEEE , 2014 : 238 - 243 . 10. Liang G , Pang J , Dai C. A Behavior-Based Malware Variant Classification Technique [J]. International Journal of Information and Education Technology , 2016 , 6 ( 4 ): 291 . 11. Park Y , Reeves D , Mulukutla V , et al. Fast malware classification by automated behavioral graph matching[C]// Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research . ACM, 2010 : 45 . 12. Cesare S , Xiang Y. Classification of malware using structured control flow[C]//Proceedings of the Eighth Australasian Symposium on Parallel and Distributed Computing-Volume 107 . Australian Computer Society, Inc., 2010 : 61 - 70 . 13. Tian R , Batten L M, Versteeg S C. Function length as a tool for malware classification [C]//Malicious and Unwanted Software , 2008 . MALWARE 2008 . 3rd International Conference on. IEEE, 2008 : 69 - 76 . 14. Islam R , Tian R , Batten L , et al. Classification of malware based on string and function feature selection[C]//Cybercrime and Trustworthy Computing Workshop (CTC), 2010 Second. IEEE, 2010 : 9 - 17 .


This is a preview of a remote PDF: https://www.itm-conferences.org/articles/itmconf/pdf/2016/02/itmconf_ita2016_02001.pdf

Zhi-Peng Pan, Chao Feng, Chao-Jing Tang. Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network, ITM Web of Conferences, 2016, DOI: 10.1051/itmconf/20160702001