Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network
ITM Web of Conferences
Zhi-Peng PAN 0
Chao FENG 0
Chao-Jing TANG 0
0 College of Electronic Science and Engineering, National University of Defense Technology, 410073 Changsha, China
With the development of the Internet, malware has spread rapidly across networked systems. To cope with the diversity and volume of variants, a number of automated behavior analysis tools have emerged. Although these tools produce detailed behavior reports for malware, an analyst must still assign the category and judge the severity of each sample manually. In this paper, we propose an automated malware classification approach based on behavior analysis. We first perform dynamic analysis to obtain detailed behavior profiles of the malware, then abstract the main features from these profiles and use them as the inputs of a Back Propagation (BP) Neural Network model. The experimental results demonstrate that our technique classifies malware variants effectively and detects malware accurately.
Malicious software (malware), usually in the form of viruses,
Trojans, worms, botnets, rootkits, and other potentially
unwanted applications, has been the major threat to Internet
security. Malware developers use hiding techniques such as
polymorphism and obfuscation [ ] to defeat signature-based
detection and static malware analysis methods easily and
effectively [ ]. In contrast to static analysis, dynamic
analysis of malware is based on monitoring its behavior at
run time, which makes the malware more difficult to conceal
[ ], and it has become the mainstream method of malicious
behavior mining.
Yet dynamic techniques for malware detection are not
sufficient if they only produce detailed behavior profiles
[ ]. What we need is the ability to automatically categorize
malware and detect it by its behavior.
The main contributions of this paper are as follows:
1) Unlike many previous algorithms that monitor malware
behavior directly on low-level data such as API calls [ ],
we implement an automatic dynamic analysis framework that
takes advantage of existing behavior analysis systems. We
obtain the detailed behaviors of the malware, including
process behaviors, registry behaviors, file behaviors,
network behaviors, and other behaviors.
2) We extract the major features of the malware behavior
profiles into behavior vectors by counting the occurrences
of each behavior.
3) We propose a Back Propagation (BP) Neural Network model
[ ] for learning the behavior patterns shared by malware of
the same category and for classifying malware. The
experimental section verifies the correctness and precision
of our algorithm.
There are three major methods for classifying malicious
software: traditional pattern matching, static analysis, and
dynamic analysis. Although static analysis can be more
accurate than traditional pattern matching, it has
difficulty handling obfuscated and self-modifying code.
In dynamic analysis, Konrad Rieck et al. [ ] proposed a
method that uses CWSandbox to analyze the behavior of
malware and then uses Support Vector Machines (SVM) for
learning and classification. Forrest [ ] proposed a
fixed-length N-gram sequence recognition model based on
system calls. Syed Zainudeen Mohd Shaid [ ] proposed a
behavior-based technique that visualizes malware behavior in
the form of images. This method uses different colors to
indicate different API calls. With these behavior images, it
becomes possible to identify malware variants of the same
family visually. Guanghui Liang et al. [ ] captured malware
behaviors on the Temu platform and proposed a weighted
Jaccard similarity matching algorithm to classify malware
variants.
In summary, when dealing with malware variant
classification, behavior analysis is the most effective
method. In this paper, we use an existing, effective
behavior analysis system to analyze the malware, in contrast
to the approaches above that only monitor API calls or API
traces.
The classification of malware variants has concerned
analysts for a long time [ ]. Evolving malware generates
many variants and poses great challenges to analytical work.
Although these variants differ in file format and
appearance, they still share the same behavior patterns. For
example, all variants of the Allaple worm acquire and lock
particular mutexes on infected systems. We aim to exploit
these behavior patterns using machine learning techniques
and propose a method that classifies malware variants
automatically based on their behavior. An outline of our
approach is given by the following basic steps:
1) Malware Data Acquisition. A corpus of malware binaries is
obtained by collecting suspicious files uploaded to the
Kafan Forum. The multi-engine online virus scan system
VirSCAN is applied to identify the known malware families.
2) Behavior Monitoring. Malware binaries are executed and
monitored by the HABO behavior analysis system, which
generates detailed behavior reports.
3) Feature Extraction. Features that reflect the behavior
patterns, such as processes created, foreign memory regions
read, mutexes created, or registry keys modified, are
extracted from the analysis reports and used to map the
malware behavior into a high-dimensional vector space.
4) Learning and Classification. A Back Propagation neural
network model is trained to classify the malware.
3.1 Malware data acquisition
We obtained 13600 unique samples, uploaded by the many users
of the Kafan Forum, for learning and subsequent
classification. After obtaining the samples, we applied the
online virus scan system VirSCAN to partition the malware
into common families, such as Adware, Potentially Unwanted
Application (PUA), and Trojan/Downloader. Note that we chose
VirSCAN rather than a single antivirus product, such as
Avira or Kaspersky, to label the malware because VirSCAN is
multi-engine, so we can take the label reported by most
engines for each sample. We selected the 9 most common
malware categories and one Non-Malware category for our
samples. The families listed in Table 1 represent a broad
range of malware categories such as Adware, PUA, and
Trojans, and the Non-Malware category allows the approach to
be extended directly to malware detection.
Table 1 Malware Families Labeled by the VirSCAN System
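The multi-engine labeling rule described in this section can be sketched as a simple vote over the per-engine results. This is only an illustration: the function name `majority_label`, the use of `None` for an engine that reports nothing, and the fallback to a "Non-Malware" label when no engine flags the file are assumptions, not details given in the paper.

```python
from collections import Counter


def majority_label(engine_labels, default="Non-Malware"):
    """Return the family reported by the most engines.

    engine_labels: one entry per scan engine; None means the engine
    did not flag the file. The default label is an assumption.
    """
    votes = Counter(label for label in engine_labels if label)
    if not votes:
        return default
    # most_common(1) yields [(label, count)]; ties keep the first-seen label.
    label, _count = votes.most_common(1)[0]
    return label


# Hypothetical scan result for one uploaded sample.
print(majority_label(["Adware", "Adware", None, "PUA", "Adware", "Trojan"]))
```

A real labeling step would also need to normalise the different family names that individual engines use for the same malware, which is not covered here.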
3.2 Behavior Monitoring and Feature Extraction
In this section, we use the online behavior analysis system
HABO to monitor the behavior of the samples. Like most other
online behavior analysis systems, it analyzes uploaded
binaries and returns a detailed behavior report about the
malware. Note that our methodology is not bound to the HABO
system; it can also be adapted to other behavior analysis
systems.
Figure 1 shows part of the behavior report of one malware
sample. It covers five main aspects: process behaviors, file
behaviors, registry behaviors, network behaviors, and other
behaviors. These 5 main aspects contain 73 sub-behaviors,
which describe the behavior of the malware in detail.
Although the reports show detailed information about the
behavior of the malware, they cannot be used by the BP
neural network model directly, because the model needs
vector data as its input. Hence we first extract the main
features of the malware behavior from the reports. The
method used here is called the frequency statistics method.
The main steps are as follows:
1) Given all the sub-behaviors, we arrange them in a fixed
order, such as create local thread, enumerate process,
create a new file process, and so on.
2) We count the occurrences of each sub-behavior of the
malware. For example, the behavior shown in Figure 1 can be
represented as [2, 1, 1, ..., 15, ...].
Figure 2 shows the detailed behavior vector of one malware
sample; a zero means that the malware does not exhibit the
corresponding behavior.
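The frequency statistics method above can be sketched in a few lines. The sub-behavior identifiers below are hypothetical stand-ins (the paper fixes an order over 73 sub-behaviors but names only a few), and the report is assumed to have already been parsed into a flat list of observed sub-behavior names.

```python
from collections import Counter

# Hypothetical, shortened sub-behavior list in a fixed order.
# The paper uses 73 sub-behaviors; only four are shown here.
SUB_BEHAVIORS = [
    "create_local_thread",
    "enumerate_process",
    "create_new_file_process",
    "modify_registry_key",
]


def behavior_vector(events, sub_behaviors=SUB_BEHAVIORS):
    """Count how often each sub-behavior occurs, in the fixed order."""
    counts = Counter(events)
    # A Counter returns 0 for sub-behaviors that never occurred.
    return [counts[name] for name in sub_behaviors]


# Assumed output of a report parser for one sample.
report_events = [
    "create_local_thread",
    "enumerate_process",
    "create_local_thread",
    "create_new_file_process",
]
print(behavior_vector(report_events))  # [2, 1, 1, 0]
```

Keeping the order of `SUB_BEHAVIORS` fixed across all samples is what makes the resulting vectors comparable as network inputs.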
3.3 Learning and classification
3.3.1 Establish the Model
According to the previous description, the dimension of our
behavior vector is 73. We use the vector
$r = [r_1, r_2, r_3, \ldots, r_{10}]$, with $r_i = 0$ or $1$,
to represent the output result. Each element of the output
vector is 0 or 1: 0 means that the malware does not belong
to the corresponding category, while 1 means that it does.
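As a small illustration of this output encoding, the sketch below builds the 10-element target vector for a category numbered from 1 to 10. The decoding rule (pick the category with the largest network output) is an assumption; the paper does not state how real-valued outputs are mapped back to a category.

```python
def one_hot(category, num_categories=10):
    """Target vector r with r_i = 1 only for the sample's category (1-based)."""
    if not 1 <= category <= num_categories:
        raise ValueError("category must be between 1 and num_categories")
    return [1 if i == category else 0 for i in range(1, num_categories + 1)]


def decode(outputs):
    """Assumed decoding rule: the 1-based index of the largest output."""
    return max(range(len(outputs)), key=lambda i: outputs[i]) + 1


print(one_hot(3))               # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(decode([0.1, 0.7, 0.2]))  # 2
```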
Given the input vector and the output vector, we established
a Back Propagation (BP) Neural Network, which includes one
input layer, one hidden layer, and one output layer. The
input layer and the hidden layer both have 73 neurons, and
the output layer has 10 neurons. Additionally, the hidden
and output neurons include adjustment factors $a$ and $b$
respectively. The connection weights between the input layer
and the hidden layer, and between the hidden layer and the
output layer, are denoted by $W_{ij}$ ($1 \le i \le 73$,
$1 \le j \le 73$) and $V_{jl}$ ($1 \le j \le 73$,
$1 \le l \le 10$). The network is shown in the accompanying
figure. The hidden layer applies an activation function to
its net input and, for convenience, the input of the
network, the net input of hidden neuron $h_j$, and the net
input of output neuron $O_l$ are denoted by $x_i$
($1 \le i \le 73$), $net(h_j)$ ($1 \le j \le 73$), and
$net(O_l)$ ($1 \le l \le 10$) respectively.
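The outputs of hidden neuron $h_j$ and output neuron $O_l$
are $out(h_j)$ and $out(O_l)$. We define the error signals
of the output layer and the hidden layer as $\delta_{oh}$
and $\delta_{hi}$ respectively. Additionally,

$$E = \frac{1}{2} \sum_{l=1}^{10} \left( target(O_l) - out(O_l) \right)^2$$

is defined as the output error. According to the Back
Propagation Neural Network algorithm, we can derive the
update expressions for the connection weights:

$$V_{jl} = V_{jl} + \eta \, \delta_{oh} \, out(h_j) \qquad (1)$$

$$W_{ij} = W_{ij} + \eta \, \delta_{hi} \, x_i \qquad (2)$$

where $\delta_{oh} = \left( target(O_l) - out(O_l) \right) out'(O_l)$,
$\delta_{hi} = \left( \sum_{l=1}^{10} \delta_{oh} V_{jl} \right) out'(h_j)$,
and the symbol $\eta$ denotes the learning rate of the network.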
In this paper, the transform factor between the output
layer and the hidden layer is denoted by $a$, and $b$ denotes
the transform factor between the hidden layer and the input
layer.
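To make the weight update concrete, the following is a minimal sketch of one standard back propagation step for a network with one hidden layer. It rests on several assumptions that are not taken from the paper: both layers use a sigmoid activation, the adjustment factors $a$ and $b$ are left out, weights start as small random values, and a toy 4-3-2 network stands in for the 73-73-10 network. The helper names (`sigmoid`, `init_weights`, `forward`, `train_step`) are illustrative.

```python
import math
import random


def sigmoid(z):
    """Logistic activation; its derivative is out * (1 - out)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(rows, cols, rng):
    """Small random weights in [-0.5, 0.5] (an assumed initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(x, W, V):
    """W[i][j] links input i to hidden j; V[j][l] links hidden j to output l."""
    out_h = [sigmoid(sum(x[i] * W[i][j] for i in range(len(x))))
             for j in range(len(V))]
    out_o = [sigmoid(sum(out_h[j] * V[j][l] for j in range(len(V))))
             for l in range(len(V[0]))]
    return out_h, out_o


def train_step(x, target, W, V, eta=0.5):
    """Update W and V in place once; return the error E before the update."""
    out_h, out_o = forward(x, W, V)
    # Output error signal: (target - out) * out'(O_l).
    delta_o = [(target[l] - out_o[l]) * out_o[l] * (1.0 - out_o[l])
               for l in range(len(out_o))]
    # Hidden error signal, computed with V as it was before this update.
    delta_h = [sum(delta_o[l] * V[j][l] for l in range(len(out_o)))
               * out_h[j] * (1.0 - out_h[j])
               for j in range(len(out_h))]
    for j in range(len(out_h)):
        for l in range(len(out_o)):
            V[j][l] += eta * delta_o[l] * out_h[j]
    for i in range(len(x)):
        for j in range(len(out_h)):
            W[i][j] += eta * delta_h[j] * x[i]
    return 0.5 * sum((target[l] - out_o[l]) ** 2 for l in range(len(out_o)))


# Toy 4-3-2 network trained repeatedly on a single sample.
rng = random.Random(0)
W = init_weights(4, 3, rng)
V = init_weights(3, 2, rng)
x, target = [2, 1, 1, 0], [1, 0]
errors = [train_step(x, target, W, V) for _ in range(200)]
print(errors[0], errors[-1])
```

With raw behavior counts as inputs, large values can saturate the sigmoid, so some input scaling would likely be needed in practice; the paper does not describe one.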
1) The identification of the output function
Because there is no established choice of output function,
we can only determine it by experiment. For the same samples
and 10000 training iterations, we obtain the results shown
in Table 2. Based on the table, we choose the linear
function $f(x) = 1.0 \cdot x / 5000$.
The values of the adjustment factors $a$ and $b$ have an
important influence on the speed of convergence. Figure 4
and Figure 5 show the relationships between the adjustment
factors and the convergence time when the total error is set
to 0.64.
To evaluate the performance of our methodology, we first
divided the malware corpus randomly into a training
partition and a testing partition, with 10000 and 3600
samples respectively. We used the training partition to
train the BP neural network and the testing partition to
measure the overall performance of our methodology. This
procedure is repeated over five independent experimental
runs, and we report the average values as our final results.
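The evaluation protocol above can be sketched as follows. The `evaluate` callback is a hypothetical placeholder for "train the network on the training partition and return its accuracy on the testing partition", and drawing a fresh random split for each of the five runs is an assumption about how the runs were made independent.

```python
import random
from statistics import mean


def split_corpus(samples, train_size, rng):
    """Shuffle a copy of the corpus and cut it into training and testing partitions."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def average_over_runs(samples, train_size, evaluate, runs=5, seed=0):
    """Repeat split-and-evaluate `runs` times and return the mean score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = split_corpus(samples, train_size, rng)
        scores.append(evaluate(train, test))
    return mean(scores)


# Stand-in corpus of 13600 sample ids and a dummy evaluation function
# that only reports the testing fraction; a real one would train a model.
corpus = list(range(13600))


def dummy_evaluate(train, test):
    return len(test) / (len(train) + len(test))


print(average_over_runs(corpus, 10000, dummy_evaluate))
```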
The per-category accuracy for this experiment is shown in
Figure 6, where the error bars indicate the variance
measured across the experimental runs. The figure shows that
our average accuracy is up to 86%. In particular, the last
category, which we defined as non-malware, has a prediction
accuracy of up to 99%. In other words, if we use this model
to decide whether a binary in our corpus is malware or not,
the probability of a correct decision is approximately 99%.
This result shows that our methodology can easily be
extended to malware detection. Furthermore, because the
boundaries of categories 3, 4, 5, 6, and 9, which are
labelled Trojan, Tr/downloader, Tr/Crypt, Tr/dropper, and
Win32, are less distinct, the variance of these categories
is higher than that of the other categories.
Figure 7 shows the confusion matrix for the classification.
The deeper the color of a category, the less likely it is to
be misclassified into other categories. The figure shows
that categories 3 to 7 are easily confused with each other
and that the Win32 category is the most likely to be
classified into other categories, which is consistent with
what we observe in practice.
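For readers who want to reproduce this kind of analysis, the sketch below builds a confusion matrix from true and predicted category numbers and derives a per-category accuracy from it. It assumes that per-category accuracy means the fraction of samples of a category that are predicted as that category; the paper does not define the measure explicitly, and the toy labels are made up for illustration.

```python
def confusion_matrix(true_labels, predicted_labels, num_categories=10):
    """matrix[t-1][p-1] counts samples of true category t predicted as p (1-based)."""
    matrix = [[0] * num_categories for _ in range(num_categories)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t - 1][p - 1] += 1
    return matrix


def per_category_accuracy(matrix):
    """Diagonal count divided by row total; None for categories with no samples."""
    accuracies = []
    for k, row in enumerate(matrix):
        total = sum(row)
        accuracies.append(row[k] / total if total else None)
    return accuracies


# Toy example with 3 categories and made-up predictions.
true_labels = [1, 1, 2, 3, 3, 3]
predicted = [1, 2, 2, 3, 3, 1]
matrix = confusion_matrix(true_labels, predicted, num_categories=3)
print(matrix)  # [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
print(per_category_accuracy(matrix))
```

Off-diagonal entries in a row show which other categories a family is confused with, which is the information the color intensity in the confusion matrix figure conveys.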
In this paper, the behavior of the malware is captured by an
online behavior analysis system. We then extract the main
features of the malware in the form of a vector and use it
as the input of our Back Propagation Neural Network model.
Finally, after training the BP Neural Network, we use it to
classify and detect malware. Experimental results show that
our methodology classifies malware variants effectively and
detects malware accurately.
In future work, we will focus on how to extract features
from the behavior analysis reports that represent the
malware more accurately.
1. Rad B B, Masrom M, Ibrahim S. Camouflage in malware: from encryption to metamorphism [J]. International Journal of Computer Science and Network Security, 2012, 12(8): 74-83.
2. Moser A, Kruegel C, Kirda E. Limits of static analysis for malware detection [C]//Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual. IEEE, 2007: 421-430.
3. Willems C, Holz T, Freiling F. Toward automated dynamic malware analysis using CWSandbox [J]. IEEE Security & Privacy, 2007(2): 32-39.
4. Egele M, Scholte T, Kirda E, et al. A survey on automated dynamic malware-analysis techniques and tools [J]. ACM Computing Surveys (CSUR), 2012, 44(2): 6.
5. Bayer U, Comparetti P M, Hlauschek C, et al. Scalable, Behavior-Based Malware Clustering [C]//NDSS. 2009, 9: 8-11.
6. Forrest S, Hofmeyr S, Somayaji A. The evolution of system-call monitoring [C]//Computer Security Applications Conference, 2008. ACSAC 2008. Annual. IEEE, 2008: 418-430.
7. Irwin G W, Warwick K, Hunt K J. Neural network applications in control [M]. IET, 1995.
8. Rieck K, Holz T, Willems C, et al. Learning and classification of malware behavior [M]//Detection of Intrusions and Malware, and Vulnerability Assessment. Springer Berlin Heidelberg, 2008: 108-125.
9. Shaid M, Zainudeen S, Maarof M A. Malware behavior image for malware variant identification [C]//Biometrics and Security Technologies (ISBAST), 2014 International Symposium on. IEEE, 2014: 238-243.
10. Liang G, Pang J, Dai C. A Behavior-Based Malware Variant Classification Technique [J]. International Journal of Information and Education Technology, 2016, 6(4): 291.
11. Park Y, Reeves D, Mulukutla V, et al. Fast malware classification by automated behavioral graph matching [C]//Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research. ACM, 2010: 45.
12. Cesare S, Xiang Y. Classification of malware using structured control flow [C]//Proceedings of the Eighth Australasian Symposium on Parallel and Distributed Computing-Volume 107. Australian Computer Society, Inc., 2010: 61-70.
13. Tian R, Batten L M, Versteeg S C. Function length as a tool for malware classification [C]//Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on. IEEE, 2008: 69-76.
14. Islam R, Tian R, Batten L, et al. Classification of malware based on string and function feature selection [C]//Cybercrime and Trustworthy Computing Workshop (CTC), 2010 Second. IEEE, 2010: 9-17.