Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework

Security Informatics, Feb 2016

Attackers increasingly take advantage of naive users who tend to treat non-executable files casually, as if they were benign. Such users often open non-executable files even though these files can conceal and perform malicious operations. Defensive solutions currently used by organizations prevent executable files from entering organizational networks via web browsers or email messages. Therefore, recent advanced persistent threat attacks tend to leverage non-executable files such as portable document format (PDF) documents, which organizations use daily. Machine learning (ML) methods have recently been applied to detect malicious PDF files; however, these techniques lack an essential element: they cannot be efficiently updated daily. In this study we present an active learning (AL) based framework, specifically designed to help anti-virus vendors focus their analytical efforts on acquiring novel malicious content. This focus is accomplished by identifying and acquiring both new PDF files that are most likely malicious and informative benign PDF documents. These files are used for retraining and enhancing the knowledge stores of both the detection model and the anti-virus. We propose two AL based methods: exploitation and combination. Our methods are evaluated over 10 days and compared to an existing AL method (SVM-margin) and to random sampling. Results indicate that on the last day of the experiment, combination outperformed all of the other methods, enriching the signature repository of the anti-virus with almost seven times more new malicious PDF files, while improving the detection model’s capabilities further each day. At the same time, it reduces security experts’ efforts by 75%. Despite this significant reduction, results also indicate that our framework detects new malicious PDF files better than leading anti-virus tools commonly used by organizations for protection against malicious PDF files.
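The three acquisition strategies named in the abstract can be illustrated with a minimal sketch. The abstract only names the methods, so the selection rules below are assumptions made for illustration: SVM-margin is taken to pick the files closest to the separating hyperplane, exploitation is taken to pick the files deepest on the malicious side, and combination is taken to split the daily budget between the two through a hypothetical `margin_fraction` parameter. The signed distances are assumed to come from an already trained classifier, with positive values meaning "malicious side".

```python
from typing import List, Sequence


def svm_margin_select(distances: Sequence[float], budget: int) -> List[int]:
    """Pick the files closest to the hyperplane (smallest |distance|).

    Assumption: these are the files the classifier is least certain about.
    """
    order = sorted(range(len(distances)), key=lambda i: abs(distances[i]))
    return order[:budget]


def exploitation_select(distances: Sequence[float], budget: int) -> List[int]:
    """Pick the files deepest on the (assumed) malicious side.

    Assumption: a positive signed distance means "classified as malicious".
    """
    malicious = [i for i, d in enumerate(distances) if d > 0]
    malicious.sort(key=lambda i: distances[i], reverse=True)
    return malicious[:budget]


def combination_select(
    distances: Sequence[float], budget: int, margin_fraction: float
) -> List[int]:
    """Spend part of the budget on margin files, the rest on exploitation.

    `margin_fraction` is a hypothetical knob, not a parameter from the paper.
    """
    n_margin = int(budget * margin_fraction)
    chosen = svm_margin_select(distances, n_margin)
    taken = set(chosen)
    # Fill the remaining budget with the deepest malicious-side files
    # that were not already selected by the margin criterion.
    for i in exploitation_select(distances, len(distances)):
        if len(chosen) >= budget:
            break
        if i not in taken:
            chosen.append(i)
            taken.add(i)
    return chosen


if __name__ == "__main__":
    # Toy signed distances for six unlabeled PDF files (made-up values).
    toy = [0.1, -0.2, 2.5, 1.0, -3.0, 0.05]
    print(svm_margin_select(toy, 2))
    print(exploitation_select(toy, 2))
    print(combination_select(toy, 4, margin_fraction=0.5))
```

In this sketch the selected indices would be handed to a security expert for labeling, and the labeled files would then be added to the training set before the next day's retraining. Lowering `margin_fraction` over successive days is one way to shift from improving the model toward enriching the signature repository, but that schedule is also an assumption.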

Full text (PDF): https://link.springer.com/content/pdf/10.1186%2Fs13388-016-0026-3.pdf

Nir Nissim (0), Aviad Cohen (0), Robert Moskovitch (1), Asaf Shabtai (0), Matan Edri (0), Oren BarAd (0), Yuval Elovici (0)

(0) Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel
(1) Department of Biomedical Informatics, Columbia University, New York, USA

Keywords: Active learning; Machine learning; PDF; Malware

Introduction

Cyber-attacks aimed at organizations have increased since 2009, with 91% of all organizations hit by cyber-attacks in 2013 [1]. Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying on and monitoring an organization, and disrupting an organization’s actions. Attackers may be motivated by ideology, criminal intent, a desire for publicity, and more. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations. Attackers often use social engineering to encourage recipients to click a link or open a malicious web page or attachment. According to Trend Micro [2], attacks, particularly those against government agencies and large corporations, are largely dependent upon spear-phishing [3] emails.

An incident in 2014 aimed at the Israeli Ministry of Defense (IMOD) provides an example of a new type of targeted cyber-attack involving non-executable files attached to an email. According to media reports [4], the attackers posed as IMOD representatives and sent email messages with a malicious portable document format (PDF) file attachment which, when opened, installed a Trojan horse enabling the attacker to control the computer.

Non-executable files attached to an email are a component of many other recent cyber-attacks as well. This type of attack has grown in popularity for two reasons: executable files (e.g., *.EXE) attached to emails are filtered by most email servers because of the risk they pose, whereas non-executables (e.g., *.PDF, *.DOC) are not filtered out and are considered safe by most users. Non-executable files are written in a format that can be read only by a program specifically designed for that purpose and often cannot be directly executed. For example, a PDF file can be read only by (...truncated)

[1] http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber attacks-in-2013/.
[2] http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacksstart-with-a-spearphishing-email/.

© 2016 Nissim et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.



Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren BarAd, Yuval Elovici. Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework, Security Informatics, 2016, pp. 1, Volume 5, Issue 1, DOI: 10.1186/s13388-016-0026-3