FSSDroid: Feature subset selection for Android malware detection

World Wide Web (2024) 27:50
https://doi.org/10.1007/s11280-024-01287-y

Nikolaos Polatidis¹ · Stelios Kapetanakis²,⁴ · Marcello Trovati³ · Ioannis Korkontzelos³ · Yannis Manolopoulos⁵

Received: 6 April 2024 / Revised: 27 June 2024 / Accepted: 2 July 2024
© The Author(s) 2024

Abstract

Android malware has become an increasingly important threat to individuals, organizations, and society, posing significant risks to data security, privacy, and infrastructure. As malware evolves in sophistication and complexity, detecting and mitigating it has become more challenging and time-consuming, since the number of features required to identify potential malware can be very high. To address this issue, we have developed an effective feature selection methodology for malware detection in Android. A critical concern in the field of malware detection is the complexity of the algorithms and of the features used to detect malware. The present paper delivers a methodology for pre-processing datasets to select the optimal features for detecting malware while maintaining very high accuracy. The proposed methodology has been tested on two real-world datasets, and the results indicate that the number of features is significantly reduced, from 489 to between 19 and 28 for the first dataset and from 9503 to between 9 and 27 for the second dataset, whilst accuracy is maintained as if all features were used.
Keywords Android · Malware detection · Feature selection · Machine learning · Binarization · Pre-processing

Corresponding author: Yannis Manolopoulos

¹ University of Brighton, Brighton BN2 4GJ, UK
² Distributed Analytics Solutions, London E14 6FD, UK
³ Edge Hill University, Ormskirk L39 4QP, UK
⁴ Middlesex University, London NW4 4BT, UK
⁵ Open University of Cyprus, Nicosia 2220, Cyprus

1 Introduction

The pervasive nature of malware poses a substantial threat to digital ecosystems worldwide. Malicious software, commonly known as malware, encompasses a broad spectrum of threats, including viruses, worms, trojans, ransomware, spyware, and others. The spread of malware attacks has escalated exponentially in recent years, resulting in severe financial losses, compromised privacy, and disrupted operations for both individuals and organizations. Consequently, detecting and mitigating malware has become a critical priority in the cybersecurity landscape.

Traditional signature-based detection systems, which rely on predefined patterns to identify known malware, are increasingly ineffective against sophisticated and polymorphic malware variants. In response, researchers have turned to machine learning (ML) algorithms and artificial intelligence (AI) techniques to develop more robust and adaptive malware detection systems. ML-based approaches leverage data-driven models to detect and classify malware by learning from vast amounts of labelled samples, capturing intricate patterns and behaviours that distinguish malicious software from benign programs [1–4].

Whilst ML algorithms have demonstrated remarkable success in malware detection, their effectiveness comes at the cost of using too many features, which in turn require more processing.
Many state-of-the-art ML models, such as Deep Neural Networks (DNNs), have been used with large sets of app features, which makes the data costly to process and makes it hard to understand how the models arrive at their conclusions. Datasets used in malware detection have become increasingly complex, containing up to thousands of features, some of which hold non-binary numerical values. Moreover, complex data call for more complex algorithms, such as neural networks and deep learning-based algorithms, which are time- and energy-consuming because they usually need to be retrained when new data arrive. Thus, it is of high interest to the community to identify a way to use the minimum possible number of optimally selected features, which would allow the use of less processing-intensive algorithms, such as Decision Trees and Random Forests, that can provide very high accuracy on smaller, balanced, binary datasets.

The primary objective of this research work is to deliver a robust methodology that converts all non-binary values of an Android malware dataset into binary values and then identifies a small number of optimal features that can be used to detect malware while maintaining high accuracy. The contributions of this article can be summarized as follows:

1. A novel feature selection methodology specifically tailored for Android malware detection is delivered.
2. The methodology has been evaluated using two real datasets, and the results indicate that the number of features is greatly reduced whilst high accuracy is maintained.

The paper is structured as follows. Section 2 presents and discusses the related work. Section 3 introduces the proposed methodology. Section 4 reports the experimental evaluation results, whereas Section 5 summarizes the conclusions and gives future work directions.

2 Related work

Feature selection is an important issue for ML [5, 6].
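To make the general idea of binarizing a dataset and then keeping only a small feature subset concrete, the following is a minimal Python sketch. It is not the FSSDroid procedure, whose details are given in Section 3 and are not reproduced here. The binarization rule (non-zero becomes 1), the scoring rule (gap between the per-class activation rates of a feature), the function names, and the toy data are all assumptions made purely for illustration.

```python
from typing import List, Sequence


def binarize(rows: Sequence[Sequence[float]]) -> List[List[int]]:
    """Map every value to 0/1. Assumed rule: any non-zero value becomes 1."""
    return [[1 if value != 0 else 0 for value in row] for row in rows]


def class_gap_scores(rows: Sequence[Sequence[int]], labels: Sequence[int]) -> List[float]:
    """Score each binary feature by |P(f=1 | malware) - P(f=1 | benign)|.

    This scoring rule is an illustrative assumption, not the paper's criterion.
    """
    malware = [row for row, label in zip(rows, labels) if label == 1]
    benign = [row for row, label in zip(rows, labels) if label == 0]
    if not malware or not benign:
        raise ValueError("both classes must be present")
    scores = []
    for j in range(len(rows[0])):
        p_malware = sum(row[j] for row in malware) / len(malware)
        p_benign = sum(row[j] for row in benign) / len(benign)
        scores.append(abs(p_malware - p_benign))
    return scores


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring features, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): 4 features per app; label 1 = malware, 0 = benign.
raw = [
    [3, 0, 1, 7],
    [5, 0, 0, 2],
    [0, 2, 1, 4],
    [0, 1, 0, 9],
]
labels = [1, 1, 0, 0]

binary = binarize(raw)
scores = class_gap_scores(binary, labels)
selected = select_top_k(scores, k=2)
reduced = [[row[j] for j in selected] for row in binary]
```

On this toy data, the last two features carry no class signal after binarization and are dropped. The resulting `reduced` matrix is the kind of small binary dataset that could then be passed to a lightweight classifier such as a Decision Tree or Random Forest; that step is omitted here to keep the sketch free of third-party dependencies.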
Android malware detection has been an active research area over the last decade, with contributions from across the world. In the sequel, we present the main characteristics of these seminal contributions in chronological order.

Drebin is a lightweight hybrid method, which was proposed in 2014. It uses both static and dynamic information to detect malware during runtime on an Android device. Drebin performs a broad static analysis of Android applications and automatically identifies typical patterns of malicious activities that can be presented and explained to the user. Drebin detects 94% of the malware in a large dataset with few false alarms [3].

Around the same period, a relevant contribution addressed the increasing concern over information security on Android mobile devices, where user control over sensitive data is overshadowed by the proliferation of applications. Focusing on permission-based malware detection, the study analyses feature selection methods and classification algorithms. Its findings indicate that the Random Forest and J48 decision tree algorithms exhibit higher performance across various feature selection methods, highlighting their effectiveness in detecting malicious software in Android applications [7].

Another relevant paper addresses the prevalence of malicious applications targeting the Android platform by proposing an ML-based approach for Android malware detection. Utilizing evolutionary Genetic algorithm (GA) for feature select (...truncated)