FSSDroid: Feature subset selection for Android malware detection
World Wide Web
(2024) 27:50
https://doi.org/10.1007/s11280-024-01287-y
Nikolaos Polatidis1 · Stelios Kapetanakis2,4 · Marcello Trovati3 ·
Ioannis Korkontzelos3 · Yannis Manolopoulos5
Received: 6 April 2024 / Revised: 27 June 2024 / Accepted: 2 July 2024
© The Author(s) 2024
Abstract
Android malware has become an increasingly important threat to individuals, organizations,
and society, posing significant risks to data security, privacy, and infrastructure. As malware
evolves in sophistication and complexity, the detection and mitigation of these malicious
software instances have become more challenging and time-consuming since the required
number of features to identify potential malware can be very high. To address this issue, we
have developed an effective feature selection methodology for malware detection in Android.
A critical concern in the field of malware detection is the complexity of the algorithms and
of the features used to detect malware. The present paper delivers a methodology for
pre-processing datasets to select an optimal subset of features that allows detecting malware
while maintaining very high accuracy. The proposed methodology has been tested on two
real-world datasets and the results indicate that the number of features is significantly reduced
from 489 to between 19 and 28 for the first dataset and from 9503 to between 9 and 27 for
the second dataset, whilst the accuracy is maintained as if all features were used.
Keywords Android · Malware detection · Feature selection · Machine learning ·
Binarization · Pre-processing
Corresponding author: Yannis Manolopoulos
1 University of Brighton, Brighton BN2 4GJ, UK
2 Distributed Analytics Solutions, London E14 6FD, UK
3 Edge Hill University, Ormskirk L39 4QP, UK
4 Middlesex University, London NW4 4BT, UK
5 Open University of Cyprus, Nicosia 2220, Cyprus
1 Introduction
The pervasive nature of malware poses a substantial threat to digital ecosystems worldwide. Malicious software, commonly known as malware, encompasses a broad spectrum
of threats, including viruses, worms, trojans, ransomware, spyware, and others. The spread
of malware attacks has escalated exponentially in recent years, resulting in severe financial
losses, compromised privacy, and disrupted operations for both individuals and organizations. Consequently, detecting and mitigating malware has become a critical priority in the
cybersecurity landscape. Traditional signature-based detection systems, which rely on predefined patterns to identify known malware, are increasingly ineffective against sophisticated
and polymorphic malware variants. In response, researchers have turned to machine learning (ML) algorithms and artificial intelligence (AI) techniques to develop more robust and
adaptive malware detection systems. ML-based approaches leverage the power of data-driven
models to detect and classify malware by learning from vast amounts of labelled samples,
capturing intricate patterns and behaviours that distinguish malicious software from benign
programs [1–4].
Whilst ML algorithms have demonstrated remarkable success in malware detection, their
effectiveness comes at the cost of using too many features, which in turn require more processing. Many state-of-the-art ML models, such as Deep Neural Networks (DNNs), have been
used with large sets of app features, which makes the data challenging to process and makes it hard to understand how the models
arrive at their conclusions. Datasets used in malware detection have become increasingly
complex, containing up to thousands of features, some of which take
non-binary numerical values. Moreover, when the data are complex, more complex algorithms
such as neural networks and deep learning-based algorithms become necessary. These are time- and energy-consuming, since they usually
need to be retrained when new data arrive.
Thus, it is of high interest to the community to identify a way to use the smallest
possible number of optimally selected features, which will allow the use of less processing-intensive
algorithms, such as Decision Trees and Random Forests, that can provide very high
accuracy on smaller, balanced, binary datasets.
The primary objective of this research work is to deliver a robust methodology that converts
all non-binary values of an Android malware dataset into binary values, and then identifies
a small number of optimal features that can be used to detect malware, while maintaining
high accuracy. The contributions of this article can be summarized as follows:
1. A novel feature selection methodology specifically tailored for Android malware detection is delivered.
2. The methodology has been evaluated using two real datasets, and the results indicate that
the number of features is greatly reduced, whilst high accuracy is maintained.
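To make the two stages named in the objective concrete (binarization followed by selection of a small feature subset), the following is a minimal illustrative sketch and not the procedure proposed in this paper. It rests on two assumptions that are not taken from the text: non-binary columns are thresholded at their median, and features are ranked by information gain with respect to the malware label. The function names `binarize`, `information_gain` and `select_top_k` are hypothetical.

```python
from math import log2
from statistics import median


def binarize(rows):
    """Return a 0/1 copy of `rows`.

    Columns that are already binary are kept unchanged. Any other column is
    mapped to 1 where the value exceeds the column median (assumed rule,
    chosen only for illustration).
    """
    n_cols = len(rows[0])
    out = [[0] * n_cols for _ in rows]
    for j in range(n_cols):
        col = [r[j] for r in rows]
        # For a 0/1 column, "value > 0" reproduces the column as-is.
        thresh = 0 if set(col) <= {0, 1} else median(col)
        for i, v in enumerate(col):
            out[i][j] = 1 if v > thresh else 0
    return out


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on a binary column."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def select_top_k(rows, labels, k):
    """Indices of the k binary features with the highest information gain.

    Ties are broken in favour of the lower column index.
    """
    scores = [
        (information_gain([r[j] for r in rows], labels), j)
        for j in range(len(rows[0]))
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Toy data (invented for illustration): 4 apps, 4 features, label 1 = malware.
rows = [
    [1, 0, 12, 3],
    [1, 1, 15, 3],
    [0, 0, 2, 3],
    [0, 1, 1, 3],
]
labels = [1, 1, 0, 0]

binary_rows = binarize(rows)
selected = select_top_k(binary_rows, labels, k=2)
print(selected)
```

In this toy example the third column is numeric and becomes informative only after thresholding, while the constant fourth column collapses to all zeros and receives zero gain. A reduced binary matrix of this kind could then be passed to a lightweight classifier such as a Decision Tree or Random Forest.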
The paper is structured as follows. Section 2 presents and discusses the related work.
Section 3 introduces the proposed methodology. Section 4 reports the experimental evaluation
results, whereas Section 5 summarizes the conclusions and gives future work directions.
2 Related work
Feature selection is an important issue for ML [5, 6]. Notably, Android malware
detection has been an area of active research over the last decade, with several contributions
across the world. In the sequel, we present the main characteristics of these seminal contributions in chronological order.
Drebin is a lightweight hybrid method, which was proposed in 2014. It uses both static
and dynamic information to detect malware during runtime in an Android device. Drebin
performs a broad static analysis of Android applications and automatically identifies typical
patterns of malicious activities that can be presented and explained to the user. Drebin enables
detecting 94% of the malware in a large dataset with few false alarms [3].
Around the same period, a relevant contribution addresses the increasing concern over
information security on Android mobile devices, where user control over sensitive data is
overshadowed by the proliferation of applications. Focusing on permission-based malware
detection, the study analyses feature selection methods and classification algorithms. Findings indicate that Random Forest and J48 decision tree algorithms exhibit higher performance
across various feature selection methods, highlighting their effectiveness in detecting malicious software in Android applications [7].
Another relevant paper addresses the prevalence of malicious applications targeting the
Android platform by proposing a ML-based approach for Android malware detection. Utilizing evolutionary Genetic algorithm (GA) for feature select (...truncated)