Hierarchical adaptive evolution framework for privacy-preserving data publishing
World Wide Web
https://doi.org/10.1007/s11280-024-01286-z
Mingshan You1 · Yong-Feng Ge1 · Kate Wang2 · Hua Wang1 · Jinli Cao3 ·
Georgios Kambourakis4
Received: 28 February 2024 / Revised: 10 May 2024 / Accepted: 2 July 2024
© The Author(s) 2024
Abstract
The growing need for data publication and the escalating concerns regarding data privacy have
led to a surge in interest in Privacy-Preserving Data Publishing (PPDP) across research, industry, and government sectors. Despite its significance, PPDP remains a challenging NP-hard
problem, particularly when dealing with complex datasets, often rendering traditional traversal search methods inefficient. Evolutionary Algorithms (EAs) have emerged as a promising
approach in response to this challenge, but their effectiveness, efficiency, and robustness
in PPDP applications still need to be improved. This paper presents a novel Hierarchical
Adaptive Evolution Framework (HAEF) that aims to optimize t-closeness anonymization
through attribute generalization and record suppression using Genetic Algorithm (GA) and
Differential Evolution (DE). The first hierarchy of HAEF employs
a GA-prioritized adaptive strategy that enhances explorative search while balancing
exploration and exploitation between GA and DE. The second hierarchy employs a
random-prioritized adaptive strategy to select distinct mutation strategies, thus leveraging
the advantages of various mutation strategies. Performance benchmark tests demonstrate the
effectiveness and efficiency of the proposed technique. In 16 test instances, HAEF significantly outperforms traditional depth-first traversal search and exceeds the performance of
previous state-of-the-art EAs on most datasets. In terms of overall performance, under the
three privacy constraints tested, HAEF outperforms the conventional depth-first search (DFS) by an average of 47.78%, the state-of-the-art GA-based ID-DGA method by an average of 37.38%,
and the hybrid GA-DE method by an average of 8.35% in TLEF. Furthermore, ablation
experiments confirm the effectiveness of the various strategies within the framework. These
findings enhance the efficiency of the data publishing process, ensuring privacy and security
and maximizing data availability.
Keywords Privacy-preserving data publishing · t-closeness anonymization · Genetic
algorithm · Differential evolution · Adaptive strategy
Corresponding author: Yong-Feng Ge
Extended author information available on the last page of the article
1 Introduction
In today’s digital age, data plays a crucial role in driving innovation and decision-making,
while the issue of privacy has become increasingly pertinent [1–5]. Privacy-Preserving Data
Publishing (PPDP) has emerged as a paramount need, seeking to strike a delicate balance
between sharing valuable information and safeguarding individuals’ sensitive data [6–9].
This practice entails the dissemination of datasets that have been carefully anonymized or
transformed to protect the privacy of individuals while still allowing researchers and organizations to extract meaningful insights [10–13]. Due to a heightened awareness of privacy
risks, stricter regulations related to data protection, and an understanding of the necessity of
responsible data processing, the popularity of privacy-protected data is on the rise [14–16].
Preserving data privacy while maintaining data utility remains one of the most pressing and
complex challenges in the field [2, 17–19]. Various
approaches have emerged to address this challenge, including data anonymization, differential privacy, secure multiparty computation, and homomorphic encryption [20–23]. Among
these techniques, data anonymization is widely adopted. It involves modifying or removing identifying information from a dataset to protect individual privacy. Techniques such
as generalization, suppression, perturbation, or data synthesis are employed to mitigate the
risk of re-identification while ensuring the usefulness of the analyzed data [24, 25]. However, despite its widespread adoption and effectiveness in enhancing privacy protection, data
anonymization is computationally demanding: finding an optimal anonymization is NP-hard [26], so an exact solution is often impractical to obtain within a
reasonable time frame.
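As a rough illustration of the generalization and suppression operations mentioned above, the following Python sketch coarsens two quasi-identifiers and then drops records whose generalized group is too small. The toy records, the field names, and the parameters `age_width`, `zip_digits`, and `k` are assumptions made for this example. The group-size threshold is a simple stand-in for a privacy constraint and is not the t-closeness criterion optimized in this paper.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, diagnosis). The first two fields are
# treated as quasi-identifiers; the data and field names are illustrative only.
RECORDS = [
    (34, "30301", "flu"),
    (36, "30305", "asthma"),
    (38, "30309", "flu"),
    (52, "30412", "diabetes"),
    (57, "30418", "flu"),
    (71, "31990", "asthma"),
]


def generalize(record, age_width=10, zip_digits=3):
    """Coarsen the quasi-identifiers: bucket the age, mask trailing ZIP digits."""
    age, zip_code, diagnosis = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    masked_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, masked_zip, diagnosis)


def anonymize(records, k=2):
    """Generalize every record, then suppress records in groups smaller than k."""
    generalized = [generalize(r) for r in records]
    group_sizes = Counter((age, z) for age, z, _ in generalized)
    kept = [r for r in generalized if group_sizes[(r[0], r[1])] >= k]
    suppressed = len(generalized) - len(kept)
    return kept, suppressed


kept, suppressed = anonymize(RECORDS, k=2)
```

Wider age buckets or fewer retained ZIP digits lower the number of suppressed records but also reduce the detail left in the published data. Searching over such generalization levels is the kind of trade-off that makes exhaustive traversal expensive.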
Some studies have introduced Evolutionary Algorithms (EAs), a common solution to
NP-hard problems [27], to optimize data anonymization schemes. Among EAs, Differential
Evolution (DE) and Genetic Algorithm (GA) are popular choices [28]. These algorithms
find wide applications in fields such as engineering, machine learning, and bioinformatics,
providing flexible and powerful optimization methods [25, 29]. DE uses differential mutation
and crossover to evolve populations, emphasizing exploitation. On the other hand, GA models
natural selection and genetics, focusing more on exploration. Both methods have shown
effectiveness in solving complex optimization problems. This paper proposes an innovative
framework that combines GA with DE and its variants for better performance and
robustness in PPDP.
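To make the operators named above concrete, the sketch below shows two standard DE mutation variants (DE/rand/1 and DE/best/1) next to a basic GA step (one-point crossover plus per-gene reset mutation). It assumes a real-valued vector encoding; the function names, the scale factor `f`, and `mutation_rate` are illustrative choices, and the encoding and operator settings used by HAEF may differ.

```python
import random


def de_rand_1(population, i, f=0.5, rng=random):
    """DE/rand/1: the base vector and the difference pair are drawn at random."""
    candidates = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    a, b, c = population[r1], population[r2], population[r3]
    return [x + f * (y - z) for x, y, z in zip(a, b, c)]


def de_best_1(population, i, best, f=0.5, rng=random):
    """DE/best/1: perturb the current best vector with one random difference."""
    candidates = [j for j in range(len(population)) if j != i]
    r1, r2 = rng.sample(candidates, 2)
    b, c = population[r1], population[r2]
    return [x + f * (y - z) for x, y, z in zip(best, b, c)]


def ga_offspring(parent_a, parent_b, low, high, mutation_rate=0.1, rng=random):
    """GA step: one-point crossover, then reset each gene with a small probability."""
    point = rng.randrange(1, len(parent_a))
    child = parent_a[:point] + parent_b[point:]
    return [rng.uniform(low, high) if rng.random() < mutation_rate else gene
            for gene in child]
```

The random-based variant builds each trial vector from randomly chosen individuals, while the best-based variant starts every trial vector from the current best solution, which tends to pull the population toward it more quickly.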
Previous academic studies have delved into using GA and DE algorithms for data
anonymization. However, it is important to note that both GA and DE have multiple variants,
and the efficacy of these variants, when applied to this problem, has yet to be thoroughly
tested and evaluated. The performance of these different variants on data anonymization
problems can vary significantly and needs to be explored experimentally. This exploration
can help identify the most effective algorithm variants for optimizing data anonymization
schemes. Additionally, the performance of different algorithms on different datasets can be
uneven. Therefore, there is an urgent need for the development of more robust and effective
algorithms.
This paper aims to enable a more practical application of EAs in data anonymization
by improving the performance and robustness of the algorithm. We develop an
effective adaptive strategy that dynamically combines the strengths of GA and DE mutation
strategies, resulting in improved algorithm performance. Furthermore, we design a GA priority strategy and a random-based-DE priority strategy, which use GA and random-based-DE
with greater probability in the early stage of evolution to enhance the population diversity
and explorative search. In the later stages of evolution, best-based-DE is predominantly utilized to expedite the convergence of the search process and strengthen exploitative search. This paper
contributes to PPDP in the following ways.
• This paper introduces an innovative Hierarchical Adaptive Evolution Framework
(HAEF), seamlessly integrating GA, DE, and variants of DE. In contrast to previous
algorithms utilizing solely GA or both GA and DE with a single mutat (...truncated)