Hierarchical adaptive evolution framework for privacy-preserving data publishing

World Wide Web
https://doi.org/10.1007/s11280-024-01286-z

Hierarchical adaptive evolution framework for privacy-preserving data publishing

Mingshan You¹ · Yong-Feng Ge¹ · Kate Wang² · Hua Wang¹ · Jinli Cao³ · Georgios Kambourakis⁴

Received: 28 February 2024 / Revised: 10 May 2024 / Accepted: 2 July 2024
© The Author(s) 2024

Abstract

The growing need for data publication and escalating concerns about data privacy have led to a surge of interest in Privacy-Preserving Data Publishing (PPDP) across research, industry, and government sectors. Despite its significance, PPDP remains a challenging NP-hard problem, particularly for complex datasets, which often renders traditional traversal search methods inefficient. Evolutionary Algorithms (EAs) have emerged as a promising approach to this challenge, but their effectiveness, efficiency, and robustness in PPDP applications still need to be improved. This paper presents a novel Hierarchical Adaptive Evolution Framework (HAEF) that aims to optimize t-closeness anonymization through attribute generalization and record suppression using Genetic Algorithm (GA) and Differential Evolution (DE). To balance GA and DE, the first hierarchy of HAEF employs a GA-prioritized adaptive strategy that enhances exploratory search, striking a balance between exploration and exploitation. The second hierarchy employs a random-prioritized adaptive strategy to select among distinct mutation strategies, thus leveraging the advantages of each. Performance benchmark tests demonstrate the effectiveness and efficiency of the proposed technique. In 16 test instances, HAEF significantly outperforms traditional depth-first traversal search and exceeds the performance of previous state-of-the-art EAs on most datasets.
In terms of overall performance, under the three privacy constraints tested, HAEF outperforms the conventional depth-first search (DFS) by an average of 47.78%, the state-of-the-art GA-based ID-DGA method by an average of 37.38%, and the hybrid GA-DE method by an average of 8.35% in TLEF. Furthermore, ablation experiments confirm the effectiveness of the various strategies within the framework. These findings enhance the efficiency of the data publishing process, ensuring privacy and security while maximizing data availability.

Keywords Privacy-preserving data publishing · t-closeness anonymization · Genetic algorithm · Differential evolution · Adaptive strategy

✉ Yong-Feng Ge
Extended author information available on the last page of the article

1 Introduction

In today’s digital age, data plays a crucial role in driving innovation and decision-making, while the issue of privacy has become increasingly pertinent [1–5]. Privacy-Preserving Data Publishing (PPDP) has emerged as a paramount need, seeking to strike a delicate balance between sharing valuable information and safeguarding individuals’ sensitive data [6–9]. This practice entails the dissemination of datasets that have been carefully anonymized or transformed to protect the privacy of individuals while still allowing researchers and organizations to extract meaningful insights [10–13]. Due to a heightened awareness of privacy risks, stricter regulations related to data protection, and an understanding of the necessity of responsible data processing, the popularity of privacy-protected data is on the rise [14–16]. Preserving data privacy while maintaining data utility remains one of the most pressing and complex challenges in the field [2, 17–19]. Various approaches have emerged to address this challenge, including data anonymization, differential privacy, secure multiparty computation, and homomorphic encryption [20–23].
Among these techniques, data anonymization is widely adopted. It involves modifying or removing identifying information from a dataset to protect individual privacy. Techniques such as generalization, suppression, perturbation, or data synthesis are employed to mitigate the risk of re-identification while ensuring the usefulness of the analyzed data [24, 25]. However, despite its widespread adoption and effectiveness in enhancing privacy protection, data anonymization faces significant computational challenges: optimizing an anonymization scheme is NP-hard [26], which often makes it impractical to find an exact solution within a reasonable time frame.

Some studies have introduced Evolutionary Algorithms (EAs), a common approach to NP-hard problems [27], to optimize data anonymization schemes. Among EAs, Differential Evolution (DE) and Genetic Algorithm (GA) are popular choices [28]. These algorithms find wide applications in fields such as engineering, machine learning, and bioinformatics, providing flexible and powerful optimization methods [25, 29]. DE uses differential mutation and crossover to evolve populations, emphasizing exploitation. GA, on the other hand, models natural selection and genetics, focusing more on exploration. Both methods have shown effectiveness in solving complex optimization problems. This paper introduces a framework that combines GA and DE (including DE variants) for better performance and robustness in PPDP.

Previous studies have examined the use of GA and DE for data anonymization. However, both GA and DE have multiple variants, and the efficacy of these variants on this problem has yet to be thoroughly tested and evaluated. Their performance on data anonymization problems can vary significantly and needs to be explored experimentally.
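To make the notion of DE variants concrete, the following is a minimal sketch of two textbook DE mutation operators (commonly written DE/rand/1 and DE/best/1) followed by binomial crossover. It is an illustration only: the function names, the scale factor `f`, the crossover rate `cr`, and the real-valued encoding are assumptions chosen here, not the operators or the solution encoding used in the paper.

```python
import random


def de_rand_1(pop, i, f, rng):
    """DE/rand/1: base vector and difference pair are all drawn at random."""
    # Assumption: pop holds at least 4 individuals so 3 distinct partners exist.
    r1, r2, r3 = rng.sample([j for j in range(len(pop)) if j != i], 3)
    return [a + f * (b - c) for a, b, c in zip(pop[r1], pop[r2], pop[r3])]


def de_best_1(pop, i, f, rng, best):
    """DE/best/1: the base vector is the current best individual."""
    candidates = [j for j in range(len(pop)) if j != i and j != best]
    r1, r2 = rng.sample(candidates, 2)
    return [a + f * (b - c) for a, b, c in zip(pop[best], pop[r1], pop[r2])]


def binomial_crossover(target, mutant, cr, rng):
    """Copy each gene from the mutant with probability cr.

    One randomly chosen position is always taken from the mutant so the
    trial vector cannot be an untouched copy of the target.
    """
    forced = rng.randrange(len(target))
    return [
        m if (rng.random() < cr or k == forced) else t
        for k, (t, m) in enumerate(zip(target, mutant))
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical toy population: 6 real-valued vectors of length 4.
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
    # Hypothetical toy fitness (sum of squares, lower is better).
    best = min(range(len(pop)), key=lambda j: sum(x * x for x in pop[j]))

    mutant_rand = de_rand_1(pop, 0, 0.5, rng)
    mutant_best = de_best_1(pop, 0, 0.5, rng, best)
    print(binomial_crossover(pop[0], mutant_rand, 0.9, rng))
    print(binomial_crossover(pop[0], mutant_best, 0.9, rng))
```

The two operators differ only in the choice of base vector: a random base spreads trial vectors across the population, while the best individual as base pulls them toward the current best solution. This is one reason different variants can behave quite differently on the same problem.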
Exploring these variants experimentally can help identify the most effective ones for optimizing data anonymization schemes. Additionally, the performance of different algorithms on different datasets can be uneven. Therefore, more robust and effective algorithms are urgently needed.

This paper aims to enable a more practical application of EAs in data anonymization by improving the performance and robustness of the algorithm. We develop an effective adaptive strategy that dynamically combines the strengths of GA and DE mutation strategies, resulting in improved algorithm performance. Furthermore, we design a GA priority strategy and a random-based-DE priority strategy, which use GA and random-based-DE with greater probability in the early stage of evolution to enhance population diversity and explorative search. In the later stages of evolution, best-based-DE is predominantly utilized to expedite convergence of the search process and strengthen exploitative search.

This paper contributes to PPDP in the following ways.

• This paper introduces an innovative Hierarchical Adaptive Evolution Framework (HAEF), seamlessly integrating GA, DE, and variants of DE. In contrast to previous algorithms utilizing solely GA or both GA and DE with a single mutat (...truncated)