Privacy-preserving data publishing: an information-driven distributed genetic algorithm

World Wide Web (2024) 27:1
https://doi.org/10.1007/s11280-024-01241-y

Yong-Feng Ge1 · Hua Wang1 · Jinli Cao2 · Yanchun Zhang3,4,5 · Xiaohong Jiang6

Received: 30 March 2023 / Revised: 29 August 2023 / Accepted: 6 December 2023 / Published online: 15 January 2024
© The Author(s) 2024

Corresponding author: Yong-Feng Ge
1 Victoria University, Melbourne, Australia
2 La Trobe University, Melbourne, Australia
3 School of Computer Science and Technology, Zhejiang Normal University, Jinhua, Zhejiang, China
4 The Department of New Networks, Peng Cheng Laboratory, Shenzhen, Guangdong, China
5 Institute for Sustainable Industries and Liveable Cities, Victoria University, Melbourne, Victoria, Australia
6 Future University Hakodate, Hakodate, Japan

This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering 2022. Guest Editors: Richard Chbeir, Helen Huang, Yannis Manolopoulos and Fabrizio Silvestri.

Abstract

The privacy-preserving data publishing (PPDP) problem has gained substantial attention from research communities, industries, and governments due to the increasing requirements for data publishing and concerns about data privacy. However, achieving a balance between preserving privacy and maintaining data quality remains a challenging task in PPDP. This paper presents an information-driven distributed genetic algorithm (ID-DGA) that aims to achieve optimal anonymization through attribute generalization and record suppression. The proposed algorithm incorporates various components, including an information-driven crossover operator, an information-driven mutation operator, an information-driven improvement operator, and a two-dimensional selection operator. Furthermore, a distributed population model is utilized to improve population diversity while reducing the running time. Experimental results confirm the superiority of ID-DGA in terms of solution accuracy, convergence speed, and the effectiveness of all the proposed components.

Keywords Evolutionary computation · Data privacy and utility · Data publishing · Distributed algorithm

1 Introduction

In the present era, data plays a critical role in the daily lives of individuals [1–6]. The dissemination and utilization of data [7–13] have created enormous opportunities for decision-making and knowledge exploration [5, 14–23]. For instance, in 2006, Netflix released a dataset comprising 100 million movie ratings to enhance its recommendation system’s performance [24]. However, despite the significant advantages of data publication, it raises concerns about data privacy preservation [25–31]. Consequently, privacy-preserving data publishing (PPDP) has emerged as a critical area of research, which aims to create an anonymous dataset that safeguards privacy while maintaining optimal data utility levels. This objective can be achieved through various privacy-preserving techniques such as data anonymization, generalization, and perturbation.

When it comes to PPDP, two main categories of approaches exist: decreasing the precision of the original dataset, and data perturbation [16]. In the first category, a well-known approach was introduced in [32] that uses a binary search on the generalization lattice to identify the anonymization solution. Kohlmayer et al. [33] presented a comprehensive framework for optimal anonymization, which enabled the Flash algorithm to find the optimal anonymization solution by searching the path in the lattice. An algorithm proposed in [34] optimized the anonymization solution in an identical generalization hierarchy, which is useful for protecting data privacy in the general Internet of Things (IoT) environment.
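To make the notion of a generalization lattice more concrete, the following minimal Python sketch enumerates the nodes of a small lattice and applies one node to a few records. It is only an illustration under stated assumptions: the two attributes (age and ZIP code), their hierarchies, the `HEIGHTS` table, and all function names are hypothetical and are not taken from [32–34] or from the proposed algorithm.

```python
from itertools import product

# Hypothetical generalization hierarchies (assumptions for illustration only).
# Level 0 keeps the original value; higher levels are coarser.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10          # ten-year band, e.g. 34 -> "30-39"
        return f"{low}-{low + 9}"
    return "*"                        # fully generalized

def generalize_zip(zip_code, level):
    # Mask the last `level` characters, capped at the code length.
    level = min(level, len(zip_code))
    return zip_code[: len(zip_code) - level] + "*" * level

# Assumed maximum generalization level per attribute.
HEIGHTS = {"age": 2, "zip": 3}

def lattice_nodes(heights):
    """Enumerate every lattice node as a tuple of per-attribute levels."""
    return list(product(*(range(h + 1) for h in heights.values())))

def apply_node(records, node):
    """Generalize each (age, zip) record to the levels given by `node`."""
    age_level, zip_level = node
    return [
        (generalize_age(age, age_level), generalize_zip(zip_code, zip_level))
        for age, zip_code in records
    ]

records = [(34, "3011"), (37, "3012"), (52, "3056")]
nodes = lattice_nodes(HEIGHTS)
print(len(nodes), "lattice nodes")
print(apply_node(records, (1, 1)))
```

Each lattice node is one candidate anonymization solution, so the search space grows as the product of the hierarchy heights. Lattice-based methods such as those cited above search this space for a node that meets the privacy requirement with the least loss of information.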
However, existing works mostly focus on single anonymization operations (such as attribute generalization or record suppression), which may not be effective from the perspective of information release. Therefore, it is worth considering combining multiple anonymization operations when optimizing the anonymization solution. Moreover, existing works mostly adopt graph search-based strategies to optimize the anonymization solution, but these approaches may lose their effectiveness when the search space of the PPDP problem becomes complex. Ge et al. [35] formulated the multi-objective data publishing problem and proposed a distributed cooperative coevolution evolutionary framework to achieve efficient optimization.

In the second category, differential privacy represents one of the typical approaches that ensure no significant difference in query results when inserting one record [36, 37]. These approaches are effective in addressing data privacy requirements in queries. However, they are not suitable for scenarios requiring data transparency and truthfulness.

The genetic algorithm (GA), as discussed in previous research [38–40], is an algorithmic approach that involves a stochastic search mechanism based on the principles of natural competition and selection [41–43]. By utilizing a population model, GA is able to maintain a diverse search direction and facilitate the production of high-quality solutions. The widespread use of GA in various optimization problems [44–47] can be attributed to its advantages in high search efficiency and robustness.

This paper presents the information-driven distributed genetic algorithm (ID-DGA). The proposed algorithm optimizes anonymization solutions using a combination of attribute generalization and record suppression techniques. ID-DGA is designed based on a distributed population model to improve population diversity.
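For readers unfamiliar with GAs, the generic population-based loop mentioned above (selection, crossover, mutation) can be sketched as follows. This is a textbook GA on a toy objective (maximizing the number of ones in a bit string), not the ID-DGA itself; the operators, parameter values, and function names are assumptions chosen for illustration.

```python
import random

def one_max(bits):
    """Toy fitness (assumption): count the ones in a bit string."""
    return sum(bits)

def tournament(population, fitness, rng, size=2):
    """Select the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)

def crossover(a, b, rng):
    """One-point crossover: prefix from parent a, suffix from parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def genetic_algorithm(fitness, length=20, pop_size=30,
                      generations=50, rate=0.05, seed=0):
    rng = random.Random(seed)  # seeded for reproducible runs
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            child = crossover(parent_a, parent_b, rng)
            offspring.append(mutate(child, rng, rate))
        population = offspring
        # Keep the best individual seen so far across generations.
        best = max(population + [best], key=fitness)
    return best

best = genetic_algorithm(one_max)
print(one_max(best), best)
```

In a PPDP setting, an individual would instead encode an anonymization solution and the fitness would reflect privacy and utility, but those details depend on the problem definition and are not assumed here.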
Besides, ID-DGA incorporates a specifically designed information-driven crossover operator that facilitates the exchange of information between anonymization solutions and promotes information release. In addition, ID-DGA employs an information-driven mutation operator to enhance population diversity and information release. Furthermore, the proposed information-driven improvement operator helps adaptively refine the anonymization solutions. Finally, a two-dimensional selection operator is introduced to enhance individual competitiveness and population quality.

The paper is structured as follows. Section 2 provides an overview of the related work in the field of PPDP. Section 3 formally defines the PPDP problem. Section 4 presents the proposed ID-DGA in detail. Sections 5 and 6 outline the experimental setup used in this study and present an analysis of the experimental results. Finally, Section 7 concludes the paper.

2 Related work

In [16], a survey regarding PPDP was presented. In this survey, the related techniques of PPDP were systematically summarized. These techniques were designed according to four attack models, i.e., record linkage, attribute linkage, table linkage, and probabilistic attack. Moreover, the (...truncated)