Privacy-preserving data publishing: an information-driven distributed genetic algorithm
World Wide Web (2024) 27:1
https://doi.org/10.1007/s11280-024-01241-y
Yong-Feng Ge1 · Hua Wang1 · Jinli Cao2 · Yanchun Zhang3,4,5 · Xiaohong Jiang6
Received: 30 March 2023 / Revised: 29 August 2023 / Accepted: 6 December 2023 /
Published online: 15 January 2024
© The Author(s) 2024
Abstract
The privacy-preserving data publishing (PPDP) problem has gained substantial attention from
research communities, industries, and governments due to the increasing requirements for
data publishing and concerns about data privacy. However, achieving a balance between preserving privacy and maintaining data quality remains a challenging task in PPDP. This paper
presents an information-driven distributed genetic algorithm (ID-DGA) that aims to achieve
optimal anonymization through attribute generalization and record suppression. The proposed
algorithm incorporates various components, including an information-driven crossover operator, an information-driven mutation operator, an information-driven improvement operator,
and a two-dimensional selection operator. Furthermore, a distributed population model is utilized to improve population diversity while reducing the running time. Experimental results
confirm the superiority of ID-DGA in terms of solution accuracy and convergence speed, as
well as the effectiveness of all the proposed components.
Keywords Evolutionary computation · Data privacy and utility · Data publishing ·
Distributed algorithm
This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering 2022
Guest Editors: Richard Chbeir, Helen Huang, Yannis Manolopoulos and Fabrizio Silvestri.
Corresponding author: Yong-Feng Ge
1 Victoria University, Melbourne, Australia
2 La Trobe University, Melbourne, Australia
3 School of Computer Science and Technology, Zhejiang Normal University, Jinhua, Zhejiang, China
4 The Department of New Networks, Peng Cheng Laboratory, Shenzhen, Guangdong, China
5 Institute for Sustainable Industries and Liveable Cities, Victoria University, Melbourne, Victoria, Australia
6 Future University Hakodate, Hakodate, Japan
1 Introduction
In the present era, data assumes a critical role in the daily lives of individuals [1–6]. The dissemination and utilization of data [7–13] have created enormous opportunities for decision-making and knowledge exploration [5, 14–23]. For instance, in 2006, Netflix released
a dataset comprising 100 million movie ratings to enhance its recommendation system’s performance [24]. However, despite the significant advantages of data publication, concerns about
data privacy preservation have grown [25–31]. Consequently, privacy-preserving data publishing (PPDP)
has emerged as a critical area of research, which aims to create an anonymous dataset that safeguards privacy while maintaining optimal data utility levels. This objective can be achieved
through various privacy-preserving techniques such as data anonymization, generalization,
and perturbation.
When it comes to PPDP, two main categories of approaches exist: decreasing the precision
of the original dataset, and data perturbation [16]. In the first category, a well-known approach
was introduced in [32] that uses a binary search on the generalization lattice to identify the
anonymization solution. Kohlmayer et al. [33] presented a comprehensive framework for
optimal anonymization, which enabled the Flash algorithm to find the optimal anonymization solution by searching the path in the lattice. An algorithm proposed in [34] optimized
the anonymization solution in an identical generalization hierarchy, which is useful for protecting data privacy in the general Internet of Things (IoT) environment. However, existing
works mostly focus on single anonymization operations (such as attribute generalization
or record suppression), which may not be effective from the perspective of information
release. Therefore, it is worth considering combining multiple anonymization operations
when optimizing the anonymization solution. Moreover, existing works mostly adopt graph
search-based strategies to optimize the anonymization solution, but these approaches may
lose their effectiveness when the search space of the PPDP problem becomes complex.
Ge et al. [35] formulated the multi-objective data publishing problem and proposed a distributed cooperative coevolution evolutionary framework to achieve efficient optimization.
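The benefit of combining the two anonymization operations can be illustrated with a small sketch. It is not the method of any cited work: the toy records, the generalization hierarchies (`generalize_age`, `generalize_zip`), and the use of a k-anonymity-style group-size check are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code) quasi-identifiers.
RECORDS = [(23, "3011"), (27, "3012"), (25, "3011"),
           (41, "3056"), (45, "3057"), (68, "3999")]

def generalize_age(age, level):
    """Level 0 keeps the value, level 1 uses 10-year bands, level 2 hides it."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"

def generalize_zip(zip_code, level):
    """Replace the last `level` characters of the code with '*'."""
    level = min(level, len(zip_code))
    return zip_code[: len(zip_code) - level] + "*" * level

def anonymize(records, age_level, zip_level, k):
    """Generalize every record, then suppress records whose group is smaller than k."""
    generalized = [(generalize_age(a, age_level), generalize_zip(z, zip_level))
                   for a, z in records]
    sizes = Counter(generalized)
    kept = [r for r in generalized if sizes[r] >= k]
    return kept, len(generalized) - len(kept)

# Moderate generalization plus suppression of one outlier record.
kept, suppressed = anonymize(RECORDS, age_level=1, zip_level=2, k=2)
print(len(kept), "records published,", suppressed, "suppressed")
```

In this toy setting, suppressing the single outlier lets the remaining five records be published with 10-year age bands, whereas generalization alone would have to coarsen further for every record to reach groups of size two.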
In the second category, differential privacy is a typical approach; it ensures that query
results do not differ significantly when a single record is inserted [36, 37]. These
approaches are effective in addressing data privacy requirements in queries. However, they
are not suitable for scenarios requiring data transparency and truthfulness.
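To make the perturbation idea concrete, the following is a minimal sketch of a differentially private count query. It assumes the Laplace mechanism, which is one common way to realize differential privacy and is not named in the text above; the function names, the toy data, and the value of `epsilon` are illustrative only.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a count query with noise calibrated to sensitivity 1.

    Adding or removing one record changes a count by at most 1, so the
    noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)                      # fixed seed for a repeatable demo
ages = [23, 27, 25, 41, 45, 68]              # hypothetical attribute values
answer = noisy_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of ages >= 40: {answer:.2f}")
```

The answer returned to the analyst is a perturbed value rather than the true count of 3, which is why such approaches fit query answering better than settings where the published data must be truthful.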
The genetic algorithm (GA) [38–40] is a stochastic search method based on the principles of natural
competition and selection [41–43]. By maintaining a population of candidate solutions, GA keeps its search directions diverse and facilitates the production of high-quality solutions. The
widespread use of GA in various optimization problems [44–47] can be attributed to its
high search efficiency and robustness.
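The population-based search loop described above can be sketched as follows. This is a generic textbook-style GA on a toy bit-counting problem, not the ID-DGA proposed in this paper; the fitness function, tournament selection, one-point crossover, bit-flip mutation, and all parameter values are assumptions chosen for brevity.

```python
import random

def fitness(bits):
    """Toy objective: the number of ones in the bit string."""
    return sum(bits)

def tournament(population, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2, rng):
    """One-point crossover: prefix from p1, suffix from p2."""
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def run_ga(length=20, pop_size=30, generations=60, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best]                    # elitism: carry the best forward
        while len(children) < pop_size:
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, rate, rng))
        population = children
        best = max(population, key=fitness)
    return best

best = run_ga()
print("best fitness:", fitness(best), "of", len(best))
```

Because the best individual is copied unchanged into each new generation, the best fitness never decreases, while selection, crossover, and mutation keep exploring new candidates.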
This paper presents the information-driven distributed genetic algorithm (ID-DGA). The
proposed algorithm optimizes anonymization solutions using a combination of attribute
generalization and record suppression techniques. ID-DGA is designed based on a distributed population model to improve population diversity. Besides, ID-DGA incorporates a
specifically designed information-driven crossover operator that facilitates the exchange of
information between anonymization solutions and promotes information release. In addition,
ID-DGA employs an information-driven mutation operator to enhance population diversity
and information release. Furthermore, the proposed information-driven improvement operator helps adaptively refine the anonymization solutions. Finally, a two-dimensional selection
operator is introduced to enhance individual competitiveness and population quality.
The paper is structured as follows. Section 2 provides an overview of the related work in the
field of PPDP. Section 3 formally defines the PPDP problem. Section 4 presents the proposed
ID-DGA in detail. Sections 5 and 6 outline the experimental setup used in this study and
present an analysis of the experimental results. Finally, Section 7 concludes the paper.
2 Related work
In [16], a survey regarding PPDP was presented. In this survey, the related techniques of
PPDP were systematically summarized. These techniques were designed according to four
attack models, i.e., record linkage, attribute linkage, table linkage, and probabilistic attack.
Moreover, the (...truncated)