Privacy-preserving distributed clustering
Erkin et al. EURASIP Journal on Information Security 2013, 2013:4
http://jis.eurasipjournals.com/content/2013/1/4
RESEARCH
Open Access
Zekeriya Erkin1*, Thijs Veugen1,2, Tomas Toft3 and Reginald L Lagendijk1
Abstract
Clustering is a very important tool in data mining and is widely used in on-line services for medical, financial and social
environments. The main goal in clustering is to create sets of similar objects in a data set. The data set to be used for
clustering can be owned by a single entity, or in some cases, information from different databases is pooled to enrich
the data so that the merged database can improve the clustering effort. However, in either case, the content of the
database may be privacy sensitive and/or commercially valuable such that the owners may not want to share their
data with any other entity, including the service provider. Such privacy concerns lead to trust issues between entities,
which clearly damage the functioning of the service and even block cooperation between entities with similar data
sets. To enable joint efforts with private data, we propose a protocol for distributed clustering that limits information
leakage to the untrusted service provider that performs the clustering. To achieve this goal, we rely on cryptographic
techniques, in particular homomorphic encryption, and improve on the state of the art in processing encrypted
data in two ways: by exploiting the distributed structure of the system and by using data packing to reduce
computation and communication costs. While our construction can be easily adjusted to a
centralized or a distributed computing model, we rely on a set of particular users that help the service provider with
computations. Experimental results clearly indicate that the work we present is an efficient way of deploying a
privacy-preserving clustering algorithm in a distributed manner.
1 Introduction
As a powerful tool in data mining, clustering is widely
used in several domains, including finance, medicine
and social networks, to group similar objects based on
a similarity metric. In many cases, the entity that performs the clustering operation has access to the whole
database, while in some other cases, databases from different resources are merged to improve the performance
of the clustering algorithms. A number of examples can be
given as follows:
• Social networks. Users are clustered by the service
provider based on their profile data. The clustering
result can be used for creating self-help groups or
generating recommendations. In many cases, users
would not want to share their profile data with
anyone other than the people in the same
group.
• Banking. Several banks might want to merge their
customer databases for credit card fraud detection or
to classify their users based on past transactions to
identify profitable customers.
• Medical domain. Different holders of medical
databases might be willing to pool their data for
medical research, either for scientific, economic or
marketing reasons [1]. Another case can be the
Centre for Disease Control that would like to identify
trends based on data from different insurance
companies [2].
*Correspondence:
1 Department of Intelligent Systems, Delft University of Technology, Delft,
2628 CD, The Netherlands
Full list of author information is available at the end of the article
However, regardless of the application setting with one
or more data resources, in many cases, data are privacy sensitive or commercially valuable: the data owners
might not want to reveal their sensitive data to the service provider, for instance in social networks, as the data
can be processed for other purposes, transferred to other
third parties without user consent or stolen by outsiders.
In the case of multiple data resources from different entities as in banking, the data owners might not want to take
risks in sharing their customer data with other competitors. Clearly, such privacy-related concerns might result
in several drawbacks: people not joining social networks
or database owners preferring to process data on their
own.
© 2013 Erkin et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction
in any medium, provided the original work is properly cited.
In this paper, we focus on a setting with a central entity
that provides services based on clustering of multiple
users, each one having a private preference vector. Our
goal is to prevent the service provider from learning the
privacy-sensitive data of the users, without substantially
degrading the performance of the clustering algorithm.
Thus, we focus on the following:
• Privacy. To protect the privacy of users, we encrypt
the preference vectors and provide only these
encrypted vectors to the service provider, who
does not have the decryption key. However, it is
still possible for the service provider to cluster
people using our cryptographic protocol.
Throughout the protocol, the preferences,
intermediate cluster assignments and the final
results of the clustering algorithm are all encrypted
and thus unknown to the service provider or any
other person in the network. This approach, which
has proved itself useful in the field of
privacy-enhanced technologies [3], guarantees
privacy protection to the users of the social network
without disrupting the service.
• Performance. While processing encrypted data as
explained above provides privacy protection, it also
comes at a price: operations on the encrypted data
are expensive in terms of both computation and
communication. To improve efficiency, we
approach this challenge from two directions:
(1) custom-tailored cryptographic protocols that use
data packing and (2) a setting in which the service
provider creates user sets and assigns additional
responsibilities to one of the users in each set to be
able to use less expensive cryptographic
sub-protocols for the computations, avoiding
expensive computations such as the ones in [4].
Moreover, with such a construction, both centralized and
distributed clustering scenarios can be realized, as
discussed further below.
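To make the two ideas above concrete, the following is a minimal sketch of computing on encrypted preference vectors with data packing. It assumes a Paillier-style additively homomorphic scheme, which this section does not name; the tiny key size, the 16-bit slot width, the function names and the example vectors are all illustrative assumptions rather than the protocol proposed in the paper. The sketch needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

SLOT_BITS = 16  # assumed slot width; must leave headroom so slot sums do not overflow


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier keys from two known Mersenne primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is fixed to n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Enc(m) = (1 + n)^m * r^n mod n^2 for a fresh random r coprime to n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """Recover m from c using the private key (lam, mu)."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)


def pack(values):
    """Place each small non-negative value in its own SLOT_BITS-wide slot."""
    return sum(v << (i * SLOT_BITS) for i, v in enumerate(values))


def unpack(m, count):
    """Split a packed plaintext back into its first `count` slots."""
    mask = (1 << SLOT_BITS) - 1
    return [(m >> (i * SLOT_BITS)) & mask for i in range(count)]


n, sk = keygen()
alice = encrypt(n, pack([3, 7, 1, 9]))   # hypothetical preference vectors
bob = encrypt(n, pack([5, 2, 8, 4]))
total = add(n, alice, bob)               # computed without the private key
print(unpack(decrypt(n, sk, total), 4))  # [8, 9, 9, 13]
```

In this sketch, the party that combines the ciphertexts never sees a plaintext, and packing four values into one plaintext means a single encryption and a single ciphertext multiplication handle the whole vector instead of four of each.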
The service provider is defined as the entity that wants
to cluster users based on their private preference vectors.
Each user also participates in the clustering computations,
and a set of users, named helper users, are chosen randomly to perform additional tasks. As the number of user
sets increases, it becomes easier to parallelize operations
and thus achieve better performance. However, this setting with one set of users and a single helper user can
also be considered to realize clustering algori (...truncated)