Personalized federated learning with model interpolation among client clusters and its application in smart home
World Wide Web
https://doi.org/10.1007/s11280-022-01132-0
Zhikai Yang1 · Yaping Liu1 · Shuo Zhang1 · Keshen Zhou2
Received: 17 October 2022 / Revised: 27 November 2022 / Accepted: 8 December 2022
© The Author(s) 2023
Abstract
The proliferation of high-performance personal devices and the widespread deployment of
machine learning (ML) applications have led to two consequences: the volume of private
data from individuals or groups has exploded over the past few years; and the traditional
central servers for training ML models have experienced communication and performance
bottlenecks in the face of massive amounts of data. However, this reality also provides the
possibility of keeping data local for ML training and fusing models on a broader scale. As
a new branch of ML application, Federated Learning (FL) aims to solve the problem of
multi-party joint learning on the premise of protecting personal data privacy. However, due
to the heterogeneity of devices, including network connection, network bandwidth, computing resources, etc., it is unrealistic to train, update and aggregate models in all devices
in parallel, while personal data is often not independent and identically distributed (Non-IID) due to multiple reasons. This reality poses a challenge to the speed and convergence of
FL. In this paper, we propose the pFedCAM algorithm, which aims to improve the robustness of the FL system to device heterogeneity and Non-IID data while achieving some
degree of personalization of the federated model. pFedCAM combines clustering and
model interpolation: it groups heterogeneous clients into clusters, runs the FedAvg
algorithm on the clusters in parallel, and then combines the resulting models into personalized federated global models by inter-cluster model interpolation. Experiments show that pFedCAM
improves accuracy by 10.3% on Fashion-MNIST and 11.3% on CIFAR-10 compared to the benchmark
in the case of Non-IID data. Finally, we applied pFedCAM in HomeProtect, a smart
home privacy protection framework we designed, and achieved good practical results in a
flame recognition task.
* Yaping Liu
* Shuo Zhang
Zhikai Yang
Keshen Zhou
1 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
2 School of Computer Science, The University of Sydney, Sydney, Australia
Keywords Federated Learning · Personalization · Clustering · Model interpolation
1 Introduction
In the past few years, a large number of devices with different computing capabilities have
been brought to market, such as smartphones and other mobile devices, Internet of Things (IoT)
devices, and smart cars. These devices have generated a large amount of data through their
extensive and long-term use. Such data are very attractive for data-driven machine learning (ML) and can contribute to the training of ML models. However, the traditional way
of centralized ML is to upload personal data to a central server for model training, which
can compromise the privacy of individuals. The EU's General Data Protection Regulation (GDPR), which took effect in 2018, sets
out the rules that companies should follow when collecting, processing and using users’
data. With the gradual implementation of privacy protection policies in various countries
and growing public awareness of privacy protection, the approach of collecting data, uploading it to servers and training on it there is no longer tenable. Google introduced
an effective distributed ML paradigm in response. In 2016, Google [1] proposed the concept of
Federated Learning (FL) and successfully applied it to Google keyboard [2], providing a
powerful tool to break the barrier of data silos. With FL, instead of uploading data, the data
owner will upload the ML models obtained using local computing resources to the server,
which will aggregate the models. Because it protects privacy-sensitive data,
FL is widely used in privacy-preserving ML applications such as financial lending and medical
diagnosis [3, 4]. Combined with existing blockchain technologies and applications,
such as data auditing [5] and energy dispatching [6], FL could find broader application,
and its privacy protection could be strengthened.
However, unlike distributed ML in a server environment, FL is built on a more complex device environment and faces some fundamental challenges. The devices of individuals or groups participating in an FL system have different computing resources, network
bandwidth and network connectivity, and their availability is not stable at all
times because of non-hardware factors such as usage habits. It is therefore very difficult to design synchronous or semi-synchronous protocols as in traditional distributed ML. For these
reasons, a common strategy is to select only a subset of clients to participate in training,
so that the FL system does not stall for long periods when some devices are
offline or network conditions are unstable. McMahan et al. [1] proposed the FedAvg algorithm,
which randomly selects a certain proportion of clients to upload model weights at the end of
each round of local training, after which the server averages the weights. FedCS [7] selects
appropriate clients by measuring client resources, and accommodates as many clients as
possible in the aggregation without entering a long wait.
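The client sampling and weight averaging described for FedAvg can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names `sample_clients` and `fedavg`, the toy client models, and the choice to weight each client by its local sample count are all illustrative.

```python
import random

def sample_clients(client_ids, fraction, rng):
    """Randomly pick a fraction of the clients (at least one) for this round."""
    k = max(1, int(fraction * len(client_ids)))
    return rng.sample(client_ids, k)

def fedavg(updates):
    """Average client weight vectors, weighting each by its sample count.

    updates: list of (weights, num_samples) pairs, where weights are
    equal-length lists of floats. Sample-count weighting is an assumption
    of this sketch; a plain mean is the special case of equal counts.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Toy round with three hypothetical clients holding 2-parameter models.
client_models = {
    "a": ([1.0, 2.0], 10),
    "b": ([3.0, 4.0], 30),
    "c": ([5.0, 6.0], 60),
}
rng = random.Random(0)
selected = sample_clients(sorted(client_models), fraction=0.67, rng=rng)
global_weights = fedavg([client_models[c] for c in selected])
```

Only the sampled clients contribute to `global_weights`, so offline or slow devices do not block the round.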
In addition to device heterogeneity, FL also faces the challenge of statistical heterogeneity. Most existing FL algorithms do not consider the statistical challenges posed by heterogeneous local datasets in a global sense. Because of device
heterogeneity and differing usage patterns, individual data may exhibit attribute skew or label
skew; that is, data from different clients may not come from the same global distribution.
A model trained on a subset of clients may therefore fail to reflect the overall data distribution, which introduces unavoidable bias into the update of the global model.
Device heterogeneity and statistical heterogeneity together give rise to the problem of not independent
and identically distributed (Non-IID) data in FL. Several studies [8–10] have shown that in
the case of Non-IID data, model convergence speed and accuracy decrease significantly,
and the number of communication rounds in FL increases as a result. Consider the FedAvg algorithm under Non-IID data: because clients train locally on Non-IID data,
the models trained by different clients diverge too much, which
slows the convergence of the global model and significantly reduces model
accuracy in the model aggregation phase.
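To make the notion of label skew concrete, the sketch below partitions a toy labeled dataset so that each client sees only a few classes. It is an illustration under assumptions: the function `label_skew_partition`, the rule that gives each client a contiguous window of classes, and the synthetic labels are hypothetical and are not the data partitioning used in the experiments of this paper.

```python
import random
from collections import Counter

def label_skew_partition(labels, num_clients, classes_per_client, rng):
    """Split sample indices so that each client holds only a few classes."""
    classes = sorted(set(labels))
    # Hypothetical rule: client k owns a contiguous (wrapping) window of classes.
    owned = [
        {classes[(k * classes_per_client + j) % len(classes)]
         for j in range(classes_per_client)}
        for k in range(num_clients)
    ]
    parts = [[] for _ in range(num_clients)]
    for c in classes:
        owners = [k for k in range(num_clients) if c in owned[k]]
        if not owners:
            continue  # class not assigned to any client in this toy setup
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        # Deal the samples of class c round-robin to the clients that own it.
        for pos, i in enumerate(idx):
            parts[owners[pos % len(owners)]].append(i)
    return parts

labels = [i % 10 for i in range(100)]  # 10 balanced classes, 100 samples
parts = label_skew_partition(labels, num_clients=5, classes_per_client=2,
                             rng=random.Random(0))
client_hist = [Counter(labels[i] for i in p) for p in parts]
global_hist = Counter(labels)
```

Each entry of `client_hist` covers only two of the ten classes in `global_hist`, so a model trained on any one client, or on a small sample of clients, is fitted to a label distribution that differs from the global one.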
The generation of device heterogeneity and Non-IID data problems on the (...truncated)