A heterogeneous graph-based semi-supervised learning framework for access control decision-making
World Wide Web
(2024) 27:35
https://doi.org/10.1007/s11280-024-01275-2
Jiao Yin1,2 · Guihong Chen3,4 · Wei Hong2 · Jinli Cao1 · Hua Wang2 · Yuan Miao2
Received: 17 March 2024 / Revised: 1 May 2024 / Accepted: 9 May 2024
© The Author(s) 2024
Abstract
For modern information systems, robust access control mechanisms are vital in safeguarding
data integrity and ensuring the entire system’s security. This paper proposes a novel semi-supervised learning framework that leverages heterogeneous graph neural network-based
embedding to encapsulate both the intricate relationships within the organizational structure
and interactions between users and resources. Unlike existing methods focusing solely on
individual user and resource attributes, our approach embeds organizational and operational
interrelationships into the hidden layer node embeddings. These embeddings are learned
from a self-supervised link prediction task based on a constructed access control heterogeneous graph via a heterogeneous graph neural network. Subsequently, the learned node
embeddings, along with the original node features, serve as inputs for a supervised access control decision-making task, facilitating the construction of a machine-learning access control
model. Experimental results on the open-sourced Amazon access control dataset demonstrate that our proposed framework outperforms models using original or manually extracted
graph-based features from previous works. The preprocessed data and code are available on
GitHub, facilitating reproducibility and further research endeavors.
Keywords Access control · Semi-supervised learning · Heterogeneous graph · Node
embedding · Link prediction
1 Introduction
In the contemporary era of rapid technological progress, organizations and individuals enjoy
notable benefits in terms of enhanced convenience and productivity [1]. However, technological advancement also brings forth concerns, particularly as the volume of sensitive data
and system complexity increase, prompting growing awareness of, and emphasis on, data privacy and system security [2–4]. Access control serves as the first line of
defense, mitigating the risk of unauthorized resource access or data breaches [5–8]. In
an era where information is a valuable asset, effective access control strategies contribute
significantly to organizations’ overall security posture, fostering trust among stakeholders
and ensuring compliance with regulatory requirements [9–11].
Extended author information available on the last page of the article
Traditional role-based access control (RBAC) strategies assign resource access
permissions to users based solely on their roles, and thus suffer from limited context awareness and a lack
of granularity [12, 13]. Consequently, RBAC often grants users more data or resource access
than necessary. On the other hand, attribute-based access control (ABAC) strategies make
authorization decisions based on attributes or characteristics of users, resources, and even
system environments rather than relying solely on roles [14]. While ABAC strategies offer
more fine-grained and flexible access control policies than RBAC, they still face challenges
due to the increasing complexity in design and implementation as the scale of users and
attributes grows [15, 16].
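The contrast between the two strategies can be illustrated with a minimal Python sketch. The role table, attribute names, and policy rule below are hypothetical and chosen only for illustration; they are not taken from any system discussed in this paper.

```python
# Hypothetical role-to-permission table (illustrative only).
ROLE_PERMISSIONS = {
    "engineer": {"source_repo", "build_server"},
    "analyst": {"sales_dashboard"},
}


def rbac_decide(role, resource_name):
    """RBAC: the decision depends only on the user's role."""
    return resource_name in ROLE_PERMISSIONS.get(role, set())


def abac_decide(user, resource, environment):
    """ABAC: the decision combines user, resource, and environment attributes.

    The rule itself is an assumed example policy, not a standard one.
    """
    return (
        user["department"] == resource["owner_department"]
        and user["clearance"] >= resource["sensitivity"]
        and environment["on_corporate_network"]
    )


user = {"role": "engineer", "department": "platform", "clearance": 2}
resource = {"name": "build_server", "owner_department": "platform", "sensitivity": 3}
environment = {"on_corporate_network": True}

# The role alone grants access, while the attribute rule denies it
# because the user's clearance is below the resource's sensitivity.
rbac_result = rbac_decide(user["role"], resource["name"])
abac_result = abac_decide(user, resource, environment)
```

In this toy case RBAC grants the request while the finer-grained ABAC rule denies it, which mirrors the over-permissioning issue noted above. The sketch also hints at why ABAC policies become harder to design as the number of attributes grows: every additional attribute widens the space of rules that must be written and maintained.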
Researchers have developed machine learning (ML) and deep learning (DL)
models for various applications, including security [17, 18], data quality [19–22], health informatics [23–25], and access control decision-making, to improve efficiency and adaptability to
concept drift. Although the efficiency and adaptability of these models have been partially
verified, their explainability and reliability still need to be addressed. With recent advancements in knowledge graphs (KGs), graph theory, and graph neural
networks (GNNs) [26], more scholars are turning to graph-based methods to improve the
efficiency, performance, explainability, and reliability of access control decision-making. For
instance, Morgado et al. proposed a security model to provide
access control for NoSQL graph-oriented database management systems, preserving data
integrity and protecting against unauthorized access [27]. Shan et al. introduced
a critical provenance identification framework based on heterogeneous graph
neural networks (HGNNs) to address dynamic attribute generation and multi-source aggregation challenges arising from big data resources in dynamic access control scenarios [28].
In addition, Mingshan, Y. et al. devised an algorithm to construct an access control KG from
user and resource attributes, then extracted topological features from the constructed KG
to represent high cardinality categorical user and resource attributes for building ML-based
access control models [29].
Despite the aforementioned progress, the unavailability of data and codes hinders reproducibility and comparison with traditional ML/DL models [16, 30–32]. Furthermore, existing
literature lacks discussions on the impact of different relationship types on access control decision-making performance [33–36]. This paper aims to explore the capability of
HGNNs in integrating multi-source and multi-relationship data from large-scale information
systems comprising tens of thousands of users and resources. Specifically, we propose a semi-supervised learning framework based on an access control heterogeneous graph (ACHG).
Firstly, we employ a self-supervised node embedding strategy based on an HGNN link prediction task to learn node embeddings of users and resources. Subsequently, a supervised
ML model is trained as the classifier to make access control decisions, utilizing learned node
embeddings and original user and resource attributes.
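The data flow of this two-stage design can be sketched in plain Python. This is only an illustration under stated assumptions: the embedding values are fixed toy vectors standing in for what an HGNN would learn, the sigmoid-of-dot-product scorer is one common link prediction decoder and is assumed here rather than taken from this section, and all identifiers and attribute values are hypothetical.

```python
import math

# Toy stand-ins for node embeddings learned in the self-supervised stage
# (a real HGNN would produce these; the values here are made up).
user_embedding = {"u1": [0.5, -0.2], "u2": [-0.1, 0.9]}
resource_embedding = {"r1": [0.4, 0.1], "r2": [-0.3, 0.8]}

# Hypothetical original attributes, integer-encoded categorical values.
user_attributes = {"u1": [3, 17], "u2": [5, 17]}
resource_attributes = {"r1": [42], "r2": [7]}


def link_score(user_id, resource_id):
    """Assumed decoder: sigmoid of the dot product of the two embeddings."""
    dot = sum(
        a * b
        for a, b in zip(user_embedding[user_id], resource_embedding[resource_id])
    )
    return 1.0 / (1.0 + math.exp(-dot))


def request_features(user_id, resource_id):
    """Concatenate original attributes with learned embeddings for one request.

    The ordering of the concatenated parts is an arbitrary choice for this sketch.
    """
    return (
        user_attributes[user_id]
        + resource_attributes[resource_id]
        + user_embedding[user_id]
        + resource_embedding[resource_id]
    )


# Stage 1 signal: how plausible a user-resource link is under the embeddings.
score = link_score("u1", "r1")

# Stage 2 input: the feature vector a supervised classifier would consume.
features = request_features("u1", "r1")
```

The resulting `features` vector for each logged access request, paired with its approve/deny label, would then be passed to whichever supervised ML classifier is chosen for the decision-making stage.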
The contributions of this paper are threefold:
(1) We introduce a comprehensive HGNN-based semi-supervised learning framework for
access control decision-making. This framework utilizes a self-supervised node embedding strategy to learn node embeddings from an ACHG. Subsequently, a supervised
ML model is trained from access control log files by integrating node embeddings and
original features of users and resources as the features of access requests.
(2) We conduct empirical research to explore the impact of different relationship types
and node embedding lengths of heterogeneous graphs on access control performance.
Our investigations validate insights from existing literature regarding the influence of
heterogeneity and node embedding complexity on downstream task performance. These
findings offer valuable insights for designing and implementing future heterogeneous
graph-based applications, including access control decision- (...truncated)