A heterogeneous graph-based semi-supervised learning framework for access control decision-making (pdf)

Article PDF cannot be displayed. You can download it here:

https://link.springer.com/content/pdf/10.1007/s11280-024-01275-2.pdf

A heterogeneous graph-based semi-supervised learning framework for access control decision-making

World Wide Web (2024) 27:35 https://doi.org/10.1007/s11280-024-01275-2 A heterogeneous graph-based semi-supervised learning framework for access control decision-making Jiao Yin1,2 · Guihong Chen3,4 · Wei Hong2 · Jinli Cao1 · Hua Wang2 · Yuan Miao2 Received: 17 March 2024 / Revised: 1 May 2024 / Accepted: 9 May 2024 © The Author(s) 2024 Abstract For modern information systems, robust access control mechanisms are vital in safeguarding data integrity and ensuring the entire system’s security. This paper proposes a novel semisupervised learning framework that leverages heterogeneous graph neural network-based embedding to encapsulate both the intricate relationships within the organizational structure and interactions between users and resources. Unlike existing methods focusing solely on individual user and resource attributes, our approach embeds organizational and operational interrelationships into the hidden layer node embeddings. These embeddings are learned from a self-supervised link prediction task based on a constructed access control heterogeneous graph via a heterogeneous graph neural network. Subsequently, the learned node embeddings, along with the original node features, serve as inputs for a supervised access control decision-making task, facilitating the construction of a machine-learning access control model. Experimental results on the open-sourced Amazon access control dataset demonstrate that our proposed framework outperforms models using original or manually extracted graph-based features from previous works. The prepossessed data and codes are available on GitHub,facilitating reproducibility and further research endeavors. Keywords Access control · Semi-supervised learning · Heterogeneous graph · Node embedding · Link prediction 1 Introduction In the contemporary era of rapid technological progress, organizations and individuals enjoy notable benefits in terms of enhanced convenience and productivity [1]. However, technological advancement also brings forth concerns, particularly as the volume of sensitive data and system complexity increase, prompting a growing awareness and emphasis on data privacy issues and system security protection [2–4]. Access control serves as the first line of safeguard, mitigating the risk of unauthorized resource access or data breaches [5–8]. In an era where information is a valuable asset, effective access control strategies contribute significantly to organizations’ overall security posture, fostering trust among stakeholders and ensuring compliance with regulatory requirements [9–11]. Extended author information available on the last page of the article 0123456789().: V,-vol 123 35 Page 2 of 24 World Wide Web (2024) 27:35 Traditional role-based access control (RBAC) strategies solely assign resource access permissions to users based on their roles, which suffer from limited context awareness and lack of granularity [12, 13]. Consequently, RBAC often grants users more data or resource access than necessary. On the other hand, attribute-based access control (ABAC) strategies make authorization decisions based on attributes or characteristics of users, resources, and even system environments rather than relying solely on roles [14]. While ABAC strategies offer more fine-grained and flexible access control policies than RBAC, they still face challenges due to the increasing complexity in design and implementation as the scale of users and attributes grows [15, 16]. Some scholars have attempted to develop machine learning (ML) and deep learning (DL) models for various applications including security [17, 18], data quality [19–22], health informatics [23–25] and access control decision-making to enhance efficiency and adaptability to concept drifts. While partial verification of their efficiency and adaptive capabilities has been achieved, addressing the explainability and reliability of ML/DL methods remains essential. With recent advancements in knowledge graphs (KGs), graph theory, and graph neural networks (GNNs) [26], more scholars are turning to graph-based methods to improve the efficiency, performance, explainability, and reliability of access control decision-making. For instance, Morgado, C., Baioco, GB., Basso, T., et al. proposed a security model to provide access control for NoSQL graph-oriented database management systems, preserving data integrity and protecting against unauthorized access [27]. Shan, D., Du, X., Wang, W., et al. introduced a critical provenance identification framework based on heterogeneous graph neural networks (HGNNs) to address dynamic attribute generation and multi-source aggregation challenges arising from big data resources in dynamic access control scenarios [28]. Specifically, Mingshan, Y. et al. devised an algorithm to construct an access control KG from user and resource attributes, then extracted topological features from the constructed KG to represent high cardinality categorical user and resource attributes for building ML-based access control models [29]. Despite the aforementioned progress, the unavailability of data and codes hinders reproducibility and comparison with traditional ML/DL models [16, 30–32]. Furthermore, existing literature lacks discussions on the impact of different relationship types on access control decision-making performance [33–36]. This paper aims to explore the capability of HGNNs in integrating multi-source and multi-relationship data from large-scale information systems comprising tens of thousands of users and resources. Specifically, we propose a semisupervised learning framework based on an access control heterogeneous graph (ACHG). Firstly, we employ a self-supervised node embedding strategy based on an HGNN link prediction task to learn node embeddings of users and resources. Subsequently, a supervised ML model is trained as the classifier to make access control decisions, utilizing learned node embeddings and original user and resource attributes. The contributions of this paper are threefold: (1) We introduce a comprehensive HGNN-based semi-supervised learning framework for access control decision-making. This framework utilizes a self-supervised node embedding strategy to learn node embeddings from an ACHG. Subsequently, a supervised ML model is trained from access control log files by integrating node embeddings and original features of users and resources as the features of access requests. (2) We conduct empirical research to explore the impact of different relationship types and node embedding lengths of heterogeneous graphs on access control performance. Our investigations validate insights from existing literature regarding the influence of heterogeneity and node embedding complexity on downstream task performance. These 123 World Wide Web (2024) 27:35 Page 3 of 24 35 findings offer valuable insights for designing and implementing future heterogeneous graph-based applications, including access control decision- (...truncated)