RoleSim*: Scaling axiomatic role-based similarity ranking on large graphs

World Wide Web
https://doi.org/10.1007/s11280-021-00925-z

Weiren Yu1,2 · Sima Iranmanesh2 · Aparajita Haldar2 · Maoyin Zhang1 · Hakan Ferhatosmanoglu2

Received: 4 February 2021 / Revised: 28 May 2021 / Accepted: 7 July 2021 / © The Author(s) 2021

Abstract
RoleSim and SimRank are among the popular graph-theoretic similarity measures with many applications in, e.g., web search, collaborative filtering, and sociometry. While RoleSim addresses the automorphic (role) equivalence of pairwise similarity, which SimRank lacks, it ignores the neighboring similarity information outside the automorphically equivalent set. Consequently, two pairs of nodes that are not automorphically equivalent by nature cannot be well distinguished by RoleSim if the averages of their neighboring similarities over the automorphically equivalent set are the same. To alleviate this problem: 1) We propose a novel similarity model, namely RoleSim*, which accurately evaluates pairwise role similarities in a more comprehensive manner. RoleSim* not only guarantees the automorphic equivalence that SimRank lacks, but also takes into account the neighboring similarity information outside the automorphically equivalent sets that is overlooked by RoleSim. 2) We prove the existence and uniqueness of the RoleSim* solution, and show its three axiomatic properties (i.e., symmetry, boundedness, and non-increasing monotonicity). 3) We provide a concise bound for iteratively computing the RoleSim* formula, and estimate the number of iterations required to attain a desired accuracy. 4) We induce a distance metric based on RoleSim* similarity, and show that the RoleSim* metric fulfills the triangular inequality, which implies the sum-transitivity of its similarity scores. 5) We present a threshold-based RoleSim* model that further reduces the computational time with a provable accuracy guarantee. 6) We propose a single-source RoleSim* model, which scales well for sizable graphs. 7) We also devise methods to scale RoleSim*-based search by combining its triangular inequality property with partitioning techniques. Our experimental results on real datasets demonstrate that RoleSim* achieves higher accuracy than its competitors while scaling well on sizable graphs with billions of edges.

Keywords Role-based similarity · Retrieval models and ranking · Web search · Link analysis

This article belongs to the Topical Collection: Special Issue on Large Scale Graph Data Analytics
Guest Editors: Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang

Weiren Yu
Extended author information available on the last page of the article.

1 Introduction

RoleSim, conceived by Jin et al. [9], is a promising role-oriented graph-theoretic measure that quantifies the similarity between two objects based on graph automorphism, with a proliferation of real-life applications [9, 10, 25], such as link prediction (social network), co-citation analysis (bibliometrics), motif discovery (bioinformatics), and collaborative filtering (information retrieval). It recursively follows a SimRank-like reasoning that “two nodes are assessed as role similar if they interact with automorphically equivalent sets of in-neighbors”. Intuitively, automorphically equivalent nodes in a graph are objects having similar roles that can be exchanged with minimum effect on the graph structure. Similar to the well-known SimRank measure [7], the recursive nature of RoleSim allows it to capture the multi-hop neighboring structures that are automorphically equivalent in a network. Unlike SimRank, which measures the similarity of two nodes from the paths connecting them, RoleSim quantifies similarities through the paths connecting their different roles. As a result, two nodes that are disconnected from each other will not be considered dissimilar by RoleSim if they have similar roles.
To evaluate the similarity score s(a, b) between nodes a and b, SimRank takes the average similarity of all the neighboring pairs of (a, b). RoleSim, in contrast, computes s(a, b) by averaging only the similarities over the maximum bipartite matching of all the neighboring pairs of (a, b). This subtle difference enables RoleSim to guarantee automorphic equivalence, which SimRank lacks, in the final scoring results. Therefore, RoleSim has been demonstrated as an effective similarity measure in a wide range of real applications. We summarize two of these applications below.

Application 1 (Similarity Search on the Web) Discovering web pages similar to a query page is an important task in information retrieval. In a Web graph, each node represents a web page, and an edge denotes a hyperlink from one page to another. RoleSim can be applied to measure the similarity of two web pages, based on the intuition that “two web pages are role-similar if they are pointed to by automorphically equivalent sets of their in-neighboring pages”. This similarity measure produces more reliable similarity results than the SimRank model [10].

Application 2 (Social Network De-anonymization) Social network de-anonymization is a method to validate the strength of anonymization algorithms that protect a user’s privacy. RoleSim has been applied to de-anonymize node mappings based on the similarity information between a crawled network and an anonymized one. Based on the observation that “correct mappings tend to have higher similarity scores”, RoleSim iteratively evaluates pairwise node similarities between two networks, and captures the reasoning that “a pair of nodes between two networks is more likely to be a correct mapping if their neighbors are correct mappings”. RoleSim has demonstrated superior performance compared with other existing de-anonymization algorithms [25].
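The contrast drawn above between SimRank's averaging over all neighboring pairs and RoleSim's averaging over a maximum bipartite matching can be made concrete with a small sketch. It is illustrative only and is not the authors' implementation. The function names, the toy similarity table `sim`, and the node labels are hypothetical. Decay factors are omitted. Dividing the matching weight by the larger in-neighbor set size is an assumption following the usual RoleSim formulation, and returning 0.0 for an empty in-neighbor set is a convention chosen here. The matching is found by brute force, which is only feasible for tiny neighbor sets.

```python
from itertools import permutations


def simrank_style_average(sim, in_a, in_b):
    """Average the similarities of ALL in-neighbor pairs (x, y)."""
    if not in_a or not in_b:
        return 0.0  # convention chosen for this sketch
    total = sum(sim[x, y] for x in in_a for y in in_b)
    return total / (len(in_a) * len(in_b))


def rolesim_style_average(sim, in_a, in_b):
    """Average only over a maximum-weight matching of in-neighbor pairs.

    Brute force: every node of the smaller set is matched to a distinct
    node of the larger set, and the heaviest such matching is kept.
    """
    if not in_a or not in_b:
        return 0.0  # convention chosen for this sketch
    if len(in_a) <= len(in_b):
        matchings = (list(zip(in_a, p)) for p in permutations(in_b, len(in_a)))
    else:
        matchings = (list(zip(p, in_b)) for p in permutations(in_a, len(in_b)))
    best = max(sum(sim[x, y] for x, y in m) for m in matchings)
    # Assumed normalisation: the size of the larger in-neighbor set.
    return best / max(len(in_a), len(in_b))


# Hypothetical previous-iteration similarities between the in-neighbors
# of a (x1, x2) and the in-neighbors of b (y1, y2).
sim = {
    ("x1", "y1"): 0.9, ("x1", "y2"): 0.1,
    ("x2", "y1"): 0.2, ("x2", "y2"): 0.8,
}
print(simrank_style_average(sim, ["x1", "x2"], ["y1", "y2"]))
print(rolesim_style_average(sim, ["x1", "x2"], ["y1", "y2"]))
```

On this toy input the all-pairs average is about 0.5, while the matching-based average is about 0.85, because only the matched pairs (x1, y1) and (x2, y2) contribute. The pairs (x1, y2) and (x2, y1) play no part in the matching-based score.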
Despite its popularity in real-world applications, RoleSim has a major limitation: in aiming to achieve automorphic equivalence, its similarity score s(a, b) considers only the average similarity score over the automorphically equivalent set (i.e., the maximum bipartite matching) of a’s and b’s in-neighboring pairs, and neglects the rest of the pairwise in-neighboring similarity information that lies outside the automorphically equivalent set. Consequently, RoleSim does not always produce comprehensive similarity results: two pairs of nodes that are not automorphically equivalent by nature should be distinguishable from each other, yet RoleSim cannot separate them when the average values of their in-neighboring similarities over the maximum bipartite matching are the same, as illustrated in Example 1.

Example 1 (Limitation of RoleSim) Consider the web graph G in Figure 1, where each node denotes a web page, and each edge depicts a hyperlink from one page to another. Using RoleSim, we evaluate the similarities between pairs of nodes, as partially illustrated in the ‘RS’ column of the right table. It is (...truncated)