RoleSim*: Scaling axiomatic role-based similarity ranking on large graphs
World Wide Web
https://doi.org/10.1007/s11280-021-00925-z
Weiren Yu1,2 · Sima Iranmanesh2 · Aparajita Haldar2 · Maoyin Zhang1 ·
Hakan Ferhatosmanoglu2
Received: 4 February 2021 / Revised: 28 May 2021 / Accepted: 7 July 2021 /
© The Author(s) 2021
Abstract
RoleSim and SimRank are among the popular graph-theoretic similarity measures with
many applications in, e.g., web search, collaborative filtering, and sociometry. While
RoleSim addresses the automorphic (role) equivalence of pairwise similarity, which SimRank lacks, it ignores the neighboring similarity information outside the automorphically
equivalent set. Consequently, two pairs of nodes, which are not automorphically equivalent
by nature, cannot be well distinguished by RoleSim if the averages of their neighboring similarities over the automorphically equivalent set are the same. To alleviate this problem: 1)
We propose a novel similarity model, namely RoleSim*, which accurately evaluates pairwise role similarities in a more comprehensive manner. RoleSim* not only guarantees the
automorphic equivalence that SimRank lacks, but also takes into account the neighboring
similarity information outside the automorphically equivalent sets, which is overlooked by
RoleSim. 2) We prove the existence and uniqueness of the RoleSim* solution, and show its
three axiomatic properties (i.e., symmetry, boundedness, and non-increasing monotonicity).
3) We provide a concise bound for iteratively computing the RoleSim* formula, and estimate
the number of iterations required to attain a desired accuracy. 4) We induce a distance
metric based on RoleSim* similarity, and show that the RoleSim* metric fulfills the triangular inequality, which implies the sum-transitivity of its similarity scores. 5) We present a
threshold-based RoleSim* model that further reduces the computational time with a provable
accuracy guarantee. 6) We propose a single-source RoleSim* model, which scales well for
sizable graphs. 7) We also devise methods to scale RoleSim* based search by incorporating its triangular inequality property with partitioning techniques. Our experimental results
on real datasets demonstrate that RoleSim* achieves higher accuracy than its competitors
while scaling well on sizable graphs with billions of edges.
Keywords Role-based similarity · Retrieval models and ranking · Web search · Link
analysis
This article belongs to the Topical Collection: Special Issue on Large Scale Graph Data Analytics
Guest Editors: Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang
1 Introduction
RoleSim, conceived by Jin et al. [9], is a promising role-oriented graph-theoretic measure
that quantifies the similarity between two objects based on graph automorphism, with a
proliferation of real-life applications [9, 10, 25], such as link prediction (social network),
co-citation analysis (bibliometrics), motif discovery (bioinformatics), and collaborative filtering (information retrieval). It recursively follows a SimRank-like reasoning that “two
nodes are assessed as role similar if they interact with automorphically equivalent sets of
in-neighbors”. Intuitively, automorphically equivalent nodes in a graph are objects having
similar roles that can be exchanged with minimum effect on the graph structure. Similar
to the well-known SimRank measure [7], the recursive nature of RoleSim allows it to capture the multi-hop neighboring structures that are automorphically equivalent in a network.
Unlike SimRank, which measures the similarity of two nodes from the paths connecting them,
RoleSim quantifies similarities through the paths connecting their different roles. As a
result, two nodes that are disconnected from each other will not be considered as dissimilar
by RoleSim if they have similar roles. To evaluate the similarity score s(a, b) between nodes
a and b, SimRank takes the average similarity of all
the neighboring pairs of (a, b), whereas RoleSim computes s(a, b) by averaging only the similarities
over the maximum bipartite matching of all the neighboring pairs of (a, b). This subtle difference enables RoleSim to guarantee automorphic equivalence, which SimRank lacks,
in its final scoring results. As a result, RoleSim has been demonstrated to be an effective similarity measure in a wide range of real applications. We summarize two of these applications
below.
Application 1 (Similarity Search on the Web) Discovering web pages similar to a query
page is an important task in information retrieval. In a Web graph, each node represents
a web page, and an edge denotes a hyperlink from one page to another. RoleSim can be
applied to measure the similarity of two web pages, based on the intuition that “two web
pages are role-similar if they are pointed to by automorphically equivalent sets of their
in-neighboring pages”. This similarity measure produces more reliable similarity results
than the SimRank model [10].
Application 2 (Social Network De-anonymization) Social network de-anonymization is
a method to validate the strength of anonymization algorithms that protect a user’s privacy. RoleSim has been applied to de-anonymize node mappings based on the similarity
information between a crawled network and an anonymized one. Based on the observation
that “correct mappings tend to have higher similarity scores”, RoleSim iteratively evaluates
pairwise node similarities between two networks, and captures the reasoning that “a pair
of nodes between two networks is more likely to be a correct mapping if their neighbors
are correct mappings”. RoleSim has demonstrated superior performance as compared with
other existing de-anonymization algorithms [25].
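The contrast drawn earlier between SimRank's averaging over all neighboring pairs and RoleSim's averaging over a maximum bipartite matching can be made concrete with a minimal sketch of one update step. The sketch rests on several assumptions that are not stated in this section: the damping factor `c`, the decay factor `beta`, and the normalization by the larger neighbor-set size follow the commonly cited forms of the two measures; the function names and the similarity values are hypothetical; both neighbor sets are assumed non-empty; and the brute-force matching is only practical for very small sets.

```python
from itertools import permutations


def simrank_style(sim, nbrs_a, nbrs_b, c=0.8):
    """Average sim over ALL pairs (x, y), x in nbrs_a, y in nbrs_b, damped by c.

    c = 0.8 is an illustrative value, not one taken from the text.
    """
    total = sum(sim[x][y] for x in nbrs_a for y in nbrs_b)
    return c * total / (len(nbrs_a) * len(nbrs_b))


def max_matching_weight(sim, nbrs_a, nbrs_b):
    """Weight of a maximum-weight bipartite matching, found by brute force.

    Scores are assumed non-negative, so an optimal matching pairs every
    node of the smaller side with a distinct node of the larger side.
    The cost is exponential, so this is for tiny neighbor sets only.
    """
    xs, ys = list(nbrs_a), list(nbrs_b)
    if len(xs) <= len(ys):
        candidates = (zip(xs, perm) for perm in permutations(ys, len(xs)))
    else:
        candidates = (zip(perm, ys) for perm in permutations(xs, len(ys)))
    return max(sum(sim[x][y] for x, y in pairs) for pairs in candidates)


def rolesim_style(sim, nbrs_a, nbrs_b, beta=0.15):
    """Average sim over the matched pairs only, then apply the decay beta.

    beta = 0.15 is an illustrative value, not one taken from the text.
    """
    matched = max_matching_weight(sim, nbrs_a, nbrs_b)
    return (1 - beta) * matched / max(len(nbrs_a), len(nbrs_b)) + beta


# Hypothetical similarities between in-neighbors of a (x1, x2) and of b (y1, y2).
sim = {"x1": {"y1": 0.9, "y2": 0.3}, "x2": {"y1": 0.4, "y2": 0.7}}
print(simrank_style(sim, ["x1", "x2"], ["y1", "y2"]))
print(rolesim_style(sim, ["x1", "x2"], ["y1", "y2"]))
```

With these values the all-pairs average gives roughly 0.46, while the matching-based score is roughly 0.83, because only the matched pairs (x1, y1) and (x2, y2) contribute to the latter. A practical implementation would replace the brute-force search with a polynomial-time matching algorithm.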
Despite its popularity in real-world applications, RoleSim has a major limitation: in aiming
to achieve automorphic equivalence, its similarity score s(a, b) only considers the
limited information of the average similarity scores over the automorphically equivalent set
(i.e., the maximum bipartite matching) of a’s and b’s in-neighboring pairs, but neglects
the rest of the pairwise in-neighboring similarity information that is outside the automorphically equivalent set. Consequently, RoleSim does not always produce comprehensive
similarity results because two pairs of nodes, which are not automorphically equivalent by
nature, should be distinguishable from each other even though the average values of their
in-neighboring similarities over the set of the maximum bipartite matching are the same, as
illustrated in Example 1.
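A small worked calculation, separate from the paper's own example, shows how this can happen within a single update step. The similarity values are hypothetical, each pair is assumed to have two in-neighbors, and the score is written in the commonly cited RoleSim form with a decay factor \beta, which this section does not define.

```latex
% Hypothetical similarities among the in-neighbors of (a, b) and of (c, d).
\[
S_{(a,b)} = \begin{pmatrix} 0.8 & 0.7 \\ 0.7 & 0.8 \end{pmatrix},
\qquad
S_{(c,d)} = \begin{pmatrix} 0.8 & 0.1 \\ 0.1 & 0.8 \end{pmatrix}
\]
% The maximum bipartite matching is the diagonal in both cases
% (1.6 versus 1.4 for the first matrix, 1.6 versus 0.2 for the second).
\[
\frac{0.8 + 0.8}{2} = 0.8 \ \text{for both pairs}
\quad\Longrightarrow\quad
s(a,b) = s(c,d) = (1-\beta)\cdot 0.8 + \beta
\]
% The average over all four in-neighbor pairs, which the matching ignores, differs:
\[
\frac{0.8 + 0.7 + 0.7 + 0.8}{4} = 0.75
\qquad\text{versus}\qquad
\frac{0.8 + 0.1 + 0.1 + 0.8}{4} = 0.45
\]
```

The two pairs receive identical matching-based scores even though the unmatched in-neighbor pairs of (a, b) are far more similar than those of (c, d).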
Example 1 (Limitation of RoleSim) Consider the web graph G in Figure 1, where each
node denotes a web page, and each edge depicts a hyperlink from one page to another.
Using RoleSim, we evaluate pairs of similarities between nodes, as partially illustrated in
the ‘RS’ column of the right table. It is (...truncated)