Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
RESEARCH ARTICLE
Scott Emmons1*, Stephen Kobourov2, Mike Gallant1, Katy Börner1,3
1 School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America,
2 Department of Computer Science, University of Arizona, Tucson, Arizona, United States of America,
3 Indiana University Network Science Institute, Indiana University, Bloomington, Indiana, United States of
America
Abstract
Overview
Notions of community quality underlie the clustering of networks. While studies surrounding
network clustering are increasingly common, a precise understanding of the relationship
between different cluster quality metrics is lacking. In this paper, we examine the relationship
between stand-alone cluster quality metrics and information recovery metrics through
a rigorous analysis of four widely used network clustering algorithms: Louvain, Infomap,
label propagation, and smart local moving. We consider the stand-alone quality metrics of
modularity, conductance, and coverage, and we consider the information recovery metrics
of adjusted Rand score, normalized mutual information, and a variant of normalized mutual
information used in previous work. Our study includes both synthetic graphs and empirical
data sets of sizes varying from 1,000 to 1,000,000 nodes.
Cluster Quality Metrics
We find significant differences among the results of the different cluster quality metrics. For
example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out
of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone
quality metric that best indicates performance on the information recovery metrics. Additionally,
our study shows that the variant of normalized mutual information used in previous work
cannot be assumed to differ only slightly from traditional normalized mutual information.
Network Clustering Algorithms
Smart local moving is the overall best performing algorithm in our study, but discrepancies
between cluster evaluation metrics prevent us from declaring it an absolutely superior
algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our
study, contradicting the results of previous work in which Infomap was superior to Louvain.
We find that although label propagation performs poorly when clusters are less clearly
defined, it scales efficiently and accurately to large graphs with well-defined clusters.
OPEN ACCESS
Citation: Emmons S, Kobourov S, Gallant M, Börner K (2016) Analysis of Network Clustering
Algorithms and Cluster Quality Metrics at Scale. PLoS ONE 11(7): e0159161.
doi:10.1371/journal.pone.0159161
Editor: Constantine Dovrolis, Georgia Institute of Technology, UNITED STATES
Received: February 10, 2016
Accepted: June 28, 2016
Published: July 8, 2016
Copyright: © 2016 Emmons et al. This is an open access article distributed under the terms
of the Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: The code we developed to implement this study, including all
scripts, statistics, and analyses, is available and documented at
http://cns.iu.edu/2016ClusteringComp and on GitHub at
https://github.com/scottemmons/STHClusterAnalysis.
Funding: This research was partially funded by the National Institutes of Health. This
research was supported in part by Lilly Endowment, Inc., through its support for the Indiana
University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The
Indiana METACyt Initiative at IU is also supported in part by Lilly Endowment, Inc. The funders
had no role in study design, data collection and analysis, decision to publish, or preparation
of the manuscript.
Competing Interests: Lilly Endowment, Inc. is a commercial funder of this work through its
support for the Indiana University Pervasive Technology Institute. This does not alter the
authors’ adherence to PLOS ONE policies on sharing data and materials.
PLOS ONE | DOI:10.1371/journal.pone.0159161 July 8, 2016
Introduction
Clustering is the task of assigning a set of objects to groups (also called classes or categories)
so that the objects in the same cluster are more similar (according to a predefined property)
to each other than to those in other clusters. This is a fundamental problem in many fields,
including statistics, data analysis, bioinformatics, and image processing. Some of the classical
clustering methods date back to the early 20th century and they cover a wide spectrum:
connectivity clustering, centroid clustering, density clustering, etc. The result of clustering may
be a hierarchy or partition with disjoint or overlapping clusters. Cluster attributes such as
count (number of clusters), average size, minimum size, maximum size, etc., are often of
interest.
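As a small illustration of the cluster attributes just listed, the following Python sketch summarizes a partition given as one label per object. The function name `cluster_size_summary` and the toy labels are hypothetical and are not taken from the study's code.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Summarize a partition given as one cluster label per object."""
    sizes = Counter(labels).values()  # number of objects in each cluster
    return {
        "count": len(sizes),
        "average_size": sum(sizes) / len(sizes),
        "minimum_size": min(sizes),
        "maximum_size": max(sizes),
    }


# Toy partition of six objects into three clusters (illustrative data only).
toy_labels = ["a", "a", "a", "b", "b", "c"]
toy_summary = cluster_size_summary(toy_labels)
print(toy_summary)
```

The sketch assumes a partition with disjoint clusters; hierarchical or overlapping clusterings would need a different representation.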
To evaluate and compare network clustering algorithms, the literature has given much
attention to algorithms’ performance on “benchmark graphs” [1–5]. Benchmark graphs are
synthetic graphs into which a known clustering can be embedded by construction. The
embedded clustering is treated as a “gold standard,” and clustering algorithms are judged on
their ability to recover the information in the embedded clustering. In such synthetic graphs
there is a clear definition of rank: the best clustering algorithm is the one that recovers the
most information, and the worst clustering algorithm is the one that recovers the least
information.
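To make the idea of a clustering "embedded by construction" concrete, here is a minimal Python sketch of a planted-partition graph. It is an assumption chosen for brevity and is not necessarily the benchmark generator used in this study; the function name and the parameters `p_in` and `p_out` are hypothetical.

```python
import random
from itertools import combinations


def planted_partition(num_clusters, cluster_size, p_in, p_out, seed=0):
    """Return (edges, labels) for a graph with an embedded clustering.

    Node i belongs to cluster i // cluster_size. Each pair of nodes is
    joined with probability p_in if both share a cluster, else p_out.
    """
    rng = random.Random(seed)  # seeded so the graph is reproducible
    n = num_clusters * cluster_size
    labels = [node // cluster_size for node in range(n)]
    edges = []
    for u, v in combinations(range(n), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, labels


# The returned labels play the role of the "gold standard" clustering.
bench_edges, bench_gold = planted_partition(
    num_clusters=4, cluster_size=25, p_in=0.5, p_out=0.02
)
intra = sum(1 for u, v in bench_edges if bench_gold[u] == bench_gold[v])
print(len(bench_gold), "nodes,", len(bench_edges), "edges,", intra, "inside clusters")
```

A clustering algorithm would then be run on `bench_edges` alone, and its output compared against `bench_gold` with an information recovery metric.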
However, judging clustering algorithms based solely on their performance on benchmark
graph tests assumes that the embedded clustering truly is a “gold standard” that captures the
entirety of an algorithm’s performance. It ignores other properties of clustering, such as modularity, conductance, and coverage, to which the literature has given much attention in order to
decide the best clustering algorithm to use in practice for a particular application [6–8].
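Unlike information recovery, these stand-alone properties can be computed from the graph and a clustering alone, with no gold standard. The Python sketch below uses common textbook definitions for an undirected, unweighted graph; reporting conductance as the mean over clusters is an assumption, since the exact aggregate used in the study is not given in this section. The function name `standalone_metrics` is hypothetical.

```python
from collections import defaultdict


def standalone_metrics(edges, labels):
    """Return (coverage, modularity, mean conductance) of a partition.

    edges: (u, v) pairs of an undirected, unweighted simple graph.
    labels: list or dict mapping each node to its cluster label.
    Clusters that touch no edge are ignored in this sketch.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    volume = defaultdict(int)    # sum of node degrees in the cluster
    cut = defaultdict(int)       # edges with exactly one endpoint in the cluster
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            internal[cu] += 1
        else:
            cut[cu] += 1
            cut[cv] += 1
    # Coverage: fraction of edges that fall inside clusters.
    coverage = sum(internal.values()) / m
    # Modularity: intra-cluster edge fraction minus its expected value.
    modularity = sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)
    # Conductance of a cluster: cut size over the smaller side's volume.
    conductances = []
    for c in volume:
        smaller_side = min(volume[c], 2 * m - volume[c])
        conductances.append(cut[c] / smaller_side if smaller_side else 0.0)
    return coverage, modularity, sum(conductances) / len(conductances)


# Two triangles joined by a single edge, clustered one triangle each.
tri_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
tri_labels = [0, 0, 0, 1, 1, 1]
tri_cov, tri_mod, tri_cond = standalone_metrics(tri_edges, tri_labels)
print(tri_cov, tri_mod, tri_cond)
```

Under these definitions, higher coverage and modularity and lower conductance indicate better-separated clusters.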
Furthermore, previous papers that have evaluated clustering algorithms on benchmark
graphs have used a single metric, such as normalized mutual information, to measure the
amount of “gold standard” information recovered by each algorithm [3–5]. We have seen no
studies that evaluate how the choice of information recovery metric affects the results of benchmark graph cluster analysis.
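The following Python sketch illustrates one way the choice of metric can matter. It implements the adjusted Rand score and a normalized mutual information with arithmetic-mean normalization; that normalization is an assumption, and the variant of normalized mutual information used in previous work is not reproduced here. The toy labelings are illustrative, not data from the study.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """Mutual information divided by the mean of the two entropies."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    mutual = sum(
        (n_ij / n) * log(n * n_ij / (count_a[i] * count_b[j]))
        for (i, j), n_ij in Counter(zip(a, b)).items()
    )
    total_entropy = entropy(a) + entropy(b)
    return 2 * mutual / total_entropy if total_entropy > 0 else 1.0


def adjusted_rand_score(a, b):
    """Rand index corrected for chance agreement between two labelings."""
    n = len(a)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Gold standard: two clusters of four nodes. Candidate: every node alone.
gold = [0, 0, 0, 0, 1, 1, 1, 1]
singletons = list(range(8))
nmi_value = normalized_mutual_information(gold, singletons)
ari_value = adjusted_rand_score(gold, singletons)
print(nmi_value, ari_value)
```

On this toy input the all-singletons clustering scores about 0.5 on this normalized mutual information but 0 on the adjusted Rand score, so the two metrics can rank the same result quite differently.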
In this paper, we experimentally evaluate the robustness of clustering algorithms by their
performance on small (1,000 nodes, 12,400 undirected edges) to large-scale (1M nodes, 13.3M
undirected edges) benchmark graphs. We cluster these graphs using a variety of clustering
algorithms and simultaneously measure both the information recovery of each clustering and
the quality of each clustering with various metrics. Then, we test (...truncated)