GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning
World Wide Web
https://doi.org/10.1007/s11280-022-01110-6
Uno Fang1 · Jianxin Li1 · Naveed Akhtar2 · Man Li1 · Yan Jia3
Received: 5 August 2022 / Revised: 9 September 2022 / Accepted: 24 September 2022
© The Author(s) 2022
Abstract
Graph learning is being increasingly applied to image clustering to reveal intra-class and
inter-class relationships in data. However, existing graph learning-based image clustering
focuses on grouping images under a single view, which under-utilises the information provided by the data. To address that, we propose a self-supervised multi-view image clustering technique under contrastive heterogeneous graph learning. Our method computes a heterogeneous affinity graph for multi-view image data. It conducts Local Feature Propagation
(LFP) for reasoning over the local neighbourhood of each node and executes an Influence-aware Feature Propagation (IFP) from each node to its influential node for learning the
clustering intention. The proposed framework introduces two contrastive objectives. The first contrasts and fuses multiple views for the overall LFP embedding,
and the second maximises the mutual information between LFP and IFP representations.
We conduct extensive experiments on the benchmark datasets for the problem, i.e. COIL20, Caltech7 and CASIA-WebFace. Our evaluation shows that our method outperforms the
state-of-the-art methods, including the popular techniques MVGL, MCGC and HeCo.
Keywords Multi-view clustering · Contrastive graph learning · Feature propagation ·
Heterogeneous graph learning
* Jianxin Li
Uno Fang
Naveed Akhtar
Man Li
Yan Jia
1 School of IT, Deakin University, 221 Burwood Highway, 3125 Burwood, VIC, Australia
2 School of Physics, Mathematics and Computing, The University of Western Australia, 35 Stirling Highway, 6009 Crawley, WA, Australia
3 Department of Computer Science and Technology, Harbin Institute of Technology, 518055 Shenzhen, Guangdong, China
1 Introduction
Clustering is typically thought of as a single view problem in computer vision, where an
algorithm groups individual data samples based on their overall qualities. These samples,
however, may be the outcome of various interpretations or representations of the underlying data. For instance, we can generate different sets of samples as Gabor [1], CLD [2] and
HOG [3] descriptors of the images. These representations may hold complementary properties that can be leveraged for improved clustering. This fact has recently piqued the interest
of the computer vision community, resulting in the emerging topic of multi-view clustering
(MVC) [4–12].
Another contemporary line of research for image clustering favours graph-based methods [13–17]. The main benefit of graphs for the clustering problem is that they naturally
have the capacity to encode data structure information. For instance, methods like [13, 14,
18–22] leverage trained Graph Convolutional Networks (GCNs) for images to reason about
the linkage likelihoods between a given node and its neighbours for graph completion,
thereby achieving more accurate clusters.
In general, graph-based methods are known to benefit from Contrastive Learning
(CL) [23], which trains models through self-supervision. During training, it maximises the
agreement between the model's predictions for differently transformed versions of the same original sample. For
graphs, the analogous Contrastive Graph Learning (CGL) paradigm aims to maximise the
prediction agreement on different views of the same underlying graph [4–7, 24]. These
views are created by applying random operations, e.g., adding/deleting nodes/edges and
dropping features, to an original graph. In line with the negative sample creation in CL,
CGL treats other original graphs as the negative samples. It learns node-level (intra-view) or graph-level (inter-view) representations (illustrated in Fig. 1(a)) with a graph
neural network and a contrastive loss function.
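The contrastive objective described above can be sketched with a generic InfoNCE-style loss. This is a minimal illustration, not the loss used in GoMIC: the function name `info_nce`, the temperature value, and the use of small Gaussian noise as a stand-in for graph augmentation are all assumptions made for the example. Row i of each embedding matrix is treated as the same node seen in two views (the positive pair), and every other row serves as a negative.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.5):
    """Mean InfoNCE loss; row i of z_a and row i of z_b form the positive pair."""
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positives; all other columns act as negatives.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                  # hypothetical node embeddings
# Assumption: small noise stands in for a second, augmented view of each node.
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
# Unrelated embeddings: no view agrees with its supposed positive.
shuffled = info_nce(z, rng.normal(size=z.shape))
```

The loss is lower when the two views of each node agree (`aligned`) than when they are unrelated (`shuffled`), which is the behaviour a contrastive objective rewards during training.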
The self-supervised CGL paradigm naturally suits the multi-view perspective. For
instance, [25] and [26] created different graph views and then utilised node-level and
graph-level representations for multi-view contrastive learning. These methods consider
structural semantics as global information for learning the node-level embeddings, neglecting the fact that each node can also have various features to provide more information.
Returning to our main problem of multi-view image clustering, existing methods generally first compute a data affinity matrix for raw features or learned representations under
multiple views, and then perform clustering using the affinity matrix [27–34]. These methods concatenate multiple views to construct a denoised homogeneous graph for image clustering. We provide a simple illustration of a multi-view homogeneous graph for image data
in Fig. 1(b), where views are defined using compositional properties. The graph denoising
operations, however, can lead to the loss of important semantic information. Additionally,
the heterogeneous properties of multi-view data may become meaningless if several views
are combined into a homogeneous graph. In principle, by treating images as nodes in heterogeneous graphs, it is possible to use more complementary information for multi-view
image clustering (Fig. 1(c)).
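To make the contrast between the two graph constructions concrete, the following sketch builds one k-nearest-neighbour affinity graph per view and, for comparison, a single graph from the concatenated views. It is an illustration under stated assumptions rather than the construction used in GoMIC: the helper `knn_affinity`, the cosine similarity measure, the value of `k`, the view names and the random placeholder features are all hypothetical.

```python
import numpy as np

def knn_affinity(x, k):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix from cosine similarity."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)              # exclude self-matches
    nbrs = np.argsort(-sim, axis=1)[:, :k]      # k most similar nodes per row
    adj = np.zeros((len(x), len(x)))
    rows = np.repeat(np.arange(len(x)), k)
    adj[rows, nbrs.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # make the graph undirected

rng = np.random.default_rng(0)
n = 10
# Placeholder features for two views of the same n images (random data).
views = {"view_a": rng.normal(size=(n, 32)), "view_b": rng.normal(size=(n, 12))}

# One affinity graph per view: each view keeps its own neighbourhood structure.
per_view = {name: knn_affinity(x, k=3) for name, x in views.items()}

# Single graph from concatenated views: view-specific structure is merged away.
merged = knn_affinity(np.hstack(list(views.values())), k=3)
```

Keeping `per_view` as a collection of graphs preserves which neighbours come from which view, whereas `merged` only retains one neighbourhood per image.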
Motivated by the above, we propose an inductive Multi-view
Image Clustering framework with self-supervised contrastive heterogeneous Graph co-learning (GoMIC). In GoMIC, we maintain the relationships between different views as
a heterogeneous affinity graph, while preserving the uniqueness and independence of each
view. Our heterogeneous graph consists of several homogeneous affinity graphs (Fig. 1(d)).
Each node can readily get the local neighbourhood data from each view by creating the
Fig. 1 Illustrations of concepts used in the text. (a) Classic Contrastive Graph Learning (CGL) learns and
contrasts homogeneous graphs at node and graph level. (b) Nodes in a homogeneous graph can have multiple views. (c) Relations among multiple views constitute a heterogeneous affinity graph. $v_i$ indicates a random
original node in the dataset, and $m_i^j$ is the j-th view of $v_i$. Different colors of nodes indicate different
views. (d) A heterogeneous affinity graph can be broken down into multiple homogeneous affinity graphs.
(e) Our Local Feature Propagation (LFP) explores local neighbourhood relations. (f) Our Influence-aware
Feature Propagation (IFP) explores relationships from a target node to the influential node in each view
heterogeneous affinity graph. To understand the propagation of node features, we created
two encoding schemes. In the first, we propagate features from a node to its neighbourhood
in its own and other views through several hops - Fig. 1(e). The second strategy is influence-aware pr (...truncated)