GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

World Wide Web
https://doi.org/10.1007/s11280-022-01110-6

Uno Fang¹ · Jianxin Li¹ · Naveed Akhtar² · Man Li¹ · Yan Jia³

Received: 5 August 2022 / Revised: 9 September 2022 / Accepted: 24 September 2022
© The Author(s) 2022

Abstract

Graph learning is being increasingly applied to image clustering to reveal intra-class and inter-class relationships in data. However, existing graph learning-based image clustering focuses on grouping images under a single view, which under-utilises the information provided by the data. To address this, we propose a self-supervised multi-view image clustering technique under contrastive heterogeneous graph learning. Our method computes a heterogeneous affinity graph for multi-view image data. It conducts Local Feature Propagation (LFP) for reasoning over the local neighbourhood of each node and executes an Influence-aware Feature Propagation (IFP) from each node to its influential node for learning the clustering intention. The proposed framework pioneers the use of two contrastive objectives. The first contrasts and fuses multiple views to form the overall LFP embedding, and the second maximises the mutual information between LFP and IFP representations. We conduct extensive experiments on benchmark datasets for this problem, namely COIL20, Caltech7 and CASIA-WebFace. Our evaluation shows that our method outperforms the state-of-the-art methods, including the popular techniques MVGL, MCGC and HeCo.
Keywords Multi-view clustering · Contrastive graph learning · Feature propagation · Heterogeneous graph learning

* Jianxin Li

¹ School of IT, Deakin University, 221 Burwood Highway, 3125 Burwood, VIC, Australia
² School of Physics, Mathematics and Computing, The University of Western Australia, 35 Stirling Highway, 6009 Crawley, WA, Australia
³ Department of Computer Science and Technology, Harbin Institute of Technology, 518055 Shenzhen, Guangdong, China

1 Introduction

Clustering is typically thought of as a single-view problem in computer vision, where an algorithm groups individual data samples based on their overall qualities. These samples, however, may be the outcome of various interpretations or representations of the underlying data. For instance, we can generate different sets of samples as Gabor [1], CLD [2] and HOG [3] descriptors of the images. These representations may hold complementary properties that can be leveraged for improved clustering. This fact has recently piqued the interest of the computer vision community, giving rise to the emerging topic of multi-view clustering (MVC) [4–12].

Another contemporary line of research for image clustering favours graph-based methods [13–17]. The main benefit of graphs for the clustering problem is that they naturally have the capacity to encode data structure information. For instance, methods like [13, 14, 18–22] leverage Graph Convolutional Networks (GCNs) trained on images to reason about the linkage likelihoods between a given node and its neighbours for graph completion, thereby achieving more accurate clusters. In general, graph-based methods are known to benefit from Contrastive Learning (CL) [23], which trains models using self-supervision. During training, CL maximises the agreement between the model's predictions for an original sample and for transformed versions of that sample.
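The agreement-maximising objective described above can be illustrated with a small NumPy sketch. It uses an InfoNCE-style loss, one common choice for contrastive learning; this passage does not specify the loss used in [23] or in GoMIC, so the function name `info_nce_loss`, the temperature value and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss (illustrative sketch, not the paper's objective).

    Row i of z_a and row i of z_b form a positive pair; every other row
    of z_b acts as a negative for row i of z_a.
    """
    # L2-normalise so the dot product equals cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                    # toy embeddings of 8 samples
z_transformed = z + 0.01 * rng.normal(size=z.shape)

# Correctly paired embeddings should give a lower loss than mismatched ones.
aligned = info_nce_loss(z, z_transformed)
mismatched = info_nce_loss(z, np.roll(z_transformed, 1, axis=0))
```

Minimising such a loss pulls the embeddings of a sample and its transformed version together while pushing apart the embeddings of different samples, which is the sense in which agreement is maximised.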
For graphs, the analogous Contrastive Graph Learning (CGL) paradigm aims to maximise the prediction agreement on different views of the same underlying graph [4–7, 24]. These views are created by applying random operations, e.g., adding/deleting nodes/edges and dropping features, to an original graph. In line with negative sample creation in CL, CGL treats other original graphs as negative samples. It learns node-level (intra-view) or graph-level (inter-view) representations, illustrated in Fig. 1(a), with a graph neural network and a contrastive loss function. The self-supervised CGL paradigm is naturally suited to the multi-view perspective. For instance, [25] and [26] created different graph views and then utilised node-level and graph-level representations for multi-view contrastive learning. These methods consider structural semantics as global information for learning the node-level embeddings, neglecting the fact that each node can also carry multiple features that provide additional information.

Coming back to our main problem of multi-view image clustering, existing methods generally first compute a data affinity matrix for raw features or learned representations under multiple views, and then perform clustering using the affinity matrix [27–34]. These methods concatenate multiple views to construct a denoised homogeneous graph for image clustering. We provide a simple illustration of a multi-view homogeneous graph for image data in Fig. 1(b), where views are defined using compositional properties. The graph denoising operations, however, can lead to the loss of important semantic information. Additionally, the heterogeneous properties of multi-view data may become meaningless if several views are combined into a homogeneous graph. Theoretically, by treating images as nodes in heterogeneous graphs, it is possible to use more complementary information for multi-view image clustering (Fig. 1(c)).
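The contrast drawn above, between concatenating views into one homogeneous graph and keeping one graph per view, can be made concrete with a short sketch. It assumes a plain k-nearest-neighbour affinity built from cosine similarity; this passage does not describe how the cited methods or GoMIC actually construct their affinity matrices, and the helper `knn_affinity`, the view names and the random features are hypothetical stand-ins.

```python
import numpy as np

def knn_affinity(features, k=2):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix (illustrative only)."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T                                # cosine similarity
    np.fill_diagonal(sim, -np.inf)               # exclude self-matches
    nearest = np.argsort(-sim, axis=1)[:, :k]    # k most similar nodes per row
    adj = np.zeros(sim.shape, dtype=int)
    rows = np.repeat(np.arange(len(x)), k)
    adj[rows, nearest.ravel()] = 1
    return np.maximum(adj, adj.T)                # make the graph undirected

rng = np.random.default_rng(0)
n_images = 6
# Hypothetical stand-ins for three descriptor sets of the same images.
views = {
    "view_a": rng.normal(size=(n_images, 10)),
    "view_b": rng.normal(size=(n_images, 4)),
    "view_c": rng.normal(size=(n_images, 7)),
}

# Homogeneous alternative: concatenate all views and build a single graph.
homogeneous = knn_affinity(np.hstack(list(views.values())))

# Per-view alternative: one affinity graph per view, kept separate.
per_view = {name: knn_affinity(feats) for name, feats in views.items()}
```

In the concatenated graph, an edge no longer records which view supports it, whereas the per-view dictionary keeps that information available for later stages.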
Fig. 1 Illustrations of concepts used in the text. (a) Classic Contrastive Graph Learning (CGL) learns and contrasts homogeneous graphs at node and graph level. (b) Nodes in a homogeneous graph can have multiple views. (c) Relations among multiple views constitute a heterogeneous affinity graph. v_i indicates a random original node in the dataset, and m_i^j is the j-th view of v_i. Different colors of nodes indicate different views. (d) A heterogeneous affinity graph can be broken down into multiple homogeneous affinity graphs. (e) Our Local Feature Propagation (LFP) explores local neighbourhood relations. (f) Our Influence-aware Feature Propagation (IFP) explores relationships from a target node to the influential node in each view.

Considering the above narrative, in this work, we propose an inductive Multi-view Image Clustering framework with self-supervised contrastive heterogeneous Graph co-learning (GoMIC). In GoMIC, we maintain the relationships between different views as a heterogeneous affinity graph, while preserving the uniqueness and independence of each view. Our heterogeneous graph consists of several homogeneous affinity graphs (Fig. 1(d)). By creating the heterogeneous affinity graph, each node can readily access the local neighbourhood data from each view. To understand the propagation of node features, we created two encoding schemes. In the first, we propagate features from a node to its neighbourhood in its own and other views through several hops (Fig. 1(e)). The second strategy is influence-aware pr (...truncated)