ComStreamClust: a Communicative Multi-Agent Approach to Text Clustering in Streaming Data
Annals of Data Science
https://doi.org/10.1007/s40745-022-00426-4
Ali Najafi1 · Araz Gholipour-Shilabin2 · Rahim Dehkharghani3 ·
Ali Mohammadpur-Fard4 · Meysam Asgari-Chenaghlu2
Received: 28 August 2021 / Revised: 5 June 2022 / Accepted: 10 June 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022
Abstract
Topic detection is the task of determining and tracking hot topics in social media.
Twitter is arguably the most popular platform for people to share their ideas with
others about different issues. One such prevalent issue is the COVID-19 pandemic.
Detecting and tracking topics on these kinds of issues would help governments and
healthcare companies deal with this phenomenon. In this paper, we propose a novel,
multi-agent, communicative clustering approach, called ComStreamClust, for clustering
sub-topics inside a broader topic, e.g., COVID-19 and the FA CUP. The proposed
approach is parallelizable and can handle several data points simultaneously.
The LaBSE sentence embedding is used to measure the semantic similarity between
two tweets. ComStreamClust has been evaluated by several metrics such as keyword
precision, keyword recall, and topic recall. Based on topic recall for different numbers
of keywords, ComStreamClust obtains superior results when compared to the existing
methods.
Keywords Data stream · LaBSE · Semantic similarity · Stream clustering · Topic
detection
A. Najafi and A. Gholipour-Shilabin contributed equally to this work.
✉ Ali Najafi
Araz Gholipour-Shilabin
Rahim Dehkharghani
Ali Mohammadpur-Fard
Meysam Asgari-Chenaghlu
1 Department of Computer Science and Engineering, Sabanci University, Istanbul, Turkey
2 Department of Computer Engineering, University of Tabriz, Tabriz, Iran
3 Computer Engineering Department, Isik University, Istanbul, Turkey
4 Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
1 Introduction
The use of learning methods to find patterns in big data is an emerging field that
has attracted many researchers and yielded various techniques for diverse problems
[1–3]. Moreover, different data modalities have been studied in this research, and
recent advances in this area are producing methods for pattern discovery in
unstructured modalities [4, 5].
Social media, which has achieved growing popularity in recent decades, provides
the opportunity for people to share their ideas with an enormous number of users
worldwide. As a micro-blogging platform, Twitter allows its users to write short text
messages regarding various issues ranging from politics, economy, and healthcare to
routine tasks of people’s daily lives. One such issue, the COVID-19 pandemic, has
had a profound impact on people’s social lives since the beginning of 2020.
Determining and tracking health issues such as COVID-19 on Twitter would help
governments and healthcare companies better handle the impact of those diseases on
societies. Concretely, assembling tweets on this topic and analyzing them may result in
invaluable information for those companies. From the healthcare perspective, crawling
tweets related to COVID-19 as a pandemic issue might help in finding a remedy for
it. As manual processing of such information is prohibitively expensive, automatic
or semi-automatic methods are thus needed; however, assembling and distilling such
data is a challenging task.
Previous works have tackled this problem by streaming and grouping tweets into
various categories using supervised [6] or unsupervised [7] methods. Unsupervised
methods, however, have gained greater popularity. These methods collect streaming
tweets in a time interval and assign them to clusters based on their topics.
Clustering has already been used for topic detection in the literature. Stream data
clustering is a two-phase task. In the first phase, data are captured from a data
stream; in the second phase, clusters are created and (in this paper) reorganized
to form denser clusters. The ultimate goal is to increase intra-cluster similarity
and decrease inter-cluster similarity.
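The clustering objective stated above can be made concrete with a small sketch. The code below is illustrative only: the helper names (`cosine`, `mean_intra_similarity`, `mean_inter_similarity`) and the toy 2-D vectors are assumptions introduced here, and cosine similarity is assumed as the similarity measure. In the proposed approach the vectors would be sentence embeddings of tweets rather than hand-written pairs.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_similarity(cluster):
    """Average pairwise similarity among the members of one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_inter_similarity(cluster_a, cluster_b):
    """Average similarity between members of two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" (hypothetical data, not taken from the paper).
cluster_a = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2)]
cluster_b = [(0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]

intra_a = mean_intra_similarity(cluster_a)
inter_ab = mean_inter_similarity(cluster_a, cluster_b)
print(f"intra(a) = {intra_a:.3f}, inter(a, b) = {inter_ab:.3f}")
```

A good clustering in this sense is one where the intra-cluster value is high and the inter-cluster value is low, as is the case for the two toy clusters here.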
Two issues make clustering on streaming data a challenging task: (1) Concept drift
occurs over time and makes the clusters impure; through constant communication with
other agents, ComStreamClust prevents clusters from diverging and snowballing.
(2) The continuously increasing number of clusters drastically increases the time
complexity; we use parallelization techniques to overcome this challenge. To the best
of the authors’ knowledge, the existing methods in the literature do not address both
problems simultaneously. The proposed approach, as a multi-agent communicative
algorithm, addresses both problems and provides a viable solution.
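One way to picture the parallelization mentioned above is to score an incoming data point against every existing cluster concurrently. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses Python's standard `concurrent.futures.ThreadPoolExecutor`, represents each cluster by a hypothetical centroid vector, assumes cosine similarity, and introduces the helper name `best_cluster` purely for this example.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_cluster(point, centroids, max_workers=4):
    """Return (index, similarity) of the centroid most similar to `point`.

    Similarities are computed concurrently. `executor.map` yields results
    in input order, so positions line up with `centroids`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        sims = list(executor.map(lambda c: cosine(point, c), centroids))
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Hypothetical centroids and one incoming point (toy data).
centroids = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
index, similarity = best_cluster((0.6, 0.8), centroids)
print(index, round(similarity, 3))
```

Threads keep the example short; for CPU-bound scoring in CPython, a process pool or a vectorized computation would be a more realistic choice.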
To tackle the aforementioned problems, we propose a novel, communicative, multi-agent,
parallelizable text clustering approach for tweet clustering, evaluated on
the COVID-19 and the FA CUP datasets, which is described in greater detail in
Sect. 3. The key aspect of this work is its multi-agent and communicative structure.
The difference between this work and the existing ones is in the second phase (as
mentioned above). In the communication step of the proposed approach, existing
clusters may export data to and/or import data from other clusters. At the same time,
the proposed approach can also distinguish outlier data and exclude them from their
current clusters. All these tasks can be (and have been) accomplished in a parallel
setting. The contributions of the proposed approach can be summarized as follows.
– ComStreamClust updates clusters by detecting outliers and distributing them
among other clusters in a streaming, parallel, and multi-agent setting. This setting
is used here for the first time in the literature on topic detection problems.
– The proposed approach achieves promising results on the FA CUP dataset, to
which we also applied it for the sake of a fair comparison. The results obtained
are as good as or superior to those of existing approaches such as LDA, SFPM,
and BNgram.
– ComStreamClust benefits from a state-of-the-art sentence embedding model,
LaBSE, for measuring the semantic similarity between tweets.
– A comprehensive experimental evaluation of the proposed approach on two
datasets with different parameter values, such as the number of topics per time
slot and the number of keywords per topic, has been conducted.
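The communication step described above, in which clusters export outliers and import data from other clusters, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Sect. 3: the function `communicate`, the `outlier_threshold` value, the use of centroids, and the toy vectors are all hypothetical, and cosine similarity over plain tuples stands in for similarity between LaBSE sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(members):
    """Component-wise mean of the member vectors."""
    n = len(members)
    return tuple(sum(column) / n for column in zip(*members))


def communicate(clusters, outlier_threshold=0.8):
    """One hypothetical communication round between clusters.

    Each cluster exports the members whose similarity to its own centroid
    is below `outlier_threshold`. Every exported point is then imported by
    the cluster whose centroid is most similar to it. Centroids are frozen
    at the start of the round.
    """
    centroids = [centroid(members) for members in clusters]
    kept = [[] for _ in clusters]
    exported = []
    for i, members in enumerate(clusters):
        for point in members:
            if cosine(point, centroids[i]) < outlier_threshold:
                exported.append(point)  # outlier: leaves its current cluster
            else:
                kept[i].append(point)
    for point in exported:
        target = max(
            range(len(clusters)),
            key=lambda j: cosine(point, centroids[j]),
        )
        kept[target].append(point)
    return kept


# Toy data: the last point of the first cluster looks misplaced.
clusters = [
    [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9)],
    [(0.0, 1.0), (0.2, 0.8)],
]
updated = communicate(clusters)
print(updated)
```

In this toy run the misplaced point moves from the first cluster to the second. Note that in this sketch an exported point returns to its original cluster if that cluster is still its best match.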
2 Related Work
Ibrahim et al. [8] divide topic detection techniques into five groups: clustering,
frequent pattern mining, exemplar-based, matrix factorization, and probabilistic
models. The current research falls into the clustering-based group. Stream clustering is
a type of clustering in w (...truncated)