ComStreamClust: a Communicative Multi-Agent Approach to Text Clustering in Streaming Data

Annals of Data Science
https://doi.org/10.1007/s40745-022-00426-4

Ali Najafi (1) · Araz Gholipour-Shilabin (2) · Rahim Dehkharghani (3) · Ali Mohammadpur-Fard (4) · Meysam Asgari-Chenaghlu (2)

Received: 28 August 2021 / Revised: 5 June 2022 / Accepted: 10 June 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022

Abstract
Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas with others about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on these kinds of issues would help governments and healthcare companies deal with this phenomenon. In this paper, we propose a novel, multi-agent, communicative clustering approach, called ComStreamClust, for clustering sub-topics inside a broader topic, e.g., COVID-19 and the FA CUP. The proposed approach is parallelizable and can handle several data points simultaneously. The LaBSE sentence embedding is used to measure the semantic similarity between two tweets. ComStreamClust has been evaluated by several metrics such as keyword precision, keyword recall, and topic recall. Based on topic recall for different numbers of keywords, ComStreamClust obtains superior results compared to the existing methods.

A. Najafi and A. Gholipour-Shilabin contributed equally to this work.

(1) Department of Computer Science and Engineering, Sabanci University, Istanbul, Turkey
(2) Department of Computer Engineering, University of Tabriz, Tabriz, Iran
(3) Computer Engineering Department, Isik University, Istanbul, Turkey
(4) Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Keywords Data stream · LaBSE · Semantic similarity · Stream clustering · Topic detection

1 Introduction

The use of learning to find patterns in big data is an emerging field that has attracted many researchers and yielded various techniques for diverse problems [1–3]. Moreover, different data modalities have been studied in this research, and recent advances are producing methods for finding patterns in unstructured modalities [4, 5].

Social media, which has achieved growing popularity in recent decades, provides the opportunity for people to share their ideas with an enormous number of users worldwide. As a micro-blogging platform, Twitter allows its users to write short text messages regarding various issues ranging from politics, economy, and healthcare to routine tasks of people’s daily lives. One such issue, the COVID-19 pandemic, has had a profound impact on people’s social lives since the beginning of 2020. Determining and tracking health issues such as COVID-19 on Twitter would help governments and healthcare companies better handle the impact of those diseases on societies. Concretely, assembling tweets on this topic and analyzing them may yield invaluable information for those companies. From the healthcare perspective, crawling tweets related to COVID-19 as a pandemic issue might help in finding a remedy for it.

As manual processing of such information is prohibitively expensive, automatic or semi-automatic methods are needed; however, assembling and distilling such data is a challenging task. Previous works have tackled this problem by streaming and grouping tweets into various categories using supervised [6] or unsupervised [7] methods. Unsupervised methods, however, have gained greater popularity. These methods collect streaming tweets in a time interval and assign them to clusters based on their topics. Clustering has already been used for topic detection in the literature.
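The generic unsupervised setting described above (collect streaming tweets in a time interval, then assign them to clusters by topic) can be illustrated with a minimal sketch. This is not the ComStreamClust algorithm; it is a plain greedy threshold clustering given only for intuition. The function names, the 60-second slot length, the 0.8 similarity threshold, and the three-dimensional toy vectors are all assumptions made for illustration; in practice the vectors would be sentence embeddings such as those produced by LaBSE.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_time_slot(items, threshold=0.8):
    """Greedy assignment (illustrative assumption, not the paper's method):
    each item joins the most similar cluster whose centroid similarity is at
    least `threshold`, otherwise it opens a new cluster."""
    clusters = []  # each cluster is a list of (text, vector) pairs
    for text, vec in items:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, centroid([v for _, v in cluster]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([(text, vec)])
        else:
            best.append((text, vec))
    return clusters


def cluster_stream(stream, slot_seconds=60, threshold=0.8):
    """Group (timestamp, text, vector) records into fixed time slots and
    cluster each slot independently."""
    slots = defaultdict(list)
    for ts, text, vec in stream:
        slots[int(ts // slot_seconds)].append((text, vec))
    return {slot: cluster_time_slot(slots[slot], threshold)
            for slot in sorted(slots)}


# Toy stream: hypothetical tweets with hand-made 3-d "embeddings".
stream = [
    (5, "vaccine rollout begins", [0.9, 0.1, 0.0]),
    (20, "new vaccine doses arrive", [0.8, 0.2, 0.0]),
    (40, "lockdown extended", [0.0, 0.1, 0.9]),
    (70, "vaccine side effects", [0.85, 0.15, 0.0]),
]
result = cluster_stream(stream)
for slot, clusters in result.items():
    print(slot, [[text for text, _ in c] for c in clusters])
```

In this toy run the two vaccine tweets in the first slot share a cluster, the lockdown tweet opens its own, and the tweet arriving after 60 seconds falls into a second slot. A fixed threshold with independent slots is the simplest possible choice and does nothing to counter concept drift or cluster growth.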
In stream data clustering, a two-phase task is accomplished. In the first phase, data are captured from a data stream; in the second phase, clusters are created and (in this paper) reorganized to constitute denser clusters. The ultimate goal is to increase the intra-cluster similarities and decrease the inter-cluster similarities. Two issues make clustering on streaming data a challenging task: 1) Concept drift occurs over time and makes the clusters impure; through constant communication with other agents, ComStreamClust prevents clusters from diverging and snowballing. 2) The continuously increasing number of clusters drastically increases the time complexity; we use parallelization techniques to overcome this challenge. To the best of the authors’ knowledge, the existing methods in the literature do not address both problems simultaneously. The proposed approach, as a multi-agent communicative algorithm, addresses both problems and provides a viable solution.

To tackle the aforementioned problems, we propose a novel, communicative, multi-agent, parallelizable text clustering approach for tweet clustering, evaluated on the COVID-19 and the FA CUP datasets, which is described in greater detail in Sect. 3. The key aspect of this work is its multi-agent and communicative structure. The difference between this work and the existing ones lies in the second phase (as mentioned above). In the communication step of the proposed approach, existing clusters may export data to and/or import data from other clusters. At the same time, the proposed approach can also distinguish outlier data and exclude them from their current clusters. All these tasks can be (and have been) accomplished in a parallel setting.

The contributions of the proposed approach can be summarized as follows.

– ComStreamClust updates clusters by detecting outliers and distributing them among other clusters in a streaming, parallel, and multi-agent setting.
This setting is used for the first time in the literature on topic detection problems.
– The proposed approach achieves promising results when applied to the FA CUP dataset. We also applied our approach to this dataset for the sake of fair comparison. The obtained results were as good as or superior to those of existing approaches such as LDA, SFPM, and BNgram.
– ComStreamClust benefits from a state-of-the-art sentence embedding model, LaBSE, for measuring the semantic similarity between tweets.
– A comprehensive experimental evaluation of the proposed approach has been conducted on two datasets with different parameter values, such as the number of topics per time slot and the number of keywords per topic.

2 Related Work

Ibrahim et al. [8] divide topic detection techniques into five groups: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. The current research falls into the clustering-based group. Stream clustering is a type of clustering in w (...truncated)