Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutions

Aasim Ayaz Wani
School of Engineering, Cornell University, Ithaca, New York, United States

ABSTRACT
This survey rigorously explores contemporary clustering algorithms within the machine learning paradigm, focusing on five primary methodologies: centroid-based, hierarchical, density-based, distribution-based, and graph-based clustering. Through the lens of recent innovations such as deep embedded clustering and spectral clustering, we analyze the strengths, limitations, and the breadth of application domains, ranging from bioinformatics to social network analysis. Notably, the survey introduces novel contributions by integrating clustering techniques with dimensionality reduction and proposing advanced ensemble methods to enhance stability and accuracy across varied data structures. This work uniquely synthesizes the latest advancements and offers new perspectives on overcoming traditional challenges like scalability and noise sensitivity, thus providing a comprehensive roadmap for future research and practical applications in data-intensive environments.

Subjects Artificial Intelligence, Data Mining and Machine Learning, Data Science
Keywords Clustering algorithms, Unsupervised learning, Scalability and efficiency, Centroid-based clustering, Hierarchical clustering, Density-based clustering, Distribution-based clustering, Clustering challenges and solutions

Submitted 22 May 2024
Accepted 6 August 2024
Published 29 August 2024
Corresponding author Aasim Ayaz Wani
Academic editor Davide Chicco
Additional Information and Declarations can be found on page 38
DOI 10.7717/peerj-cs.2286
Copyright 2024 Wani
Distributed under Creative Commons CC-BY 4.0

INTRODUCTION
Clustering algorithms constitute a fundamental component of unsupervised machine learning, facilitating the discovery of hidden patterns and structures within unlabeled datasets.
These algorithms partition data points into distinct groups or clusters based on their inherent similarities, ensuring that points within a cluster are more similar to each other than to those in other clusters. These techniques are critical across diverse fields such as bioinformatics, image segmentation, anomaly detection, and customer segmentation (Lan et al., 2018; Sofi & Wani, 2021; Feng et al., 2023). These applications underscore the significant role of clustering in extracting valuable insights from the vast amounts of data generated daily (Jun, Yoo & Choi, 2018; Xu & Tian, 2015).

However, despite their widespread application, clustering algorithms often face significant challenges when dealing with high-dimensional, noisy, and large-scale data. While previous surveys have provided valuable overviews of various clustering algorithms, the rapid advancements in the field necessitate an updated and comprehensive analysis of the latest techniques, their limitations, and innovative solutions (Fahad et al., 2014; Xu & Tian, 2015). This survey article aims to bridge this gap by providing an in-depth examination of both classical and state-of-the-art clustering algorithms, with a particular focus on their methodologies, strengths, and weaknesses. Moreover, we identify and discuss key challenges faced by clustering algorithms, such as the curse of dimensionality, initialization sensitivity, and scalability issues, and propose advanced solutions to overcome these obstacles.

How to cite this article Wani AA. 2024. Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutions. PeerJ Comput. Sci. 10:e2286 DOI 10.7717/peerj-cs.2286

The main objectives and contributions of this survey are as follows:
• Provide a comprehensive and up-to-date analysis of various clustering techniques, including centroid-based, hierarchical, density-based, distribution-based, autoencoder-based, and graph-based clustering methods.
• Discuss the methodologies, strengths, and limitations of each category of clustering algorithms, along with their practical applications across multiple domains.
• Identify key challenges and limitations of existing clustering algorithms.
• Propose and analyze advanced solutions to address these challenges, including dimensionality reduction techniques, ensemble clustering, and other state-of-the-art approaches.
• Highlight the importance of integrating clustering with other machine learning paradigms and emphasize the need for robust validation metrics to assess clustering outcomes effectively.

This article aims to bridge the gap between classical clustering methods and contemporary advancements by providing a comprehensive analysis of both traditional and state-of-the-art clustering algorithms. Our goal is to stimulate further research and development of clustering algorithms that are more efficient, robust, and adaptable to the complexities of real-world data. By addressing these issues and highlighting the importance of integrating clustering with other machine learning paradigms, we aim to contribute valuable insights and foster advancements in the field. This survey serves as a resource for researchers and practitioners, offering guidance on the selection and application of clustering techniques tailored to specific data characteristics and analytical needs.

The remainder of this article is organized as follows: “Categorization of Clustering Algorithms” details various clustering methods, discussing their methodologies and applications. “Practical Challenges of Existing Clustering Methods” explores the limitations and challenges faced by current clustering algorithms in various application scenarios. “Solutions for Overcoming Clustering Limitations” proposes innovative solutions and advanced methodologies to address these challenges.
Finally, “Conclusions and Future Work” summarizes the findings of this survey and discusses potential future research directions in the field of clustering algorithms.

Survey/search methodology
To ensure comprehensive and unbiased coverage of the literature, we employed a systematic and rigorous search methodology. We utilized multiple reputable search engines and academic databases, including Google Scholar, PubMed, and IEEE Xplore, chosen for their extensive coverage of computer science and data analysis research. Our search used a combination of terms such as “clustering algorithms”, “centroid-based clustering”, “K-means clustering”, “hierarchical clustering”, “density-based clustering”, “distribution-based clustering”, “Gaussian Mixture Models”, “graph-based clustering”, “clustering in high-dimensional data”, “clustering performance evaluation” and “clustering challenges and solutions”. Boolean operators (AND, OR) refined the queries to include studies directly addressing our research questions. Inclusion criteria were articles that focused on clustering algorithms and their applications, published within the last 15 years, peer-revi (...truncated)