Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutions
Aasim Ayaz Wani
School of Engineering, Cornell University, Ithaca, New York, United States
ABSTRACT
This survey rigorously explores contemporary clustering algorithms within the
machine learning paradigm, focusing on five primary methodologies: centroid-based,
hierarchical, density-based, distribution-based, and graph-based clustering. Through
the lens of recent innovations such as deep embedded clustering and spectral
clustering, we analyze the strengths, limitations, and the breadth of application
domains—ranging from bioinformatics to social network analysis. Notably, the
survey introduces novel contributions by integrating clustering techniques with
dimensionality reduction and proposing advanced ensemble methods to enhance
stability and accuracy across varied data structures. This work uniquely synthesizes
the latest advancements and offers new perspectives on overcoming traditional
challenges like scalability and noise sensitivity, thus providing a comprehensive
roadmap for future research and practical applications in data-intensive
environments.
Subjects Artificial Intelligence, Data Mining and Machine Learning, Data Science
Keywords Clustering algorithms, Unsupervised learning, Scalability and efficiency, Centroid-based
clustering, Hierarchical clustering, Density-based clustering, Distribution-based clustering,
Clustering challenges and solutions
Submitted 22 May 2024
Accepted 6 August 2024
Published 29 August 2024
Corresponding author Aasim Ayaz Wani
Academic editor Davide Chicco
Additional Information and Declarations can be found on page 38
DOI 10.7717/peerj-cs.2286
Copyright 2024 Wani
Distributed under Creative Commons CC-BY 4.0
INTRODUCTION
Clustering algorithms constitute a fundamental component of unsupervised machine
learning, facilitating the discovery of hidden patterns and structures within unlabeled
datasets. These algorithms partition data points into distinct groups or clusters based on
their inherent similarities, ensuring that points within a cluster are more similar to each
other than to those in other clusters. These techniques are critical across diverse fields such
as bioinformatics, image segmentation, anomaly detection, and customer segmentation
(Lan et al., 2018; Sofi & Wani, 2021; Feng et al., 2023). These applications underscore the
significant role of clustering in extracting valuable insights from the vast amounts of data
generated daily (Jun, Yoo & Choi, 2018; Xu & Tian, 2015). However, despite their widespread
application, clustering algorithms often face significant challenges when dealing with high-dimensional, noisy, and large-scale data.
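The defining property stated above, that points within a cluster are more similar to each other than to points in other clusters, can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure; the toy points, the fixed label assignment, and the helper name mean_pairwise_distances are illustrative assumptions, not taken from the survey.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
# Assumed cluster assignment; it is given here, not learned by an algorithm.
labels = [0, 0, 0, 1, 1, 1]


def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    # Visit every unordered pair of labeled points exactly once.
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_pairwise_distances(points, labels)
print(f"within={within:.3f}, between={between:.3f}")
```

For a sensible partition the within-cluster mean should be clearly smaller than the between-cluster mean; a labeling that mixes the two groups would reverse that relationship on this data.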
How to cite this article Wani AA. 2024. Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutions. PeerJ Comput. Sci. 10:e2286 DOI 10.7717/peerj-cs.2286
While previous surveys have provided valuable overviews of various clustering
algorithms, the rapid advancements in the field necessitate an updated and comprehensive
analysis of the latest techniques, their limitations, and innovative solutions (Fahad et al.,
2014; Xu & Tian, 2015). This survey article aims to bridge this gap by providing an in-depth examination of both classical and state-of-the-art clustering algorithms, with a
particular focus on their methodologies, strengths, and weaknesses. Moreover, we identify
and discuss key challenges faced by clustering algorithms, such as the curse of
dimensionality, initialization sensitivity, and scalability issues, and propose advanced
solutions to overcome these obstacles. The main objectives and contributions of this survey
are as follows:
• Provide a comprehensive and up-to-date analysis of various clustering techniques,
including centroid-based, hierarchical, density-based, distribution-based,
autoencoder-based, and graph-based clustering methods.
• Discuss the methodologies, strengths, and limitations of each category of clustering
algorithms, along with their practical applications across multiple domains.
• Identify key challenges and limitations of existing clustering algorithms.
• Propose and analyze advanced solutions to address these challenges, including
dimensionality reduction techniques, ensemble clustering, and other state-of-the-art
approaches.
• Highlight the importance of integrating clustering with other machine learning
paradigms and emphasize the need for robust validation metrics to assess clustering
outcomes effectively.
This article aims to bridge the gap between classical clustering methods and
contemporary advancements by providing a comprehensive analysis of both traditional
and state-of-the-art clustering algorithms. Our goal is to stimulate further research and
development of clustering algorithms that are more efficient, robust, and adaptable to the
complexities of real-world data. By addressing the challenges identified above and highlighting the
importance of integrating clustering with other machine learning paradigms, we aim to
contribute valuable insights and foster advancements in the field. This survey serves as a
resource for researchers and practitioners, offering guidance on the selection and
application of clustering techniques tailored to specific data characteristics and analytical
needs.
The remainder of this article is organized as follows: “Categorization of Clustering
Algorithms” details various clustering methods, discussing their methodologies and
applications. “Practical Challenges of Existing Clustering Methods” explores the
limitations and challenges faced by current clustering algorithms in various application
scenarios. “Solutions for Overcoming Clustering Limitations” proposes innovative
solutions and advanced methodologies to address these challenges. Finally, “Conclusions
and Future Work” summarizes the findings of this survey and discusses potential future
research directions in the field of clustering algorithms.
Survey/search methodology
To ensure comprehensive and unbiased coverage of the literature, we employed a
systematic and rigorous search methodology. We utilized multiple reputable search
engines and academic databases, including Google Scholar, PubMed, and IEEE Xplore,
chosen for their extensive coverage of computer science and data analysis research. Our
search used a combination of terms such as “clustering algorithms”, “centroid-based
clustering”, “K-means clustering”, “hierarchical clustering”, “density-based clustering”,
“distribution-based clustering”, “Gaussian Mixture Models”, “graph-based clustering”,
“clustering in high-dimensional data”, “clustering performance evaluation” and “clustering
challenges and solutions”. Boolean operators (AND, OR) refined the queries to include
studies directly addressing our research questions. Inclusion criteria were articles that
focused on clustering algorithms and their applications, published within the last 15 years,
peer-revi (...truncated)