Evaluation of Data Clustering Accuracy using K-Means Algorithm
International Journal of Multidisciplinary Approach Research and Science
E-ISSN 2987-226X P-ISSN 2988-0076
Volume 2 Issue 01, January 2024, Pp. 385-396
DOI: https://doi.org/10.59653/ijmars.v2i01.504
Copyright by Author
Evaluation of Data Clustering Accuracy using K-Means Algorithm
Suraya1, Muhammad Sholeh2*, Uning Lestari3
1 Computer Systems Engineering Study Programme, Faculty of Applied Science, AKPRIND Institute of
Science & Technology Yogyakarta, Indonesia
2 Informatics Study Programme, Faculty of Information Technology and Business, AKPRIND Institute of
Science & Technology Yogyakarta, Indonesia
3 Informatics Study Programme, Faculty of Information Technology and Business, AKPRIND Institute of
Science & Technology Yogyakarta, Indonesia
* Corresponding author
Received: 05-12-2023
Reviewed: 10-12-2023
Accepted: 21-12-2023
Abstract
Data clustering is a data science method that is widely used in data analysis. It groups the
records of a dataset in order to find patterns or relationships in the data. This research aims
to evaluate the accuracy of data clustering with the K-Means algorithm on the wine dataset,
which has 13 features describing the chemical characteristics of three types of wine. The
clustering process should yield the best possible evaluation metrics. The K-Means clustering
results are evaluated and compared using two metrics, the Davies-Bouldin index and the
Silhouette score. The research steps involve data standardization, selection of the optimal
number of clusters, and assessment of clustering accuracy. The research method follows KDD
(Knowledge Discovery in Databases), which consists of pre-processing, transformation, model
building and model evaluation. Experimental results show that appropriate parameters and
cluster initialization can improve the clustering evaluation metrics. On the normalized
dataset, the Davies-Bouldin index indicates 2 groups and the Silhouette score indicates 3
groups. Before normalization, the Davies-Bouldin index indicates 7 groups and the Silhouette
score indicates 2 groups. In conclusion, the normalized and non-normalized datasets produce
different evaluation metrics. The number of groups to use depends on the context of the data
analysis; here 3 groups are selected, which can be labelled "Superior Variety", "Intermediate
Variety" and "Standard Variety".
Keywords: Metrics, evaluation, normalized clustering, labels
Introduction
In the rapidly growing digital era, data processing and analysis have become critical
elements in generating valuable information for various fields. One popular approach in data
analysis is clustering, which aims to identify patterns or structures hidden in datasets.
Clustering is a data analysis method that aims to group objects into groups or clusters
based on the similarity of certain characteristics. In this context, objects that have similarities
will be placed in one group, while objects that are different will be placed in different groups.
The basic concepts in clustering involve the similarity between objects, the formation of
clusters as a result of the clustering process, the centroid as the group centre point in the
K-Means algorithm, and the measurement of the distance between objects in the feature space
(Cielen et al., 2016), (Ozdemir, 2017).
Various clustering methods can be applied; the choice of method depends on the
characteristics of the data and the purpose of the analysis. Algorithms used in
clustering models include K-Means, Hierarchical Clustering, DBSCAN, and Gaussian Mixture
Models (Deny Jollyta, Muhammad Siddik, Herman Mawengkang, 2021), (Mathur, 2019). The
clustering process can be used for data structure understanding, customer segmentation in the
marketing industry, image analysis, and other analysis purposes. The selection of an
appropriate clustering method and understanding of the results is critical to ensure the relevance
and success of the analysis (Amanda & Veronica Sitorus, 2021), (Garang, 2022).
Literature Review
The K-Means algorithm is one of the popular data analysis techniques that is widely used
in the data clustering process (Purba et al., 2022). It uses a partitioning system to group data
based on their similarity, with each cluster represented by a centroid point (Informatics &
Polinema, 2020). The process involves iteratively updating the centroid points until the data
points are optimally grouped into clusters (Asmiatun et al., 2019). Measuring the accuracy of
the K-Means algorithm is very important in evaluating the effectiveness of the clustering
process (Awaludin, 2014). The accuracy of the algorithm is influenced by several factors
including the initial centroid point value, the number of clusters, and the distance metric used
to calculate the similarity between data points (Faizah et al., 2020), (Dewi & Pramita, 2019).
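The iterative procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this study: the function names (`euclidean`, `mean_point`, `kmeans`), the `init` parameter and the toy data are assumptions introduced here. The sketch uses Euclidean distance and lets the caller pass the initial centroids explicitly, since the text notes that both the distance metric and the starting centroids influence the result.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (centroids, labels) once assignments stop changing.

    `init` is an optional list of k starting centroids (hypothetical
    parameter for this sketch); if omitted, k data points are sampled.
    """
    rng = random.Random(seed)
    if init is not None:
        centroids = [list(c) for c in init]
    else:
        centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, labels


# Toy two-dimensional data (made up for illustration only).
toy_points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
toy_centroids, toy_labels = kmeans(
    toy_points, k=2, init=[[0.0, 0.0], [10.0, 10.0]]
)
print(toy_labels)
print(toy_centroids)
```

Passing different `init` values or a different `k` to the same data is a simple way to observe how the factors listed above change the grouping.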
The K-Means algorithm has been applied in various studies to cluster data and evaluate
the accuracy of the clustering process (Kurniadi et al., 2023), (Tambunan, 2021), (Listiani et
al., 2019). Nurjanah (Nurjanah & Arifin, 2021) applied the K-Means method to analyse travel
review data. As a clustering algorithm, K-Means can help identify patterns and groups of words
in reviews, enabling a deeper understanding of users' impressions of a place. Natsir (Dewi &
Pramita, 2019) clustered data on books borrowed from a library, Mujiono (Muliono & Sembiring,
2019) used the K-Means algorithm to cluster data on the tri darma activities of lecturers, and
Priyatman, H (Priyatman et al., 2019) built a clustering model for use in promotional mapping.
Once a model has been built, the accuracy it achieves must be examined. Evaluating the
accuracy of a clustering model makes it possible to identify the best clustering result.
Metrics for measuring the accuracy of clustering models include the Silhouette Score, the
Davies-Bouldin Index and the Adjusted Rand Index (ARI).
Evaluation of clustering models requires the selection of metrics that are appropriate to
the task context and data characteristics. These metrics are often used together to provide a
more comprehensive picture of the quality of clustering produced by the model. By
understanding the strengths and weaknesses of each metric, researchers and practitioners can
make more informed decisions in assessing the performance of a clustering model and
understand the extent to which the model is able to uncover meaningful structure in the data in
the absence of true class labels.
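To make the two label-free metrics concrete, the following plain-Python sketch computes the mean Silhouette score and the Davies-Bouldin index from their standard definitions. It is an illustration under stated assumptions, not the code used in this study: the function names and the toy data are introduced here, Euclidean distance is assumed, and cluster centroids are assumed to be distinct. A higher Silhouette score (at most 1) and a lower Davies-Bouldin index (at least 0) both indicate more compact, better separated clusters.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def group_by_label(points, labels):
    """Map each cluster label to the list of points carrying it."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette_score(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centroids = {lab: mean_point(m) for lab, m in clusters.items()}
    # S_i: mean distance of a cluster's members to its own centroid
    scatter = {
        lab: sum(euclidean(p, centroids[lab]) for p in m) / len(m)
        for lab, m in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data and two candidate labelings (made up for illustration only).
toy_points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
compact_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]
for name, labs in (("compact", compact_labels), ("mixed", mixed_labels)):
    print(name, silhouette_score(toy_points, labs),
          davies_bouldin_index(toy_points, labs))
```

Scoring the labelings obtained for several candidate numbers of clusters in this way, and comparing the values, is one common way to choose the number of clusters; the two metrics do not always agree, which is why they are often read together.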
The Davies-Bouldin index is calculated with respect to two main aspects, namely the
cohesiveness and unity of each group. Cohesiveness measures th (...truncated)