Evaluation of Data Clustering Accuracy using K-Means Algorithm

International Journal of Multidisciplinary Approach Research and Science
E-ISSN 2987-226X, P-ISSN 2988-0076
Volume 2, Issue 01, January 2024, pp. 385-396
DOI: https://doi.org/10.59653/ijmars.v2i01.504
Copyright by Author

Suraya1, Muhammad Sholeh2*, Uning Lestari3

1 Computer Systems Engineering Study Programme, Faculty of Applied Science, AKPRIND Institute of Science & Technology Yogyakarta, Indonesia
2 Informatics Study Programme, Faculty of Information Technology and Business, AKPRIND Institute of Science & Technology Yogyakarta, Indonesia
3 Informatics Study Programme, Faculty of Information Technology and Business, AKPRIND Institute of Science & Technology Yogyakarta, Indonesia
* Correspondence Author

Received: 05-12-2023; Reviewed: 10-12-2023; Accepted: 21-12-2023

Abstract

Data clustering is a data science method that is widely used in data analysis to group the records of a dataset and to find patterns or relationships between them. This research aims to evaluate the accuracy of data clustering using the K-Means algorithm on the wine dataset, which has 13 features that describe the chemical characteristics of three types of wine. The clustering process should produce the best possible clustering evaluation metrics. The clustering results of the K-Means algorithm are evaluated with the Davies-Bouldin index and the Silhouette score. The research steps involve data standardization, selection of the optimal number of clusters, and assessment of clustering accuracy. The research method follows Knowledge Discovery in Databases (KDD), which consists of pre-processing, transformation, model building, and model evaluation. Experimental results show that appropriate parameters and cluster initialization can improve the clustering evaluation metrics.
The clustering results show that, on the normalized dataset, the Davies-Bouldin index yields 2 groups and the Silhouette score yields 3 groups. Before normalization, the Davies-Bouldin index yields 7 groups and the Silhouette score yields 2 groups. In conclusion, this study produced different evaluation metrics for the normalized and non-normalized datasets. The number of groups to use depends on the context of the data analysis performed; here 3 groups are selected, which can be labelled "Superior Variety", "Intermediate Variety", and "Standard Variety".

Keywords: Metrics, evaluation, normalized clustering, labels

Introduction

In the rapidly growing digital era, data processing and analysis have become critical elements in generating valuable information for various fields. One popular approach in data analysis is clustering, which aims to identify patterns or structures hidden in datasets. Clustering groups objects into clusters based on the similarity of certain characteristics: objects that are similar are placed in the same group, while objects that differ are placed in different groups. The basic concepts in clustering are the similarity between objects, the formation of clusters as the result of the clustering process, the centroid as the centre point of a group in the K-Means algorithm, and the measurement of the distance between objects in the feature space (Cielen et al., 2016), (Ozdemir, 2017).

Various clustering methods can be applied, and the choice of method is adjusted to the characteristics of the data and the purpose of the analysis. Algorithms used in clustering models include K-Means, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Models (Deny Jollyta, Muhammad Siddik, Herman Mawengkang, 2021), (Mathur, 2019).
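The notions of centroid and distance introduced above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names and the three sample points are hypothetical, chosen only to show that Euclidean distance is dominated by whichever feature has the largest scale, which is the reason the data are standardized before clustering.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a group of points."""
    return [mean(col) for col in zip(*points)]

def standardize(points):
    """Z-score each feature column: (value - mean) / population std dev."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(p, mus, sigmas)] for p in points]

# Hypothetical samples: [small-scale feature, large-scale feature]
samples = [[12.0, 500.0], [14.0, 510.0], [12.2, 900.0]]

# Distances from sample 0 on the raw scale
raw_01 = euclidean(samples[0], samples[1])
raw_02 = euclidean(samples[0], samples[2])

# The same distances after standardization
scaled = standardize(samples)
std_01 = euclidean(scaled[0], scaled[1])
std_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)
print(std_01, std_02)
```

On the raw scale the two distances are roughly 10 and 400, so the second feature alone decides which samples look similar. After standardization both distances are close to 2.2, and both features contribute comparably.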
The clustering process can be used for understanding data structure, customer segmentation in the marketing industry, image analysis, and other analysis purposes. Selecting an appropriate clustering method and understanding its results are critical to ensuring the relevance and success of the analysis (Amanda & Veronica Sitorus, 2021), (Garang, 2022).

Literature Review

The K-Means algorithm is a popular data analysis technique that is widely used in data clustering (Purba et al., 2022). It uses a partitioning approach to group data based on similarity, with each cluster represented by a centroid point (Informatics & Polinema, 2020). The process iteratively updates the centroid points until the data points are optimally grouped into clusters (Asmiatun et al., 2019). Measuring the accuracy of the K-Means algorithm is very important in evaluating the effectiveness of the clustering process (Awaludin, 2014). The accuracy of the algorithm is influenced by several factors, including the initial centroid values, the number of clusters, and the distance metric used to calculate the similarity between data points (Faizah et al., 2020), (Dewi & Pramita, 2019).

The K-Means algorithm has been applied in various studies to cluster data and evaluate the accuracy of the clustering process (Kurniadi et al., 2023), (Tambunan, 2021), (Listiani et al., 2019). Nurjanah (Nurjanah & Arifin, 2021) applied the K-Means method to analyse travel review data. As a clustering algorithm, K-Means can help identify patterns and groups of words in reviews, enabling a deeper understanding of users' impressions of a place. Natsir (Dewi & Pramita, 2019) clustered data on books borrowed from a library, Mujiono (Muliono & Sembiring, 2019) used the K-Means algorithm to cluster data on the tri darma activities of lecturers, and Priyatman (Priyatman et al., 2019) built a clustering model for use in promotional mapping.
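The iterative procedure described above, in which points are assigned to the nearest centroid and the centroids are then updated, can be sketched in a few lines of Python. This is an illustrative Lloyd-style implementation under stated assumptions, not the code of any cited study: the function `kmeans`, its parameters, and the six sample points are hypothetical, and the initial centroids are passed in explicitly because the result depends on them.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd-style K-Means; the number of clusters is len(initial_centroids).

    Returns (centroids, labels), where labels[i] is the index of the
    cluster that points[i] belongs to.
    """
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical 2-D data forming two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]

centroids, labels = kmeans(points, initial_centroids=[[1.0, 1.0], [8.0, 8.0]])
print(labels)
print(centroids)
```

The number of clusters and the distance function are the other two factors named above. In this sketch they correspond to the length of `initial_centroids` and to the `euclidean` helper, so either can be varied to observe its effect on the grouping.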
Once a model has been built, its accuracy must be assessed. Evaluating the accuracy of a clustering model makes it possible to identify the best clustering. Metrics for measuring the accuracy of clustering models include the Silhouette Score, the Davies-Bouldin Index, and the Adjusted Rand Index (ARI). Evaluating clustering models requires selecting metrics that are appropriate to the task context and the data characteristics. These metrics are often used together to provide a more comprehensive picture of the quality of the clustering produced by the model. By understanding the strengths and weaknesses of each metric, researchers and practitioners can make more informed decisions when assessing the performance of a clustering model and can understand the extent to which the model uncovers meaningful structure in the data in the absence of true class labels. The Davies-Bouldin index is calculated with respect to two main aspects, namely the cohesiveness and unity of each group. Cohesiveness measures th (...truncated)