A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0127125&type=printable

Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

1 School of Computer Science and Information Technology, Northeast Normal University, Changchun, China
2 Key Lab of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun, China
3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
4 School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, United Kingdom
5 College of Computer Science and Technology, Jilin University, Changchun, China

Academic Editor: Fengfeng Zhou, Shenzhen Institutes of Advanced Technology, China

Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt a multi-source search, inspired by the idea of batch processing, to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of other popular algorithms for categorical data.

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 21127010 and 61202309 (http://www.nsfc.gov.cn/), China Postdoctoral Science Foundation under Grant No. 2013M530956 (http://res.chinapostdoctor.org.cn), the UK Economic & Social Research Council (ESRC), award reference ES/M001628/1 (http://www.esrc.ac.uk/), Science and Technology Development Plan of Jilin province under Grant No. 20140520068JH (http://www.jlkjt.gov.cn), Fundamental Research Funds for the Central Universities under No. 14QNJJ028 (http://www.nenu.edu.cn), and the open project program of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University under Grant No. 93K172014K07 (http://www.jlu.edu.cn). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

As an important technique in data mining, clustering analysis has been used in many fields [1,2], such as information retrieval [3], social media analysis [4], privacy preserving [5], image analysis [6], text analysis [7], and bioinformatics [8]. The aim of clustering is to group data objects with similar characteristics into the same cluster, and those with dissimilar characteristics into different clusters. Most existing clustering algorithms in the literature belong to one of two types: hierarchical and partitional. Hierarchical clustering algorithms organize a group of data objects into a dendrogram of nested partitions according to a divisive or agglomerative strategy [9]. Partitional clustering algorithms, in contrast, partition a set of data objects into a pre-defined number of clusters by optimizing an objective cost function. Center-based clustering algorithms are the most popular partitional clustering algorithms. The k-means algorithm is a widely used center-based partitional clustering algorithm due to its simplicity and high efficiency [10]. To account for the uncertainty of data objects, the fuzzy k-means algorithm [11] has also been developed.
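To make the idea of a center-based partitional method for categorical objects more concrete, the following is a minimal Python sketch of a single assignment-and-update pass in the style of k-modes. It is an illustration only, not the authors' one-step k-modes procedure, whose details are not given in this section. The function names `matching_dissimilarity` and `one_step_k_modes`, the toy data, and the handling of empty clusters are all assumptions made for the example. The sketch uses the simple matching dissimilarity (the number of attributes on which two objects differ) and takes the most frequent category of each attribute as the cluster center.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def one_step_k_modes(data, modes):
    """Run one assignment-and-update pass (hypothetical helper).

    Returns (new_modes, labels, cost), where cost is the total
    dissimilarity between each object and the mode of its cluster.
    """
    # Assignment step: each object joins the cluster with the nearest mode.
    labels = [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(row, modes[j]))
        for row in data
    ]

    # Update step: the new mode takes the most frequent category per attribute.
    new_modes = []
    for j, old_mode in enumerate(modes):
        members = [row for row, lab in zip(data, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster simply keeps its previous mode.
            new_modes.append(old_mode)
            continue
        new_modes.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )

    # Objective value: total mismatches between objects and their cluster modes.
    cost = sum(
        matching_dissimilarity(row, new_modes[lab]) for row, lab in zip(data, labels)
    )
    return new_modes, labels, cost


# Toy data set with three categorical attributes (made up for illustration).
toy_data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "small", "square"),
]
initial_modes = [toy_data[0], toy_data[3]]
new_modes, labels, cost = one_step_k_modes(toy_data, initial_modes)
```

Repeating such a pass until the labels stop changing gives a plain k-modes loop. Because the result depends on the initial modes, different starting points can end at different cost values, which is the local-optimum behavior that motivates combining the procedure with a global search heuristic.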
The k-means algorithm and the fuzzy k-means algorithm can only deal with numeric data. However, categorical data are frequently encountered in real-world applications, especially in the emerging field of social media analysis; for instance, Twitter users can be clustered based on profiles described by categorical attributes. For clustering categorical data, Huang extended these two classical algorithms and introduced the well-known k-modes algorithm and fuzzy k-modes algorithm [12–14]. However, one issue associated with the (fuzzy) k-means and (fuzzy) k-modes algorithms is that they may fall into local optima.

To address this issue, many heuristic clustering algorithms, which adopt optimization procedures in the clustering process, have been proposed. By introducing genetic algorithms (GAs), GA-based clustering approaches [15] have been developed, including the genetic k-means algorithm [16], the fast genetic k-means algorithm [17], and the genetic k-modes algorithm [18]. Among these GA-based clustering algorithms, the genetic k-modes algorithm [18] is suitable for categorical data. In addition, the following heuristic clustering algorithms are used to cluster numeric data. Selim and Al-Sultan introduced a simulated annealing algorithm for the clustering problem [19]. Maulik and Mukhopadhyay introduced a novel fuzzy clustering approach by integrating the simulated annealing heuristic with artificial neural networks [20]. Sung and Jin presented a tabu search-based clustering approach by combining the packing and releasing procedures [21].

Over the last decade, a few approaches have been developed to model the intelligent foraging behavior of social animals, such as birds and ants, for optimization problems, and these approaches have been successfully applied to clustering. Shelokar, Jayaraman, and Kulkarni proposed an ant colony clustering algorithm which simulates the way real ants look for an optimal path from their nest to a food source [22].
Kao, Zahara, and Kao integrated the particle swarm optimization (PSO) approach, which mimics the way birds find the optimal food sources in search space, with the k-means procedure and Nelder-Mead simplex search method for improving the performance of clustering [23]. Unlike Kao's approach, Tunchan proposed a pure PSO approach for clustering [24]. Chuang, Hsiao, and Yang presented an accelerated chaotic map particle swarm optimization (ACPSO) for clustering by integrating the chaotic map particle swarm optimization (CPSO) with an accelerated convergence rate strategy [25]. Wan et al. introduced a clustering algorithm on the basis of the optimization property of bacterial foraging behavior [26]. In recent years, investigating the foraging behavior of honeybees, including the learning, memorising, and information sharing mechanism, has emerged as an interesting research direction in swarm intelligence [27]. Inspired by the foraging behavior of bee swarms in the real world, Lučić and Teodorović introduced the bee colony optimization heuristic [28], which has been used for solving various engineering and management problems. Karaboga and Basturk pre (...truncated)