A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data
May
Jinchao Ji 0 1
Wei Pang 0 1
Yanlin Zheng 0 1
Zhe Wang 0 1
Zhiqiang Ma 0 1
0 1 School of Computer Science and Information Technology, Northeast Normal University , Changchun , China , 2 Key Lab of Intelligent Information Processing of Jilin Universities, Northeast Normal University , Changchun , China , 3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University , Changchun , China , 4 School of Natural and Computing Sciences, University of Aberdeen , Aberdeen , United Kingdom, 5 College of Computer Science and Technology, Jilin University , Changchun , China
1 Academic Editor: Fengfeng Zhou, Shenzhen Institutes of Advanced Technology , CHINA
Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data.
Funding: This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 21127010 and 61202309 (http://www.nsfc.gov.cn/), the China Postdoctoral Science Foundation under Grant No. 2013M530956 (http://res.chinapostdoctor.org.cn), the UK Economic & Social Research Council (ESRC), award reference ES/M001628/1 (http://www.esrc.ac.uk/), the Science and Technology Development Plan of Jilin province under Grant No. 20140520068JH (http://www.jlkjt.gov.cn), the Fundamental Research Funds for the Central Universities under No. 14QNJJ028 (http://www.nenu.edu.cn), and the open project program of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University under Grant No. 93K172014K07 (http://www.jlu.edu.cn). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
As an important technique in data mining, clustering analysis has been used in many fields
[1,2], such as information retrieval [3], social media analysis [4], privacy preserving [5], image
analysis [6], text analysis [7], and bioinformatics [8]. The aim of clustering is to group those
data objects with similar characteristics into the same clusters, and the ones with dissimilar
characteristics into different clusters. Most existing clustering algorithms in the literature
belong to one of the following two types: hierarchical and partitional. Hierarchical clustering
algorithms allocate a group of data objects into a dendrogram of the nested partitions according
to a divisive or agglomerative strategy [9]. Partitional clustering algorithms, by contrast, partition a
set of data objects into a pre-defined number of clusters by optimizing an objective cost
function.
Center-based clustering algorithms are the most popular partitional clustering algorithms.
The k-means algorithm is a widely used center-based partitional clustering algorithm due to its
simplicity and high efficiency [10]. Considering the uncertainty of data objects, the fuzzy
k-means algorithm [11] was also developed. The k-means algorithm and the fuzzy k-means
algorithm can only deal with numeric data. However, categorical data are frequently encountered
in real-world applications, especially in the emerging field of social media analysis. For instance,
Twitter users can be clustered based on profiles described by categorical attributes. For
clustering categorical data, Huang extended these two classical algorithms and introduced the
well-known k-modes algorithm and fuzzy k-modes algorithm [12-14]. However, one issue
associated with the (fuzzy) k-means and (fuzzy) k-modes algorithms is that they may fall into local optima.
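To make the k-modes idea concrete, the following is a minimal sketch, assuming the simple matching dissimilarity and the frequency-based mode update usually associated with k-modes. The function names and the toy data are hypothetical and chosen only for illustration; this is not the ABC-K-Modes procedure itself.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Count the attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))


def assign(data, modes):
    """Assign each object to the index of its nearest mode (ties go to the lowest index)."""
    return [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(x, modes[j]))
        for x in data
    ]


def update_modes(data, labels, modes):
    """Recompute each mode as the most frequent category of every attribute."""
    new_modes = []
    for j, old_mode in enumerate(modes):
        members = [x for x, lab in zip(data, labels) if lab == j]
        if not members:
            new_modes.append(old_mode)  # assumption: an empty cluster keeps its mode
            continue
        new_modes.append(
            tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))
        )
    return new_modes


def k_modes(data, modes, max_iter=100):
    """Alternate assignment and mode update until the modes stop changing."""
    for _ in range(max_iter):
        labels = assign(data, modes)
        new_modes = update_modes(data, labels, modes)
        if new_modes == modes:
            break
        modes = new_modes
    labels = assign(data, modes)
    cost = sum(matching_dissimilarity(x, modes[lab]) for x, lab in zip(data, labels))
    return modes, labels, cost


# Toy data: six objects described by three categorical attributes.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
modes, labels, cost = k_modes(data, modes=[data[0], data[3]])
print(modes, labels, cost)
```

Because the loop only ever moves to a neighbouring partition with lower or equal cost, the final result depends on the initial modes passed in, which is the sensitivity that motivates the heuristic search methods discussed next.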
To address this issue, many heuristic clustering algorithms, which adopt the optimization
procedures in the clustering process, have been proposed. By introducing genetic algorithms
(GAs), the GA-based clustering approaches [15], including the genetic k-means algorithm
[16], the fast genetic k-means algorithm [17], and the genetic k-modes algorithm [18] have
been developed. Among these GA-based clustering algorithms, the genetic k-modes algorithm
[18] is suitable for categorical data. In addition, several heuristic clustering algorithms
have been proposed for numeric data. Selim and Al-Sultan introduced a simulated annealing
algorithm for the clustering problem [19]. Maulik and Mukhopadhyay introduced a novel fuzzy
clustering approach by integrating the simulated annealing heuristic with artificial neural
networks [20]. Sung and Jin presented a tabu search-based clustering approach by combining the
packing and releasing procedures [21].
Over the last decade, a few approaches have been developed to model the intelligent
foraging behavior of social animals, such as birds and ants, for optimization problems, and these
approaches have been successfully applied to clustering. Shelokar, Jayaraman, and Kulkarni
proposed an ant colony clustering algorithm which simulates the way real ants look for an
optimal path from their nest to a food source [22]. Kao, Zahara, and Kao integrated the particle
swarm optimization (PSO) approach, which mimics the way birds find the optimal food
sources in search space, with the k-means procedure and Nelder-Mead simplex search method
for improving the performance of clustering [23]. Unlike Kao's approach, Tunchan proposed a
pure PSO approach for clustering [24]. Chuang, Hsiao, and Yang presented an accelerated
chaotic map particle swarm optimization (ACPSO) for clustering by integrating the chaotic map
particle swarm optimization (CPSO) with an accelerated convergence rate strategy [25]. Wan
et al. introduced a clustering algorithm on the basis of the optimization property of bacterial
foraging behavior [26].
In recent years, investigating the foraging behavior of honeybees, including the learning,
memorising, and information sharing mechanism, has emerged as an interesting research
direction in swarm intelligence [27]. Inspired by the foraging behavior of bee swarms in the real
world, Lučić and Teodorović introduced the bee colony optimization heuristic [28], which has
been used for solving various engineering and management problems. Karaboga and Basturk
pre (...truncated)