Incremental Gene Expression Programming Classifier with Metagenes and Data Reduction
Hindawi
Complexity
Volume 2018, Article ID 6794067, 12 pages
https://doi.org/10.1155/2018/6794067
Research Article
Joanna Jedrzejowicz (1) and Piotr Jedrzejowicz (2)
(1) Institute of Informatics, Faculty of Mathematics, Physics and Informatics, University of Gdansk, 80-308 Gdansk, Poland
(2) Department of Information Systems, Gdynia Maritime University, 81-225 Gdynia, Poland
Correspondence should be addressed to Joanna Jedrzejowicz;
Received 27 March 2018; Revised 8 October 2018; Accepted 24 October 2018; Published 7 November 2018
Academic Editor: Vincent Labatut
Copyright © 2018 Joanna Jedrzejowicz and Piotr Jedrzejowicz. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
The paper proposes an incremental Gene Expression Programming classifier. Its main feature is a two-level ensemble
consisting of base classifiers in the form of genes and an upper-level classifier in the form of a metagene. The approach makes it
possible to deal with big datasets by controlling computation time through data reduction mechanisms. The user can control the number
of attributes used to induce base classifiers as well as the number of base classifiers used to induce metagenes. To optimize the
parameter setting phase, an approach based on Orthogonal Experiment Design principles is proposed, allowing for statistical
evaluation of the influence of different factors on classifier performance. In addition, the algorithm is equipped with a simple
mechanism for drift detection. A detailed description of the algorithm is followed by an extensive computational experiment. Its
results validate the approach and show that it compares favourably with several
state-of-the-art incremental classifiers.
1. Introduction
Learning from the environment through data mining remains
an important research challenge. Numerous approaches,
algorithms, and techniques have been proposed in recent
years to deal with data mining tasks. An important part of
these efforts focuses on mining big datasets and data streams.
The barriers posed by the sheer size of real-life datasets, on the one
side, and by constraints on the resources available for performing
the data mining task, including time and computational
resources, on the other, are not easy to overcome. Additional
complications, apart from the above-mentioned complexity
issues, are often encountered due to nonstationary environments.
One of the most effective approaches to mining big datasets and data streams is using online or incremental learners.
Online learning assumes dealing strictly with data streams.
Online learners should have the following properties [1]:
(i) Single-pass through the data.
(ii) Each example is processed very fast and in a constant
period of time.
(iii) Any-time learning: the classifier should provide the
best answer at every moment of time.
Incremental learning is understood as a slightly wider
concept than online learning. Incremental learners can deal not only with data streams but also
with big datasets stored in databases, for which the “one-by-one” or “chunk-by-chunk” approach could be more effective than traditional “batch” learners, even if no
concept drift has been detected. An important feature of
incremental learners is their ability to update the currently
used model using only newly available individual data instances, without having to reprocess all of the past instances.
In fact, using incremental learners is quite often the
only possible way to extract any meaningful knowledge.
Contemporary databases typically face a constant inflow
of new data instances. Hence, the knowledge discovered in
databases needs to be constantly updated, which is usually
an infeasible task for classic learners. Data streams, and even
stored datasets, may also be affected by the so-called concept drift.
In all of the above cases, online or incremental learners are needed.
In the paper, we propose a new version of the incremental
classifier based on Gene Expression Programming (GEP)
with data reduction and a metagene as the final, upper-level
classifier. Classifiers using GEP-induced expression trees
are known to produce satisfactory or very good results in
terms of classification accuracy. Our approach uses GEP-induced expression trees to construct learners able
to deal with large datasets and with the concept
drift phenomenon. The rest of the paper is organized as follows. Section 2 offers a brief survey of related results. Section 3 describes the new version of the proposed
approach. Section 4 contains a detailed description of the
validating computational experiment and a discussion of its
results, including suggestions on how to deal with real-life
datasets through the Orthogonal Experiment Design technique. Section 5 includes conclusions and ideas for future research.
2. Related Work
To meet the required properties of online learners, several
approaches and techniques have been proposed in the literature. The most successful ones include sampling, windowing,
and drift detection. Sampling means using only some of the data
instances, or some part of the instances, out of the available
dataset. In [14] a random sampling strategy with a probabilistic
removal of some instances from the training set was proposed. The idea was later extended in [15]. Some more
advanced sampling strategies were proposed in [16]. The effects of
the sampling strategy on classification accuracy were investigated
in [17].
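To make the idea of sampling with probabilistic removal concrete, the sketch below uses reservoir sampling, a standard technique that keeps a fixed-size uniform random sample of a stream by replacing stored instances with a decreasing probability. It is offered only as an illustration; whether it matches the strategies of the works cited above in every detail is an assumption, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream.

    Illustrative reservoir sampling; not taken from the cited papers.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # The n-th item enters the sample with probability k / n,
            # evicting a stored item chosen uniformly at random.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A learner trained on the returned sample sees at most `k` instances, no matter how long the stream is.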
As observed in the review [18], data sampling methods for machine learning have been investigated
for decades. According to that paper, recent
progress has been made in methods that can be broadly categorized into random sampling methods, including density-biased and
nonuniform sampling; active learning methods,
which are a type of semisupervised learning; and progressive sampling methods, which can be viewed as a combination of the two preceding approaches.
Closely related to sampling is the sliding window model.
A sliding window can be seen as a subset that runs over an
underlying collection. Several versions of the approach can be
found in [19–21]. The idea is that the analysis of the data stream is
based on recent instances only, and a limited number of
data instances, usually equal to the window size, is used to
induce a classifier. In machine learning, the concept can be
used for incremental mining of association rules [22]. Another interesting application of the sliding window technique is
known as high utility pattern mining [23].
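The windowing idea described above can be sketched in a few lines of Python. The class name `SlidingWindowMajority` is hypothetical, and a majority-class rule stands in for a real classifier; both are assumptions made only to keep the example short.

```python
from collections import Counter, deque


class SlidingWindowMajority:
    """Hypothetical learner induced from the most recent instances only."""

    def __init__(self, window_size):
        # A bounded deque drops the oldest instance automatically
        # once window_size instances are stored.
        self.window = deque(maxlen=window_size)

    def add(self, x, y):
        self.window.append((x, y))

    def induce(self):
        # Re-induce the model from the window content only.
        # Here the "model" is simply the majority class label.
        if not self.window:
            return None
        labels = Counter(y for _, y in self.window)
        return labels.most_common(1)[0][0]
```

Because older instances leave the window, the induced model follows the most recent data, which is the property that makes windowing useful when the underlying concept changes.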
For noisy environments or environments with a concept
drift the key question is when and how the current model
shoul (...truncated)