Incremental Gene Expression Programming Classifier with Metagenes and Data Reduction

Hindawi
Complexity
Volume 2018, Article ID 6794067, 12 pages
https://doi.org/10.1155/2018/6794067

Research Article

Joanna Jedrzejowicz (1) and Piotr Jedrzejowicz (2)

(1) Institute of Informatics, Faculty of Mathematics, Physics and Informatics, University of Gdansk, 80-308 Gdansk, Poland
(2) Department of Information Systems, Gdynia Maritime University, 81-225 Gdynia, Poland

Correspondence should be addressed to Joanna Jedrzejowicz.

Received 27 March 2018; Revised 8 October 2018; Accepted 24 October 2018; Published 7 November 2018

Academic Editor: Vincent Labatut

Copyright © 2018 Joanna Jedrzejowicz and Piotr Jedrzejowicz. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The paper proposes an incremental Gene Expression Programming classifier. Its main feature is a two-level ensemble consisting of base classifiers in the form of genes and an upper-level classifier in the form of a metagene. The approach makes it possible to deal with big datasets by controlling computation time through data reduction mechanisms. The user can control the number of attributes used to induce base classifiers as well as the number of base classifiers used to induce metagenes. To optimize the parameter setting phase, an approach based on the Orthogonal Experiment Design principles is proposed, allowing for statistical evaluation of the influence of different factors on the classifier performance. In addition, the algorithm is equipped with a simple mechanism for drift detection. A detailed description of the algorithm is followed by an extensive computational experiment, whose results validate the approach and show that it compares favourably with several state-of-the-art incremental classifiers.

1. Introduction

Learning from the environment through data mining remains an important research challenge. Numerous approaches, algorithms, and techniques have been proposed in recent years to deal with data mining tasks. An important part of these efforts focuses on mining big datasets and data streams. The barriers posed by the sheer size of real-life datasets, on one side, and by constraints on the time and computational resources available for the data mining task, on the other, are not easy to overcome. Apart from these complexity issues, additional complications often arise from nonstationary environments.

One of the most effective approaches to mining big datasets and data streams is using online or incremental learners. Online learning assumes dealing strictly with data streams. Online learners should have the following properties [1]:

(i) Single pass through the data.
(ii) Each example is processed very fast and in a constant period of time.
(iii) Any-time learning: the classifier should provide the best answer at every moment of time.

Incremental learning is understood as a slightly wider concept than online learning. Incremental learners can deal not only with data streams but also with big datasets stored in databases, for which the "one-by-one" or "chunk-by-chunk" approach could be more effective than traditional "batch" learners, even if no concept drift has been detected. An important feature of incremental learners is their ability to update the currently used model using only newly available data instances, without having to reprocess all of the past instances. In fact, using incremental learners is quite often the only possible way to extract any meaningful knowledge. A constant inflow of new data instances is typical of contemporary databases.
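The three online-learner properties listed above can be illustrated with a minimal sketch. It is not the GEP classifier proposed in this paper: the class `RunningCentroidClassifier` and the function `prequential_accuracy` are hypothetical names, and the nearest-centroid rule is an assumption chosen only because it is easy to update one instance at a time. Each instance is seen once, the cost per instance does not grow with the number of instances already processed, and a prediction is available at any moment.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Sequence, Tuple


class RunningCentroidClassifier:
    """Toy incremental learner (illustrative only): one running mean per class."""

    def __init__(self) -> None:
        self.counts: Dict[Hashable, int] = defaultdict(int)
        self.means: Dict[Hashable, List[float]] = {}

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        """Any-time prediction: nearest class centroid, or None before training."""
        if not self.means:
            return None
        return min(
            self.means,
            key=lambda c: sum((xi - mi) ** 2 for xi, mi in zip(x, self.means[c])),
        )

    def learn_one(self, x: Sequence[float], y: Hashable) -> None:
        """Update the model from a single new instance; past data is not revisited."""
        self.counts[y] += 1
        n = self.counts[y]
        mean = self.means.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            mean[i] += (xi - mean[i]) / n  # incremental mean update


def prequential_accuracy(
    stream: Iterable[Tuple[Sequence[float], Hashable]],
) -> float:
    """Single pass over the stream: test on each instance first, then train on it."""
    model = RunningCentroidClassifier()
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0
```

Calling `prequential_accuracy` on any iterable of `(attributes, label)` pairs returns the fraction of instances that were classified correctly before the model was trained on them.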
The knowledge discovered in databases therefore needs to be constantly updated, which is usually infeasible for classic learners. Data streams, and even stored datasets, may also be affected by so-called concept drift. In such cases, online or incremental learners are needed.

In this paper, we propose a new version of the incremental classifier based on Gene Expression Programming (GEP) with data reduction and a metagene as the final, upper-level classifier. Classifiers using GEP-induced expression trees are known to produce satisfactory or very good results in terms of classification accuracy. Our approach uses GEP-induced expression trees to construct learners able to deal with large datasets and with the concept drift phenomenon.

The rest of the paper is organized as follows. Section 2 offers a brief survey of related results. Section 3 describes the new version of the proposed approach. Section 4 contains a detailed description of the validating computational experiment and a discussion of its results, including suggestions on how to deal with real-life datasets through the Orthogonal Experiment Design technique. Section 5 includes conclusions and ideas for future research.

2. Related Work

To meet the required properties of online learners, several approaches and techniques have been proposed in the literature. The most successful ones include sampling, windowing, and drift detection. Sampling assumes using only some of the data instances, or some part of each instance, out of the available dataset. In [14] a random sampling strategy with probabilistic removal of some instances from the training set was proposed. The idea was later extended in [15]. More advanced sampling strategies were proposed in [16]. The effects of the sampling strategy on classification accuracy were investigated in [17].
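The general idea of keeping a bounded training set by randomly removing instances can be sketched as follows. This is not the strategy of [14] or of the later references, whose details are not reproduced here. The class name `BoundedSample` and the uniform random-replacement rule are assumptions made purely for illustration.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class BoundedSample(Generic[T]):
    """Fixed-capacity training set (hypothetical helper, illustrative only)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self._rng = random.Random(seed)

    def add(self, instance: T) -> None:
        """Store the new instance; when full, evict a uniformly chosen old one."""
        if len(self.items) < self.capacity:
            self.items.append(instance)
        else:
            # Assumed rule: every stored instance is equally likely to be removed.
            victim = self._rng.randrange(self.capacity)
            self.items[victim] = instance
```

Under this rule the newest instance is always retained, and the probability that an older instance is still stored decays geometrically with its age, so the sample is biased towards recent data while memory use stays constant.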
As observed in the review [18], data sampling methods for machine learning have been investigated for decades. According to that review, recent progress has been made in methods that can be broadly categorized into random sampling, including density-biased and nonuniform sampling methods; active learning methods, which are a type of semisupervised learning; and progressive sampling methods, which can be viewed as a combination of the two.

Closely related to sampling is the sliding window model. A sliding window can be seen as a subset that runs over an underlying collection. Several versions of the approach can be found in [19–21]. The idea is that analysis of the data stream is based on recent instances only, and a limited number of data instances, usually equal to the window size, is used to induce a classifier. In machine learning, the concept can be used for incremental mining of association rules [22]. Another interesting application of the sliding window technique is high utility pattern mining [23].

For noisy environments or environments with a concept drift the key question is when and how the current model shoul (...truncated)