Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning

Machine Learning, Apr 2012

Semi-naive Bayesian techniques seek to improve the accuracy of naive Bayes (NB) by relaxing the attribute independence assumption. We present a new type of semi-naive Bayesian operation, Subsumption Resolution (SR), which efficiently identifies occurrences of the specialization-generalization relationship and eliminates generalizations at classification time. We extend SR to Near-Subsumption Resolution (NSR) to delete near–generalizations in addition to generalizations. We develop two versions of SR: one that performs SR during training, called eager SR (ESR), and another that performs SR during testing, called lazy SR (LSR). We investigate the effect of ESR, LSR, NSR and conventional attribute elimination (BSE) on NB and Averaged One-Dependence Estimators (AODE), a powerful alternative to NB. BSE imposes very high training time overheads on NB and AODE accompanied by varying decreases in classification time overheads. ESR, LSR and NSR impose high training time and test time overheads on NB. However, LSR imposes no extra training time overheads and only modest test time overheads on AODE, while ESR and NSR impose modest training and test time overheads on AODE. Our extensive experimental comparison on sixty UCI data sets shows that applying BSE, LSR or NSR to NB significantly improves both zero-one loss and RMSE, while applying BSE, ESR or NSR to AODE significantly improves zero-one loss and RMSE and applying LSR to AODE significantly improves zero-one loss. The Friedman test and Nemenyi test show that AODE with ESR or NSR have a significant zero-one loss and RMSE advantage over Logistic Regression and a zero-one loss advantage over Weka’s LibSVM implementation with a grid parameter search on categorical data. AODE with LSR has a zero-one loss advantage over Logistic Regression and comparable zero-one loss with LibSVM. Finally, we examine the circumstances under which the elimination of near-generalizations proves beneficial.

Fei Zheng · Geoffrey I. Webb · Pramuditha Suraweera · Liguang Zhu

Editors: Mark Craven and Johannes Fürnkranz

1 Introduction

Naive Bayes (NB) is a simple, computationally efficient and effective probabilistic approach to classification learning (Domingos and Pazzani 1996; Mitchell 1997; Lewis 1998; Hand and Yu 2001). It has many desirable features, including the ability to handle missing values directly in a manner that minimizes information loss, learning in a single pass through the training data, support for incremental learning, and a lack of parameters, which avoids the need for parameter tuning. NB is built on the assumption of conditional independence between the attributes given the class. However, violations of this conditional independence assumption can render NB's classification sub-optimal.

We present Subsumption Resolution (SR), a new type of semi-naive Bayesian operation that identifies pairs of attribute-values such that one is a generalization of the other and deletes the generalization. SR can be applied at either training time or classification time. We show that this adjustment is theoretically correct and demonstrate experimentally that it can considerably improve both zero-one loss and RMSE. This paper provides a substantially expanded presentation of the SR technique, which was first presented in Zheng and Webb (2006) under the potentially misleading name Lazy Elimination.
The major extensions to the earlier paper include: two new subsumption resolution techniques, Eager Subsumption Resolution (ESR), which performs SR at training time, and Near-Subsumption Resolution (NSR), which extends the approach to near-generalizations; an exploration of the reasons for high percentages of generalizations on three data sets; an investigation of the situations under which the elimination of near-generalizations appears to be beneficial; a study of the effect of SR on RMSE in addition to zero-one loss; and an empirical comparison of SR applied to NB and AODE with Logistic Regression and Weka's LibSVM implementation.

Subsumption (De Raedt 2010a) is a central concept in Inductive Logic Programming (De Raedt 2010b), where it is used to identify generalization-specialization relationships between clauses and to support the process of unifying clauses. In this work we use it for an alternative purpose, the efficient identification and resolution of a specific form of extreme violation of the attribute-independence assumption.

The remainder of the paper is organized as follows. NB and AODE are introduced in the following sections. Section 4 introduces the BSE technique for feature selection with NB and AODE. The theoretical justification of SR and NSR is given in Sect. 5. NB and AODE with SR and NSR are detailed in Sect. 6. The computational complexities of all the variants of NB and AODE are presented in Sect. 7. Section 8 contains a detailed analysis of the effectiveness of all NB and AODE variants. The final section presents conclusions and future directions.

2 Naive Bayes (NB)

The task of supervised classification learning algorithms is to build a classifier from a labelled training sample such that the classifier can predict a discrete class label y ∈ {c1, . . . , ck} for a test instance x = ⟨x1, . . . , xn⟩, where xi is the value of the ith attribute Xi and ci is the ith value of the class variable Y. The Bayesian classifier (Duda and Hart 1973) performs classification by assigning x to argmax_y P(y | x). From the definition of conditional probability we have P(y | x) = P(y, x)/P(x). As P(x) is invariant across values of y, it follows that

\arg\max_y P(y \mid x) = \arg\max_y P(y, x).   (1)

Where estimates of P(y | x) are required rather than a simple classification, these can be obtained by normalization,

\hat{P}(y \mid x) = \frac{\hat{P}(y, x)}{\sum_{y'} \hat{P}(y', x)},   (2)

where \hat{P}(\cdot) represents an estimate of P(\cdot). For ease of explication, we describe NB and its variants by the manner in which each calculates the estimate \hat{P}(y, x). This estimate is then utilized with (1) or (2) to perform respectively classification or conditional probability estimation.

Naive Bayes (NB) (Kononenko 1990; Langley et al. 1992; Langley and Sage 1994) makes the assumption that the attributes are independent given the class and estimates P(y, x) by

\hat{P}(y, x) = \hat{P}(y) \prod_{i=1}^{n} \hat{P}(x_i \mid y).

NB is simple and computationally efficient. At training time, it generates a one-dimensional table of prior class probability estimates, indexed by class, and a two-dimensional table of conditional attribute-value probability estimates, indexed by class and attribute-value. If all attributes have discrete values this requires only a single scan of the training data. The time complexity of calculating the estimates is O(tn), where t is the number of training examples. The resulting space complexity is O(knv), where v is the mean number of values per attribute.
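To make the above concrete, the following is a minimal sketch (not the paper's Weka implementation) of NB training and joint-probability estimation for discrete attributes; the class name, data layout and the simplified m-estimate smoothing are illustrative assumptions.

```python
from collections import defaultdict

class NaiveBayesSketch:
    """Single-pass NB for discrete data: P_hat(y, x) = P_hat(y) * prod_i P_hat(x_i | y)."""

    def fit(self, X, y, m=0.1):
        self.m = m
        self.t = len(X)                              # number of training examples
        self.class_count = defaultdict(int)          # one-dimensional table, indexed by class
        self.cond_count = defaultdict(int)           # two-dimensional table, indexed by (class, attribute, value)
        self.values = defaultdict(set)               # observed values per attribute
        for xs, c in zip(X, y):
            self.class_count[c] += 1
            for i, v in enumerate(xs):
                self.cond_count[(c, i, v)] += 1
                self.values[i].add(v)
        return self

    def joint(self, xs, c):
        # m-estimate of P(y), with a uniform prior over the observed classes
        k = len(self.class_count)
        p = (self.class_count[c] + self.m / k) / (self.t + self.m)
        for i, v in enumerate(xs):
            vi = max(len(self.values[i]), 1)
            # m-estimate of P(x_i | y), with a uniform prior over the attribute's values
            p *= (self.cond_count[(c, i, v)] + self.m / vi) / (self.class_count[c] + self.m)
        return p

    def predict(self, xs):
        # classification via Eq. (1): arg max_y P_hat(y, x)
        return max(self.class_count, key=lambda c: self.joint(xs, c))
```

Usage would be along the lines of NaiveBayesSketch().fit(X, y).predict(x), where X is a list of discrete attribute-value lists and y the corresponding class labels.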
At classification time, to classify a single example has time complexity O(kn) using the tables formed at training time with space complexity O(knv). NB uses a fixed formula to perform classification, and hence there is no model selection. This may minimize the variance component of a classifiers error (Hastie et al. 2001). Since it only needs to update the probability estimates when a new training instance becomes available, it is suited to incremental learning. Although the attribute independence assumption is frequently unrealistic, NB has exhibited accuracy competitive with other learning algorithms for many tasks. 3 Averaged one-dependence estimators (AODE) Numerous techniques have sought to enhance the accuracy of NB by relaxing the attribute independence assumption. We refer to these as semi-naive Bayesian methods. Previous semi-naive Bayesian methods can be roughly subdivided into five groups. The first group uses a z-dependence classifier (Sahami 1996), in which each attribute depends upon the class and at most z other attributes. Within this framework, NB is a 0-dependence classifier. Examples include Tree Augmented Naive Bayes (TAN) (Friedman et al. 1997), Super Parent TAN (SP-TAN) (Keogh and Pazzani 1999), NBTree (Kohavi 1996), Lazy Bayesian Rules (LBR) (Zheng and Webb 2000) and Averaged One-Dependence Estimators (AODE) (Webb et al. 2005). The second group remedies violations of the attribute independence assumption by deleting strongly related attributes (Kittler 1986; Langley 1993; Pazzani 1996). Backwards Sequential Elimination (BSE) (Kittler 1986) uses a simple heuristic wrapper approach that seeks a subset of the available attributes that minimizes zero-one loss on the training set. This has proved to be beneficial in domains with highly correlated attributes. However, it has high computational overheads, especially on learning algorithms with high classification time complexity, as it applies the algorithms repeatedly until there is no accuracy improvement. Forward Sequential Selection (FSS) (Langley and Sage 1994) uses the reverse search direction to BSE. The third group applies NB to a subset of training instances (Langley 1993; Frank et al. 2003). Note that the second and third groups are not mutually exclusive. For example, NBTree and LBR classify instances by applying NB to a subset of training instances, and hence they can also be categorized to the third group. The fourth group performs adjustments to the output of NB without altering its direct operation (Hilden and Bjerregaard 1976; Webb and Pazzani 1998; Platt 1999; Zadrozny and Elkan 2001, 2002; Gama 2003). The fifth group introduces hidden variables to NB (Kononenko 1991; Pazzani 1996; Zhang et al. 2004, 2005; Langseth and Nielsen 2006). Domingos and Pazzani (1996) point out that interdependence between attributes will not affect NBs zero-one loss, so long as it can generate the correct ranks of conditional probabilities for the classes. However, the success of semi-naive Bayesian methods show that appropriate relaxation of the attribute independence assumption is effective. Further, in many applications it is desirable to obtain accurate estimates of the conditional class probability rather than a simple classification, and hence mere correct ranking will not suffice. Of the z-dependence classifier approaches to relaxing the attribute conditional independence assumption, those such as TAN, SP-TAN and AODE that restrict themselves to one-dependence classifiers readily admit to efficient computation. 
To avoid model selection while attaining the efficiency and efficacy of one-dependence classifiers, Averaged One-Dependence Estimators (AODE) (Webb et al. 2005) utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. A single attribute, called a super parent, is selected as the parent of all the other attributes in each ODE. In order to avoid unreliable base probability estimates, when classifying an instance x the original AODE excludes ODEs with parent xi where the frequency of the value xi is lower than limit m = 30, a widely used minimum on sample size for statistical inference purposes. However, subsequent research (Cerquides and Mántaras 2005) reveals that this constraint actually increases error and hence the current research uses m = 1.

For any attribute value xi, P(y, x) = P(y, xi)P(x | y, xi). As this holds for every xi, it also holds for the mean over any group of attribute values, in particular those whose frequency is at least m:

P(y, x) = \frac{\sum_{i : 1 \le i \le n \wedge F(x_i) \ge m} P(y, x_i) P(x \mid y, x_i)}{|\{i : 1 \le i \le n \wedge F(x_i) \ge m\}|},   (3)

where |·| denotes the cardinality of a set and F(xi) is the frequency of attribute-value xi in the training sample. AODE utilizes (3) and, for each ODE, an assumption that the attributes are independent given the class and the privileged attribute value xi, estimating P(y, x) by

\hat{P}(y, x) = \frac{\sum_{i : 1 \le i \le n \wedge F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j=1}^{n} \hat{P}(x_j \mid y, x_i)}{|\{i : 1 \le i \le n \wedge F(x_i) \ge m\}|}.

At training time AODE generates a three-dimensional table of probability estimates for each attribute-value, indexed by each other attribute-value and each class. The resulting space complexity is O(k(nv)²). The time complexity of forming this table is O(tn²), as an entry must be updated for every training case and every combination of two attribute-values for that case. Classification requires the tables of probability estimates formed at training time, which have space complexity O(k(nv)²). The time complexity of classifying a single example is O(kn²), as we need to consider each pair of qualified parent and child attributes within each class.

As AODE makes a weaker attribute conditional independence assumption than NB while still avoiding model selection, it has substantially lower bias with a very small increase in variance. Previous studies have demonstrated that it has a considerably lower bias than NB with moderate increases in variance and time complexity (Webb et al. 2005) and that AODE has a significant advantage in average error over many other semi-naive Bayesian algorithms, with the exceptions of LBR (Zheng and Webb 2000) and SP-TAN (Keogh and Pazzani 1999). It shares a similar level of average error with these two algorithms without the prohibitive training time of SP-TAN or test time of LBR (Zheng and Webb 2005). When a new instance is available, like NB, it only needs to update the probability estimates. Therefore, it is also suited to incremental learning.

Dash and Cooper (2002) present Exact Model Averaging with NB to efficiently average NB predictions over all possible attribute subsets. The difference between this method and AODE is that the former is a 0-dependence classifier that uses an attribute subset in each ensembled classifier and performs model averaging over all 2ⁿ possible attribute subsets, while the latter is a 1-dependence classifier that does not exclude any attributes in any ensembled classifier and performs model averaging over n possible super parents.
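As an illustration of the estimate above, here is a hedged sketch of AODE built from the count tables just described; it is not the paper's Weka implementation, and crude add-one smoothing stands in for the m-estimation used in the experiments.

```python
from collections import defaultdict

class AODESketch:
    """Average of one-dependence estimators: each qualified attribute value x_i
    (frequency F(x_i) >= m) acts as the super parent of one ODE."""

    def fit(self, X, y, m=1):
        self.m = m
        self.t, self.n = len(X), len(X[0])
        self.classes = sorted(set(y))
        self.freq = defaultdict(int)      # F(x_i), indexed by (attribute, value)
        self.single = defaultdict(int)    # counts indexed by (class, attribute, value)
        self.pair = defaultdict(int)      # three-dimensional table: (class, parent, parent value, child, child value)
        for xs, c in zip(X, y):           # one pass, O(t n^2) updates as noted in the text
            for i, vi in enumerate(xs):
                self.freq[(i, vi)] += 1
                self.single[(c, i, vi)] += 1
                for j, vj in enumerate(xs):
                    self.pair[(c, i, vi, j, vj)] += 1
        return self

    def joint(self, xs, c):
        parents = [i for i in range(self.n) if self.freq[(i, xs[i])] >= self.m]
        if not parents:
            return 0.0                    # a full implementation would fall back to an NB-style estimate here
        total = 0.0
        for i in parents:
            # P_hat(y, x_i), crudely smoothed (the paper uses m-estimation)
            p = (self.single[(c, i, xs[i])] + 1.0) / (self.t + len(self.classes))
            for j in range(self.n):
                # P_hat(x_j | y, x_i), crudely smoothed
                p *= (self.pair[(c, i, xs[i], j, xs[j])] + 1.0) / (self.single[(c, i, xs[i])] + 1.0)
            total += p
        return total / len(parents)
```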
4 Backwards sequential elimination

One approach to repairing harmful interdependencies is to remove highly correlated attributes. Backwards Sequential Elimination (BSE) (Kittler 1986) selects a subset of attributes using leave-one-out cross validation zero-one loss as a selection criterion. Starting from the full set of attributes, BSE operates by iteratively removing successive attributes, each time removing the attribute whose elimination best reduces training set zero-one loss. This process is terminated if there is no zero-one loss improvement. BSE does not support incremental learning as it has to reselect the subset of attributes when a new training instance becomes available.

4.1 NB with BSE

NB with BSE (NBBSE) selects a subset of attributes using leave-one-out cross validation zero-one loss on NB as a selection criterion and applies NB to the new attribute set. The subset of selected attributes is denoted as L. Independence is assumed among the resulting attributes given the class. Hence, NBBSE estimates P(y, x) by

\hat{P}(y, x) = \hat{P}(y) \prod_{x_i \in L} \hat{P}(x_i \mid y).

At training time NBBSE generates a two-dimensional table of probability estimates as NB does. As it performs leave-one-out cross validation to select the subset of attributes, it must also store the training data, with additional space complexity O(tn). Keogh and Pazzani (1999) speed up the process of evaluating the classifiers by using a two-dimensional table, indexed by instance and class, to store the probability estimates, with space complexity O(tk). Since k is usually much less than n, the resulting space complexity is O(tn + knv). The time complexity of a single leave-one-out cross validation is reduced from O(tkn) to O(tk) by using the speed-up strategy, and the total time complexity of attribute selection is O(tkn²), as leave-one-out cross validation will be performed at most O(n²) times. NBBSE has identical time and space complexity to NB at classification time.

4.2 AODE with BSE

In the context of AODE, BSE uses leave-one-out cross validation zero-one loss on AODE as the deletion criterion, and averages the predictions of all qualified classifiers using the resulting attribute set. Because attributes play multiple roles, either parent or child, in an AODE model, there are four types of attribute elimination for AODE (Zheng and Webb 2007). To formalize the various attribute elimination strategies we introduce into AODE the use of a parent (p) and a child (c) set, each of which contains the set of indices of attributes that can be employed in respectively a parent or child role in AODE. All four types of attribute elimination start with p and c initialized to the full set. The first approach, called parent elimination (PE), deletes attribute indexes from p, effectively deleting a single ODE at each step. The second approach, called child elimination (CE), deletes attribute indexes from c, effectively deleting an attribute from every ODE at each step. Parent and child elimination (P∧CE) (Zheng and Webb 2006) at each step deletes the same value from both p and c, thus eliminating it from use in any role in the classifier. Parent or child elimination (P∨CE) performs any one of the other types of attribute elimination in each iteration, selecting the option that best reduces zero-one loss.
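Before turning to how these variants estimate P(y, x), the generic BSE search of Sects. 4 and 4.1 can be sketched as follows. This is an illustrative sketch, not the paper's implementation: loo_zero_one_loss is an assumed callback that retrains and evaluates the base classifier (NB or AODE) on an attribute subset by leave-one-out cross validation, and the speed-up tables described above are omitted.

```python
def backwards_sequential_elimination(n_attributes, loo_zero_one_loss):
    """BSE: starting from all attributes, repeatedly delete the attribute whose
    removal most reduces leave-one-out zero-one loss; stop when no deletion helps."""
    attrs = set(range(n_attributes))
    best_loss = loo_zero_one_loss(attrs)
    while attrs:
        best_candidate = None
        for a in sorted(attrs):
            loss = loo_zero_one_loss(attrs - {a})
            if loss < best_loss:          # strict improvement required
                best_loss, best_candidate = loss, a
        if best_candidate is None:
            break                         # terminate: no zero-one loss improvement
        attrs.remove(best_candidate)
    return attrs
```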
These four types of attribute elimination for AODE estimate P(y, x) by

\hat{P}(y, x) = \frac{\sum_{i \in p : F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j \in c} \hat{P}(x_j \mid y, x_i)}{|\{i : i \in p \wedge F(x_i) \ge m\}|}.

Zheng and Webb reported that the types of attribute elimination that remove child attributes from within the constituent ODEs can significantly reduce bias and error, but only if a statistical test is employed to provide variance management.¹ In this paper, of the strategies that use child elimination, we select P∧CE, as it leads to more efficient classification. We use AODEBSE to indicate AODE with P∧CE.

¹A standard binomial sign test is used to assess whether an improvement is significant. Treating the examples for which an attribute deletion corrects a misclassification as a win and one for which it misclassifies a previously correct example as a loss, a change is accepted if the number of wins exceeds the number of losses and the probability of obtaining the observed number of wins and losses if they were equiprobable was no more than 0.05.

At training time AODEBSE generates a three-dimensional table of probability estimates, as AODE does. A three-dimensional table, indexed by instance, class and attribute, is introduced to speed up the process of evaluating the classifiers, with space complexity O(tkn). Therefore, the resulting space complexity is O(tkn + k(nv)²). Deleting attributes has time complexity of O(tkn³), as a single leave-one-out cross validation is order O(tkn) and it is performed at most O(n²) times. AODEBSE has identical time and space complexity to AODE at classification time.

5 Related attribute-values and subsumption resolution

This section introduces an extreme type of interdependence between attribute values and presents adjustments for such an interdependence relationship.

5.1 The generalization, substitution and duplication relationships

One extreme type of inter-dependence between attributes results in a value of one being a generalization of a value of the other. For example, let Gender and Pregnant be two attributes. Gender has two values: female and male, and Pregnant has two values: yes and no. If Pregnant = yes, it follows that Gender = female. Therefore, Gender = female is a generalization of Pregnant = yes. Likewise, Pregnant = no is a generalization of Gender = male. We formalize this relationship as:

Definition 1 (Generalization and specialization) For two attribute values xi and xj, if P(xj | xi) = 1.0 then xj is a generalization of xi and xi is a specialization of xj.

In the special case when xi is both a generalization and a specialization of xj, xi is a substitution of xj.

Definition 2 (Substitution) For two attribute values xi and xj, if P(xj | xi) = 1.0 and P(xi | xj) = 1.0, xi is a substitution of xj and so is xj of xi. For two attributes Xi and Xj, we say that Xi is a substitution of Xj if the following condition holds: ∀a ∃b P(xjb | xia) = P(xia | xjb) = 1.0, where a ∈ {1, . . . , |Xi|}, b ∈ {1, . . . , |Xj|}, xia is the ath value of Xi and xjb is the bth value of Xj.

Definition 3 (Duplication) For two attribute values xi and xj, if xi is a substitution of xj and xi = xj then xi is a duplication of xj. For two attributes Xi and Xj, we say that Xi is a duplication of Xj if xi = xj for all instances.

In Table 1(a), because P(Xj = 0 | Xi = 0) = 1.0 and P(Xi = 1 | Xj = 1) = 1.0, Xj = 0 is a generalization of Xi = 0 and Xi = 1 is a generalization of Xj = 1. Table 1(b) illustrates an example of substitution. P(Xj = 2 | Xi = 0) = 1.0 and P(Xi = 0 | Xj = 2) = 1.0, hence Xj = 2 is a substitution of Xi = 0 and so is Xi = 0 of Xj = 2.
Likewise, Xj = 0 is a substitution of Xi = 1 and so is Xi = 1 of Xj = 0. As both Xi = 0 and Xi = 1 have substitutions, Xi is a substitution of Xj. As illustrated in Table 1(c), Xi is a duplication of Xj.

It is interesting that the specialization-generalization relationship can be characterized in terms of the definitions of Generalization, Specialization, Substitution and Duplication. A duplication is a special form of substitution. A substitution is a generalization that is also a specialization. This relationship is illustrated in Fig. 1.

Fig. 1 Relationship between Duplication, Substitution, Specialization and Generalization

The generalization relationship is very common in the real world. For example, City = Melbourne is a specialization of Country = Australia and CountryCode = 61 is a substitution of Country = Australia. Given an example with City = Melbourne, Country = Australia and CountryCode = 61, NB will effectively give three times the weight to evidence relating to Country = Australia relative to the situation if only one of these attributes were considered. Eliminating such redundancy may reduce NB's zero-one loss and improve the accuracy of its probability estimates. The next section is devoted to resolving this problem.

5.2 Subsumption resolution (SR) and near-subsumption resolution (NSR)

Subsumption Resolution (SR) (Zheng and Webb 2006) identifies pairs of attribute values such that one appears to subsume (be a generalization of) the other and deletes the generalization. Near-Subsumption Resolution (NSR) is a variant of SR. It extends SR by deleting not only generalizations but also near-generalizations.

5.2.1 Subsumption resolution (SR)

Theorem If xj is a generalization of xi, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j, then P(y | x1, . . . , xn) = P(y | x1, . . . , xj−1, xj+1, . . . , xn).

Proof Note that for any Z, given P(xj | xi) = 1.0, it follows that P(Z | xi, xj) = P(Z | xi) and hence P(xi, xj, Z) = P(xi, Z). Therefore,

P(y \mid x_1, \ldots, x_n) = \frac{P(y, x_1, \ldots, x_n)}{P(x_1, \ldots, x_n)} = \frac{P(y, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)}{P(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)} = P(y \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n).

Given P(y | x1, . . . , xn) = P(y | x1, . . . , xj−1, xj+1, . . . , xn) and x1, . . . , xj−1, xj+1, . . . , xn are observed, deleting the generalization xj from a Bayesian classifier should not be harmful. Further, such deletion may improve a classifier's estimates if the classifier makes unwarranted assumptions about the relationship of xj to the other attributes when estimating intermediate probability values, such as NB's independence assumption.

Table 2 A hypothetical example

To illustrate this, consider the data presented in Table 2 for a hypothetical example with three attributes Gender, Pregnant and MaleHormone and class Normal. Pregnant = yes is a specialization of Gender = female and Gender = male is a specialization of Pregnant = no. As these two attributes are highly related, NB will misclassify the object Gender = male, Pregnant = no, MaleHormone = 3 as Normal = no, even though it occurs in the training data. In effect NB double counts the evidence from Pregnant = no, due to the presence of its specialization Gender = male. The new object can be correctly classified as Normal = yes by deleting attribute value Pregnant = no. In contrast, if Gender = female we cannot make any definite conclusion about the value of Pregnant, nor about the value of Gender if Pregnant = no. If both of these values (Gender = female and Pregnant = no) are present, deleting either one will lose information.
Therefore, if neither attribute-value is a generalization of the other, both should be used for classification. In the case when xi is a substitution of xj (P(xj | xi) = 1.0 and P(xi | xj) = 1.0), only one of the two attribute-values should be used for classification. Note that simple attribute selection, such as BSE, cannot resolve such interdependencies, as for some test instances one attribute should be deleted, for other test instances a different attribute should be deleted, and for still further test instances no attribute should be deleted.

For a test instance x = ⟨x1, . . . , xn⟩, after SR the resulting attribute set consists of non-generalization attributes and substitution attributes. We denote the set of indices of attributes that are not a generalization of any other attribute as

G = \{i \mid 1 \le i \le n, \neg\exists j : 1 \le j \le n \wedge i \ne j \wedge P(x_i \mid x_j) = 1\}.   (4)

For substitutions, we keep the attribute with the smallest index and delete the other attributes. For instance, if x1, x3 and x4 are substitutions of each other, we only use x1 for classification. We denote the set of indices of the resulting substitutions as

S = \bigcup_{i=1}^{n} \min(S_i),   (5)

where

S_i = \{j \mid 1 \le j \le n, i \ne j, P(x_i \mid x_j) = 1 \wedge P(x_j \mid x_i) = 1\}   (6)

and min(Si) is ∅ if Si = ∅ and the smallest index in Si otherwise. The set of indices of the resulting attribute subset is G ∪ S.

SR requires a method for inferring from the training data whether one attribute value is a generalization of another. It uses the criterion |T_{xi}| = |T_{xi,xj}| ≥ l to infer that xj is a generalization of xi, where |T_{xi}| is the number of training cases with value xi, |T_{xi,xj}| is the number of training cases with both values, and l is a user-specified minimum frequency.

5.2.2 Near-subsumption resolution (NSR)

It is possible that noisy or erroneous data might prevent detection of a specialization-generalization relationship. Further, as we can only infer whether a specialization-generalization relationship exists, it is likely that in some cases we assume one does when in fact the relationship is actually a near specialization-generalization relationship. In consequence, we investigate deletions of near-generalizations as well.

Definition 4 (Near-generalization, near-specialization and near-substitution) For two attribute values xi and xj, if P(xj | xi) ≥ P(xi | xj) and P(xj | xi) ≈ 1.0, we say that xj is a near-generalization of xi and xi is a near-specialization of xj. If P(xj | xi) = P(xi | xj) and P(xj | xi) ≈ 1.0, we say that xj is a near-substitution of xi and so is xi of xj.

In this research, P(xj | xi) is used to estimate how closely xi and xj approximate the specialization-generalization relationship. Let r be a user-specified lower bound, 0 ≤ r ≤ 1.0. If P(xj | xi) ≥ P(xi | xj) and 1.0 ≥ P(xj | xi) ≥ r, xj is a near-generalization of xi. As

P(x_i, Z) - \frac{1 - P(x_j \mid x_i)}{P(x_i)} \le P(x_i, x_j, Z) \le P(x_i, Z) + \frac{1 - P(x_j \mid x_i)}{P(x_i)},

when P(xj | xi) ≈ 1.0 we have P(xi, xj, Z) ≈ P(xi, Z), and hence P(y | x1, . . . , xn) ≈ P(y | x1, . . . , xj−1, xj+1, . . . , xn). If an appropriate r is selected, removing xj might positively affect a Bayesian classifier. However, in the absence of domain specific knowledge, there does not appear to be any satisfactory a priori method to select an appropriate value for r. Deleting weak near-generalizations might prove effective on some data sets, while only eliminating strong near-generalizations may prove more desirable on other data sets.
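The detection criteria of Sects. 5.2.1 and 5.2.2 reduce to simple frequency comparisons. The sketch below is illustrative (names and count-table layout are assumptions, with the joint table assumed to hold both orderings of each pair): for one test instance it returns the indices whose values are inferred to be generalizations (r = 1.0) or near-generalizations (r < 1.0) of some other instantiated value, keeping the smallest index of a (near-)substitution group as described above.

```python
def subsumed_value_indices(xs, single, joint, l=100, r=1.0):
    """single[(i, v)] is |T_{x_i}|; joint[((i, vi), (j, vj))] is |T_{x_i, x_j}|.
    r = 1.0 gives the SR criterion |T_{x_i}| = |T_{x_i,x_j}| >= l; r < 1.0 gives the NSR
    criterion |T_{x_j}| >= |T_{x_i}|, |T_{x_i,x_j}| >= r*|T_{x_i}| and |T_{x_i,x_j}| >= l."""

    def generalizes(j, i):
        # does the value of attribute j (near-)generalize the value of attribute i?
        ti, tj = single[(i, xs[i])], single[(j, xs[j])]
        tij = joint[((i, xs[i]), (j, xs[j]))]
        return tij >= l and tj >= ti and tij >= r * ti

    n, drop = len(xs), set()
    for j in range(n):
        for i in range(n):
            if i == j or not generalizes(j, i):
                continue
            if generalizes(i, j):
                if j > i:
                    drop.add(j)      # (near-)substitution: keep only the smallest index
            else:
                drop.add(j)          # strict (near-)generalization: delete it
    return drop
```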
One practical approach to selecting r is to perform a parameter search, finding the value with the lowest leave-one-out cross validation zero-one loss. To provide variance management, a statistical test can be used to assess whether a zero-one loss reduction resulting from using an r value is significant. SR can be simply extended to handle the near specialization-generalization relationship by using the criterion

|T_{x_j}| \ge |T_{x_i}| \;\wedge\; |T_{x_i, x_j}| \ge r \cdot |T_{x_i}| \;\wedge\; |T_{x_i, x_j}| \ge l

to infer that xj is a near-generalization or perfect generalization (when |T_{xi,xj}| = |T_{xi}|) of xi. G (4) can be extended to NG, the set of indices of attributes that are not a near-generalization or perfect generalization of any other attribute, by substituting P(xi | xj) ≥ r for P(xi | xj) = 1. S (5) can be extended to NS, the set of indices of the resulting near-substitutions or perfect substitutions, by substituting P(xi | xj) ≥ r ∧ P(xj | xi) ≥ r for P(xi | xj) = 1 ∧ P(xj | xi) = 1 in (6). This extension is called Near-Subsumption Resolution (NSR).

6 NB and AODE with SR and NSR

Attribute values that are subsumed by others can be identified either at training time or at classification time. Eager learning, which identifies subsumed attribute-values during training time, transforms the data prior to training the classifier and is independent of the classification algorithm. On the other hand, lazy learning deletes attributes at classification time based on the attribute-values that are instantiated in the instance being classified. Although this is suited to probabilistic techniques, such as NB and AODE, it is not suited to similarity-based techniques, such as k-nearest neighbours. This can be illustrated by a simple example in which there are two attributes Pregnant and Gender, the test instance is ⟨Gender = female, Pregnant = yes⟩ and the distance between two instances is defined as the number of attributes that have different values. The distance between the test instance and ⟨Gender = female, Pregnant = no⟩ is one and that between the test instance and ⟨Gender = male, Pregnant = no⟩ is two. In such a case, attribute Gender is important for measuring similarity, and hence deleting the attribute value Gender = female from the test instance is clearly not correct, as doing so would make both distances equal to one.

6.1 Lazy subsumption resolution

The lazy versions of SR (LSR) and NSR delay the computation of elimination until classification time. They delete different attributes depending upon which attribute values are instantiated in the object being classified, that is, different attributes may be used to classify different test instances. Consequently, LSR can only be applied to algorithms which can use different attributes for different test instances. When LSR is applied to NB or AODE, the resulting classifier acts as NB or AODE except that it deletes generalization attribute-values if a specialization is detected. We denote NB and AODE with Lazy Subsumption Resolution as NBLSR and AODELSR respectively. As LSR eliminates highly dependent attribute values in a lazy manner, it does not interfere with NB and AODE's capacity for incremental learning. Classification of instance x = ⟨x1, . . . , xn⟩ consists of two steps:

1. Set R to G ∪ S (refer to (4) and (5)).
2. Estimate P(y, x) by

\hat{P}(y, x) = \hat{P}(y) \prod_{i \in R} \hat{P}(x_i \mid y)

for NBLSR, and by

\hat{P}(y, x) = \frac{\sum_{i \in R : F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j \in R} \hat{P}(x_j \mid y, x_i)}{|\{i \in R : F(x_i) \ge m\}|}

for AODELSR, where F(xi) is the frequency of xi and m is the minimum frequency to accept xi as a super parent.
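A lazy SR wrapper therefore needs only these two steps at classification time. The sketch below applies them to NB, reusing subsumed_value_indices from the previous sketch and assuming an NB object with the count tables and m-estimate of the earlier NB sketch; AODELSR differs only in which estimate is computed over the retained indices.

```python
def classify_nb_lsr(nb, xs, single, joint, l=100):
    """Step 1: R = indices retained after deleting generalized values of this instance.
    Step 2: classify with the unchanged NB tables restricted to R."""
    drop = subsumed_value_indices(xs, single, joint, l)      # lazy: depends on the test instance
    retained = [i for i in range(len(xs)) if i not in drop]

    def joint_estimate(c):
        k = len(nb.class_count)
        p = (nb.class_count[c] + nb.m / k) / (nb.t + nb.m)   # P_hat(y)
        for i in retained:
            vi = max(len(nb.values[i]), 1)
            p *= (nb.cond_count[(c, i, xs[i])] + nb.m / vi) / (nb.class_count[c] + nb.m)
        return p

    return max(nb.class_count, key=joint_estimate)
```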
NBLSR generates at training time a two-dimensional table of probability estimates for each attribute-value, conditioned by each other attribute-value, in addition to the two probability estimate tables generated by NB, resulting in a space complexity of O(knv + (nv)²). The time complexity of forming the additional two-dimensional probability estimate table is O(tn²). Classification of a single example requires considering each pair of attributes to detect dependencies and is of time complexity O(n² + kn). The space complexity is O(knv + (nv)²).

AODELSR has identical time and space complexity to AODE. At training time it behaves identically to AODE. At classification time, it must check all attribute-value pairs for generalization relationships, an additional operation of time complexity O(n²). However, the time complexity of AODE at classification time is O(kn²) and so this additional computation does not increase the overall time complexity.

When NSR is applied to NB or AODE, R is extended to NG ∪ NS (refer to Sect. 5.2.2). If r is pre-selected, this extension does not incur any additional computational complexity compared to the original LSR as it only changes the criterion to accept the relationship. However, if we perform a parameter search to select r by using leave-one-out cross validation, the training time complexity of NBNSR and AODENSR will be O(tn² + tnk) and O(tkn²) respectively. This also incurs an additional space complexity O(tn) to store the training data.

6.2 Eager subsumption resolution

Eager subsumption resolution (ESR) eliminates subsumed attribute-values at training time by transforming the training data. They are identified using a two-dimensional table of probability estimates for each attribute-value, conditioned by each other attribute-value. Each attribute value xj that subsumes another value xi (that is, for which P(xj | xi) = 1) is removed from the training data by replacing Xi and Xj with a single attribute XiXj whose values are all combinations of values of Xi and Xj except for those for which P(xi, xj) = 0.

The condition for merging two attributes can be relaxed further, by allowing two attributes Xi and Xj to be merged if they have any values xi and xj such that P(xi, xj) = 0. This condition is equivalent to the first if the domain of Xj has two attribute values. On the other hand, it is more relaxed in the case where Xj has more than two values. We evaluated the effectiveness of both variants.

Table 3 A hypothetical example

To illustrate the difference between these two conditions, consider the data presented in Table 3, which contains three attributes TopLeft, TopMiddle, TopRight and class Class. Based on the data, TopRight and TopLeft satisfy both subsumption criteria as P(TopRight = x | TopLeft = x) = 1 and P(TopRight = o | TopLeft = x) = 0. However, TopLeft and TopMiddle only satisfy the latter criterion as P(TopMiddle = x | TopLeft = o) = 0, P(TopMiddle = o | TopLeft = o) ≠ 1 and P(TopMiddle = b | TopLeft = o) ≠ 1.

The ESR algorithm repeatedly merges attributes until no further merges are possible. During each iteration, all attribute pairs Xi, Xj are identified that satisfy the subsumption criteria and for which the frequencies of all their attribute-value pairs are either 0 or greater than a pre-defined minimum frequency m. If multiple candidates are found, the pair Xi, Xj with the highest information gain ratio is merged. This process is repeated until no further Xi, Xj pairs are found.
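The ESR filter can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the strict criterion (some value of Xi forcing the value of Xj) is estimated from raw counts, and the paper's gain-ratio tie-breaking is replaced by first-found order for brevity.

```python
from collections import Counter

def esr_merge_pass(X, min_freq=100):
    """One ESR pass over the training data X (a list of attribute-value lists):
    find a pair (i, j) where some sufficiently frequent value of X_i always
    co-occurs with a single value of X_j, and merge the two attributes into one
    whose values are the observed value pairs.
    Returns the transformed data and the merged pair, or (X, None) if no merge applies."""
    n = len(X[0])
    for i in range(n):
        ci = Counter(row[i] for row in X)
        for j in range(n):
            if i == j:
                continue
            cij = Counter((row[i], row[j]) for row in X)
            if any(ci[vi] >= min_freq and cij[(vi, vj)] == ci[vi] for (vi, vj) in cij):
                merged = []
                for row in X:
                    combined = (row[i], row[j])      # the combined attribute X_i X_j
                    rest = [v for k, v in enumerate(row) if k not in (i, j)]
                    merged.append([combined] + rest)
                return merged, (i, j)
    return X, None

# Repeating the pass until it returns (X, None) mirrors the "merge until no
# further merges are possible" loop of the ESR algorithm.
```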
The data transformation is implemented as a filter that is applied to the training data. The trained filter is applied to each test instance prior to classification. Thus the transformation is transparent to the classification algorithm. In the case of applying ESR to NB or AODE, the conditional probabilities are estimated based on the transformed data, and the posterior probabilities are also calculated based on the transformed data. Consequently, neither NB nor AODE requires any modifications. At training time NBESR requires a two-dimensional table of probability estimates for each attribute-value conditioned by each other attribute-value in addition to the probability estimate tables of NB. This results in an overall space complexity of O(knv + (nv)2). The time complexity of forming this table is O(tn2). As the ESR algorithm repeatedly merges attributes, the worse case overall time complexity is O(tn2). Classification of a single example in NBESR does not have any effect on the time or space complexity of NB. In the case of AODE, the space complexity of AODEESR is identical to AODE. Its worst case training time complexity is O(tn2). The classification time and space complexities of AODEESR are identical to AODE. 7 Complexity summary Table 4 summarizes the complexity of each of the algorithms discussed. We display the time complexity and the space complexity of each algorithm for each of training time and classification time. AODEBSE has the highest training time complexity that is cubic in the number of attributes. However, it may in practice find parent and child sets with less computation due to the statistical test employed. NBBSE has the second highest training time complexity and lowest classification time complexity. It can efficiently classify test instances once the models are generated. The training time of all variants is linear with respect to number of training examples. When classification time is of major concern, NBBSE, NBESR and NB may excel. NBLSR, NBNSR, AODE and all its variants have high classification time when the number of attributes is large, for example, in text classification. Nonetheless, for many classification tasks with moderate or small number of attributes, their classification time complexity is modest. AODEBSE and AODENSR have relatively high training space complexity. Table 4 Computational complexity k is the number of classes n is the number of attributes t is the number of training examples v is the mean number of values for an attribute 8 Empirical study To evaluate the efficacy of ESR, LSR and NSR, we compare NB and AODE with and without ESR, LSR or NSR using the bias and variance definitions of Kohavi and Wolpert (1996) together with the repeated cross-validation bias-variance estimation method proposed by Webb (2000) on sixty natural domains from the UCI Repository (Newman et al. 1998). In order to maximize the variation in the training data from trial to trial we use twofold cross validation. We also compare these methods to NB and AODE with BSE, logistic regression (LR) and LibSVM. As we cannot obtain results of LibSVM on the two largest data sets (Covertype and Census-Income (KDD)), the comparison in Sect. 8.5 only includes 58 data sets. Table 5 summarizes the characteristics of each data set, including the number of instances, attributes and classes. Algorithms are implemented in the Weka workbench (Witten and Frank 2005). 
Experiments on all algorithms except LR and LibSVM were executed on a 2.33 GHz Intel(R) Xeon(R) E5410 Linux computer with 4 GB RAM, and those on LR and LibSVM were executed on a Linux cluster based on 2.8 GHz Xeon CPUs. The base probabilities were estimated using m-estimation (m = 0.1) (Cestnik 1990).² When we use MDL discretization (Fayyad and Irani 1993) to discretize quantitative attributes within each cross-validation fold, many quantitative attributes have only one value. Attributes with only one value do not provide information for classification, and hence we discretize quantitative attributes using 3-bin equal frequency discretization. In order to allow the techniques to be compared with Weka's LR and LibSVM, missing values for qualitative attributes are replaced with modes and those for quantitative attributes are replaced with means from the training data.

²As m-estimation often appears to lead to more accurate probabilities than Laplace estimation, this paper uses m-estimation to estimate the base probabilities. Therefore, the results presented here may differ from those of Zheng and Webb (2006, 2007), which use Laplace estimation.

Table 5 Data sets

8.1 Minimum frequency for identifying generalizations for LSR

As there does not appear to be any formal method to select an appropriate value for l, we perform an empirical study to select it. We present the zero-one loss and RMSE results in the range of l = 10 to l = 150 with an increment of 10.

Fig. 2 Averaged zero-one loss across 60 data sets, as a function of l
Fig. 3 Averaged RMSE across 60 data sets, as a function of l

Mean zero-one loss and RMSE. Averaged results across all data sets provide a simplistic overall measure of relative performance. We present the averaged zero-one loss and RMSE of NBLSR and AODELSR across 60 data sets as a function of l in Figs. 2 and 3. In order to provide comparison with NB and AODE, we also include NB's and AODE's zero-one loss and RMSE in each graph. For all settings of l, NBLSR and AODELSR enjoy lower mean zero-one loss and RMSE compared to NB and AODE respectively.

Zero-one loss. Table 6 presents the win/draw/loss records of zero-one loss for NB against NBLSR and AODE against AODELSR. We assess a difference as significant if the outcome of a one-tailed binomial sign test is less than 0.05. Boldface numbers indicate that wins against losses are statistically significant. NBLSR enjoys a significant advantage in zero-one loss over NB when 30 ≤ l ≤ 150. The advantage of AODELSR over AODE is statistically significant when 70 ≤ l ≤ 150.

RMSE. The win/draw/loss records of RMSE for NB against NBLSR and AODE against AODELSR are presented in Table 7. An advantage to NBLSR over NB is evident for all evaluated settings of l (i.e. 10 ≤ l ≤ 150). AODELSR has a significant RMSE advantage over AODE for 40 ≤ l ≤ 70 and 130 ≤ l ≤ 150. AODELSR also enjoys a nearly significant RMSE advantage (p < 0.1) for 80 ≤ l ≤ 110.
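The one-tailed binomial sign test used for these win/draw/loss comparisons (and for the acceptance rule of footnote 1) is simply a binomial tail probability, as in the following small helper:

```python
from math import comb

def sign_test_p(wins, losses):
    """One-tailed binomial sign test: probability of observing at least `wins`
    wins out of wins + losses trials if wins and losses were equiprobable."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Example: the 38/2/20 win/draw/loss record mentioned in Sect. 8.3 gives
# sign_test_p(38, 20), which is well below the 0.05 significance threshold.
```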
Minimum frequency selection. A larger value of l can reduce the risk of incorrectly inferring that one value subsumes another, but at the same time reduces the number of true generalizations that are detected. The setting l = 100 for NBLSR has a significant zero-one loss and RMSE advantage over NB. The setting l = 100 also gives AODELSR a significant zero-one loss advantage over AODE. The RMSE advantage of AODELSR with setting l = 100 is nearly significant (p = 0.08). Consequently, the setting l = 100 is selected in our current work. In the earlier paper (Zheng and Webb 2006), where Laplace estimation is employed, 30 is used as the minimum frequency because it is a widely used heuristic for the minimum number of examples from which an inductive inference should be drawn. In fact, in other unreported experiments we have performed, when using Laplace estimation, NBLSR and AODELSR have a significant zero-one loss advantage over NB and AODE respectively at all settings of l except 10 and 20. NB's and AODE's RMSE can be significantly reduced by the addition of LSR at all settings of l.

Table 6 Win/Draw/Loss comparison of zero-one loss
Table 7 Win/Draw/Loss comparison of RMSE
Table 8 Win/draw/loss: NBsESR, NBrESR vs. NB and NBsESR vs. NBrESR
Table 9 Win/draw/loss: AODEsESR, AODErESR vs. AODE and AODEsESR vs. AODErESR

8.2 Attribute merging criterion for ESR

As explained in Sect. 6.2, ESR repeatedly merges pairs of attributes that satisfy the subsumption criteria of having attribute-values xi and xj that either satisfy P(xj | xi) = 1 or P(xj, xi) = 0. All pairs of attributes that satisfy P(xj | xi) = 1 also have an attribute-value xk that satisfies P(xk, xi) = 0. However, the reverse may not be true if one of the two attributes has more than two values. We refer to the version of ESR that merges attributes if two attribute values satisfy P(xj | xi) = 1 as strict ESR (ESRs), and the other as relaxed ESR (ESRr).

The win/draw/loss records for NBsESR and NBrESR are given in Table 8. The p value is the outcome of a one-tailed binomial sign test. NBsESR significantly reduces bias and RMSE of NB at the expense of significantly higher variance. It has a nearly significant zero-one loss advantage over NB (p = 0.09). NBrESR also significantly reduces the bias of NB at the expense of significantly increased variance. It performs nearly significantly better than NB in terms of zero-one loss (p = 0.06) and RMSE (p = 0.11). Based on the win/draw/loss records, NBsESR does not have any significant differences in comparison to NBrESR.

Table 9 presents the win/draw/loss records of comparisons between AODEsESR and AODErESR. Both AODEsESR and AODErESR perform significantly better than AODE in terms of zero-one loss, bias and RMSE. The variance of AODErESR is nearly significantly worse than that of AODE. AODEsESR has a significant RMSE advantage over AODErESR. Although AODEsESR has lower zero-one loss and bias more often than AODErESR, this is not statistically significant.
Considering that AODEsESR has significantly lower RMSE in comparison to AODErESR and it has lower zero-one loss more often than AODErESR, we chose ESRs for further analysis. For ease of exposition, we refer to ESRs as ESR from here on. 8.3 Effects of BSE, LSR, NSR, ESR In this section, we evaluate the effect of BSE, NSR, LSR and ESR on NB and AODE. As NBBSE without a binomial sign test has a significant zero-one loss advantage relative to Table 10 Win/draw/loss: NBBSE, NBLSR, NBNSR and NBESR vs. NB Table 11 Win/draw/loss: AODEBSE, AODELSR, AODENSR and AODEESR vs. AODE NBBSE with a binomial sign test (win/draw/loss being 38/2/20), we only present the results of NBBSE without a binomial sign test. In the context of AODE, the zero-one loss advantage of PCE with a binomial sign test relative to PCE without a binomial sign test is significant (win/draw/loss 37/0/23). PCE with a binomial sign test frequently obtains lower zero-one loss than CE, PCE and PCE with a binomial sign test. Therefore, we present the result of PCE with a binomial sign test, which is indicated as AODEBSE. The minimum frequency for LSR, NSR and ESR is set to l = 100. We select r value for NSR in the range of 0.75 to 0.99 with an increment of 0.01 by using leave-one-out cross validation. The value with the lowest cross-validation zero-one loss is selected. A binomial sign test is used to assess whether an zero-one loss reduction resulting from using a r value is significant. If the best leave-one-out cross validation zero-one loss is not significantly higher than the zero-one loss of its base learner, NSR defaults to LSR. Table 10 presents the win/draw/loss records for NBBSE, NBNSR, NBLSR and NBESR against NB on sixty data sets. The p value is the outcome of a one-tailed binomial sign test. All four improvements to NB have significant zero-one loss, bias and RMSE advantages over NB. NBBSE, NBLSR and NBNSR have significant variance disadvantage relative to NB. The variance disadvantage of NBLSR is not significant. The win/draw/loss records for AODEBSE, AODENSR, AODELSR and AODEESR are shown in Table 11. All four improvements to AODE significantly reduce AODEs zero-one loss. AODENSR and AODEESR have significant RMSE advantages over AODE. AODEBSE and AODELSR have lower RMSE more often than AODE and this result is almost significant (p < 0.1). AODEBSE, AODENSR and AODEESR have significant bias advantage over AODE. AODENSR has a significant variance disadvantage relative to AODE. AODEBSE, AODELSR and AODEESR also have variance disadvantages relative to AODE, however, these results are not significant. Table 12 Win/draw/loss: NB and its improvements handling missing values directly compared with missing value imputation for the 20 datasets that contain missing values Table 13 Win/draw/loss: AODE and its improvements handling missing values directly compared with missing value imputation for the 20 datasets that contain missing values 8.4 Handling missing values Both NB and AODE have the ability to directly handle missing values. We evaluate the effectiveness of the discussed improvements to NB and AODE in handling missing values, by comparing their performance when doing so against their performance on the 20 datasets with missing values replaced by either modes or means from the training data. The win/draw/loss records of NB and its improvements on datasets with and without missing values are given in Table 12. NB and its improvements have zero-one loss and RMSE advantages when missing values are handled directly. 
The zero-one loss advantage of NBESR when missing values are directly handled is nearly significant (p = 0.07). NBNSR, NBLSR and NBESR directly handling missing values have a nearly significant (p = 0.07) variance advantage over the corresponding algorithms that use missing value imputation. AODE, AODEBSE, AODENSR and AODELSR directly handling missing values have lower RMSE significantly more often than the corresponding algorithms with missing value imputation. The zero-one loss of all the AODE variants is also reduced more often when missing values are directly handled, but this result is not significant. The variance is also reduced marginally more often when missing values are directly handled.

8.5 Comparison of ten algorithms

In this section, we compare the ten algorithms discussed with LR and LibSVM. We use Weka's implementation and default settings of LR, which builds a multinomial logistic regression model with a ridge estimator whose default value is 10⁻⁸. We use Weka's implementation and default settings of LibSVM with the exceptions of turning on normalization of data and performing a grid-search on C and γ for the RBF kernel using 5-fold cross-validation. Each pair of (C, γ) is tried (C = 2⁻⁵, 2⁻³, . . . , 2¹⁵; γ = 2⁻¹⁵, 2⁻¹³, . . . , 2³) and the one with the lowest cross-validation zero-one loss is selected. Due to the high time complexity of this process, the results of LibSVM on Covertype and Census-Income (KDD) have not been obtained and those on Connect-4 Opening, Shuttle, Adult, Letter Recognition and MAGIC Gamma Telescope are obtained from five runs of two-fold cross-validation. LibSVM in Weka uses a logistic function to calibrate the probability output; however, it is substantially slower than LibSVM without calibration. To avoid even slower training, we do not calibrate its output to produce probability estimates. It uses the one-against-one approach to generalizing from two-class classification to multi-class classification.

The NB and AODE variants discussed can only handle categorical data. On the other hand, LibSVM and LR cannot handle missing values. In order to compare the discussed algorithms with LibSVM and LR, we evaluated all the algorithms using 3-bin equal frequency discretization of quantitative attributes and missing values replaced by either their modes or means. While LibSVM's performance is superior on numeric data, it was evaluated on discretized data to provide a comparison of performance on categorical data. Although AODE has been extended to handle numeric data, the research is still at an early stage (Flores et al. 2009). Thus, comparison of AODE with LR and LibSVM on numerical data sets is left for future work.

Demšar (2006) recommends the Friedman test (Friedman 1937, 1940) for comparisons of multiple algorithms over multiple data sets. It first calculates the ranks of algorithms for each data set separately (average ranks are assigned if there are tied values), and then compares the average ranks of algorithms over data sets. The null-hypothesis is that there is no difference in average ranks. We reject the null-hypothesis if the Friedman statistic derived by Iman and Davenport (1980) is larger than the critical value of the F distribution with a − 1 and (a − 1)(D − 1) degrees of freedom for α = 0.05, where a is the number of algorithms and D is the number of data sets. If the null-hypothesis is rejected then it is probable that there is a true difference in the average ranks of at least two algorithms.
Post-hoc tests, such as the Nemenyi test, are used to determine which pairs of algorithms have significant differences. With 12 algorithms and 58 data sets, the Friedman statistic is distributed according to the F distribution with 12 − 1 = 11 and (12 − 1) × (58 − 1) = 627 degrees of freedom. The critical value of F(11, 627) for α = 0.05 is 1.8039. The Friedman statistics for zero-one loss, bias and variance in our experiments are 7.8281, 17.7483 and 9.0832 respectively, and hence we reject all the null-hypotheses.

The Nemenyi test is used to further analyze which pairs of algorithms are significantly different. Let dij be the difference between the average ranks of the ith and the jth algorithms. We assess the difference between the ith and the jth algorithms as significant if dij > Critical Difference (CD). With 12 algorithms and 58 data sets, the Critical Difference for α = 0.05 is

CD = 3.164 \sqrt{\frac{a(a+1)}{6D}} = 3.164 \sqrt{\frac{12 \times (12+1)}{6 \times 58}} = 2.1184.

As we do not obtain LibSVM's probability estimates, we present RMSE results for all the algorithms except LibSVM. The critical value of F(10, 570) for α = 0.05 is 1.8473, and the Friedman statistic for RMSE is 19.5832. Therefore, the null-hypothesis that there is no difference in average RMSE ranks is rejected. The Critical Difference for α = 0.05 with 11 algorithms and 58 data sets is CD = 1.9486.

Following the graphical presentation proposed by Demšar, we show the comparison of these algorithms against each other with the Nemenyi test on zero-one loss, bias, variance and RMSE in Figs. 4 and 5. We plot the algorithms on the left line according to their average ranks, which are indicated on the parallel right line. The Critical Difference (CD) is also presented in the graphs. The lower the position of an algorithm, the lower its rank, and hence the better its performance. The algorithms are connected by a line if their differences are not significant. Since the comparison involves 12 algorithms, the power of the Nemenyi test is low and so only large effects are likely to be apparent.

Fig. 4 Zero-one loss and RMSE comparison with the Nemenyi test on 58 data sets. CD = 2.1184 for zero-one loss and CD = 1.9486 for RMSE

8.5.1 Zero-one loss and RMSE

AODENSR achieves the lowest mean zero-one loss rank (5.078), followed by AODELSR (5.224). They enjoy a significant zero-one loss advantage relative to NBBSE, LR, NBLSR, NBESR and NB. LibSVM is ranked third overall (5.362). The Nemenyi test differentiates LibSVM from LR, NBLSR, NBESR and NB. AODEBSE has a significantly lower mean zero-one loss rank than NBLSR, NBESR and NB. AODENSR, AODEESR, AODEBSE and AODELSR have lower mean zero-one loss ranks than AODE, but not significantly so. Due to the low power of the Nemenyi test when a large number of algorithms are compared, these results differ from those of Sect. 8.3, in which BSE, NSR and LSR provide significant zero-one loss reductions in NB and all four improvements to AODE (BSE, NSR, LSR and ESR) significantly improve upon the zero-one loss of AODE.

When RMSE is compared, there are two clear groups. AODENSR, AODEESR, AODELSR, AODEBSE and AODE deliver significantly lower mean RMSE ranks than all the other algorithms. AODENSR and AODEESR achieve the lowest and second lowest mean RMSE ranks (3.914 and 4.095 respectively). The differences in mean RMSE ranks among NBNSR, NBLSR, NBBSE and NBESR are small, ranging from 8.095 to 8.888.

8.5.2 Bias and variance

AODEBSE obtains the lowest mean bias rank (4.4828), followed closely by AODENSR (4.5776).
8.6 Average elimination ratio

To observe the percentage of generalizations or near-generalizations, we calculate average attribute elimination ratios for LSR and NSR on each data set, obtained by dividing the number of attributes deleted by the number of attributes across all the test examples and iterations:

eLSR = (Σ_{i=1}^{u} Σ_{o=1}^{t} eoi) / (u × t × n)    (7)

where u is the number of iterations (it is 50 in our experiment), t is the number of test instances in an iteration, n is the number of attributes and eoi is the number of attributes deleted for the oth instance in the ith iteration. NSR also uses (7) to calculate the average elimination ratio eNSR.

8.6.1 Average elimination ratio of LSR

Figure 6 shows the average elimination ratios of LSR. An average elimination ratio of zero represents no deletions; the larger the elimination ratio, the more attributes are deleted. The data sets in Fig. 6 are in the number sequence of Table 5. Since the attributes deleted do not change from classification algorithm to algorithm, NBLSR and AODELSR have an identical elimination ratio on the same data set. As illustrated in Fig. 6, elimination occurs on 22 out of 60 data sets. For more than 5% of data sets, over 50% of attribute values are eliminated. For more than 15% of data sets, over 10% of attribute values are eliminated. As a larger value of l can reduce the number of true generalizations that are detected, higher percentages of attribute values are deleted when smaller values of l are used.

[Fig. 6: Average attribute elimination ratio of LSR. The data sets are in the number sequence of Table 5.]

[Fig. 7: Zero-one loss ratio.]

The average elimination ratios on four data sets are greater than 0.5. The zero-one loss ratios of NBLSR to NB and AODELSR to AODE on these four data sets are shown in Fig. 7(a) and those of NBNSR to NB and AODENSR to AODE are shown in Fig. 7(b). RMSE results are shown in Fig. 8(a) and (b). Ratios less than one indicate improvement.

On Covertype, eLSR = 0.7826, indicating that more than 78% of attribute values are eliminated. The reason for this high ratio is that this data set has 44 binary attributes, each having a value of 0 if a type of wilderness area or soil is absent and a value of 1 otherwise. The 11th to 14th attributes describe four types of wilderness areas respectively. If one of these attributes has a value of 1, then the other attributes will have a value of 0. For any pair of these four attributes, there are three possible value pairs, {0, 0}, {0, 1} and {1, 0}, two of which have the substitution relationship. Over all four attributes, there are 18 possible pairs of attribute values, two-thirds of them having the substitution relationship. The same rule applies to the 15th to 54th attributes, for which there are 2340 possible pairs of attribute values, again with two-thirds of them having the substitution relationship. In addition, the generalization relationship is frequently detected between wilderness areas and soil types; for example, the 15th attribute with a value of 1 is a specialization of the 11th attribute with a value of 0.
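These substitution and generalization relationships are what LSR detects from the training counts at classification time. The sketch below is only an illustration of the idea, not our implementation, and it assumes a count-based criterion in the spirit of the definition used earlier in the paper: xj is deleted as a generalization of xi when every training case containing xi also contains xj and xi occurs at least l times (l is fixed to 100 in our experiments). The names count, find_generalizations and nb_class_probs are assumptions.

    # A hedged sketch of lazy subsumption resolution (LSR) at classification time.
    # `count` maps a frozenset of one or two attribute values to its training frequency.
    from itertools import combinations

    def find_generalizations(instance_values, count, l=100):
        """Values of the test instance that another of its values renders redundant."""
        redundant = set()
        for xi, xj in combinations(instance_values, 2):
            n_i = count.get(frozenset((xi,)), 0)
            n_j = count.get(frozenset((xj,)), 0)
            n_ij = count.get(frozenset((xi, xj)), 0)
            if n_i >= l and n_ij == n_i:
                redundant.add(xj)        # xj generalizes xi, so delete xj
            elif n_j >= l and n_ij == n_j:
                redundant.add(xi)        # xi generalizes xj, so delete xi
        return redundant

    def classify_with_lsr(instance_values, count, nb_class_probs, l=100):
        """Delete generalizations, then classify the reduced instance with NB."""
        redundant = find_generalizations(instance_values, count, l)
        reduced = [v for v in instance_values if v not in redundant]
        probs = nb_class_probs(reduced)  # class -> (unnormalised) posterior estimate
        return max(probs, key=probs.get)

When two values always co-occur (a substitution), only one of them is deleted, which corresponds to the treatment of the Covertype wilderness and soil attributes described above.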
[Fig. 8: RMSE ratio.]

On Covertype, the test time of AODE is substantially reduced from 4987.96 seconds to 928.60 seconds, even though LSR performs an additional step to detect generalizations. The zero-one loss ratios of NBLSR to NB and AODELSR to AODE are 0.9967 and 0.9845 and the RMSE ratios are 0.9953 and 0.9897 respectively. When we apply LSR to this data without considering the relationships between these binary attributes, the average elimination ratio is 0.0994.

The average elimination ratio on Annealing is 0.7046. One factor that contributes to this high ratio is that this data set has many missing values. More than 76% of attributes have missing values and 70% of attributes have more than 50% missing values. Many attributes have only one value in addition to missing values. Quantitative attributes in this data set do not have missing values. When the missing values for qualitative attributes are replaced with modes, these attributes have only one value, which results in a large number of generalizations. As NB and AODE can deal with missing values, we apply LSR to NB and AODE without replacing missing values. The resulting average elimination ratio is 0.1126. Several exemplar generalizations for this data are: Len < 0.5 is a generalization of Shape = COIL; Bore = 0000 is a generalization of Width < 609.95, of Shape = SHEET and of Len > 821; Shape = SHEET is a generalization of Len > 0.5; and Strength < 150 is a generalization of Thick < 0.6995. AODE's test time is reduced from 2.99 seconds to 1.16 seconds. The zero-one loss ratios of NBLSR to NB and AODELSR to AODE are 0.9427 and 0.8936 and the RMSE ratios are 0.9569 and 0.9316 respectively. The test time of AODE is reduced from 4.11 seconds to 3.10 seconds.

The third largest average elimination ratio is on Connect-4 Opening, where 64.51% of attribute values are deleted on average. Connect 4 is a game in which two players take turns placing pieces on a 7-column, 6-row vertically suspended grid, each trying to connect four of their own pieces horizontally, vertically or diagonally. The 6 rows are numbered 1 through 6 and the 7 columns are labeled a through g. There are 42 attributes, each having 3 values. An attribute has a value of x if the corresponding square is occupied by the first player and a value of o if the square is occupied by the second player. Otherwise, this attribute has a value of b. Figure 9 shows an example from this data set. When a piece is placed in one of the columns, it falls to the lowest unoccupied square in the column. Therefore, all squares higher than the lowest unoccupied square are empty. From this rule we have, for all z ∈ {a, b, c, d, e, f, g} and 1 ≤ i < j ≤ 6, that if zi = b then zj = b. In other words, zj = b is a generalization of zi = b. In Fig. 9, 27 empty squares (the grey colored squares) are generalizations. Because a large number of attribute values are deleted, LSR substantially reduces the test time of AODE from 222.91 seconds to 59.97 seconds. The zero-one loss ratios of NBLSR to NB and AODELSR to AODE are 0.9810 and 0.9819 and the RMSE ratios are 0.9933 and 0.9911 respectively.

[Fig. 9: An example from Connect-4 Opening. A grey colored square is a generalization of the lowest unoccupied square in the column.]

On average, 51.07% of attribute values are deleted on Mushroom. The number of attributes is 22 and the mean number of values per attribute is 6.7. In total, there are 22 × 6.7 × (22 × 6.7 − 6.7)/2 = 10340.11 combinations of attribute values.
Among them, 227 pairs of attribute values are detected as having the generalization relationship. For example, Cap-shape = convex is a generalization of Odor = creosote and Gill-attachment = free is a generalization of Gill-spacing = crowded. LSR reduces NB's zero-one loss and RMSE from 0.0109 and 0.0946 to 0.0046 and 0.0606 respectively. The zero-one loss of AODE is unchanged by the addition of LSR. However, it is not immediately clear why the application of LSR increases the RMSE of AODE from 0.0162 to 0.0190.

Almost half of the attribute values are deleted on Census-Income (KDD) (eLSR = 0.4953). On the whole training data, 332 attribute values are identified as a generalization of another attribute value. The generalization relationship between attribute values in this data is often obvious. For instance, if State-of-previous-residence = Florida, then Region-of-previous-residence = south. The discretized values for the first attribute Age are ≤ 21.5, > 21.5 and ≤ 43.5, and > 43.5. The 23rd attribute is Detailed-household-and-family-stat. If the value of this attribute is Child < 18-never-marr-not-in-subfamily, then Age ≤ 21.5. The 32nd attribute is Family-members-under-18. All values except not-in-universe are specializations of Age ≤ 21.5. The 27th attribute is Migration-code-change-in-reg with 8 values. The 28th attribute is Migration-code-move-within-reg with 9 values. Five values of these two attributes are identical: not-in-universe, nonmover, same-county, different-county-same-state and abroad. The 27th attribute with any of these five values is a substitution (and so a generalization) of the 28th attribute with the corresponding value. The zero-one loss ratios of NBLSR to NB and AODELSR to AODE are 0.9367 and 0.8052 and the RMSE ratios are 0.9367 and 0.8834 respectively (shown in Figs. 7 and 8). The test time of AODE is reduced from 478.93 seconds to 291.20 seconds.

8.6.2 Average elimination ratio of NSR

Figure 10 presents the average elimination ratios for NSR. As NSR considers the specific classification algorithm in the process of selecting the value of r, NBNSR and AODENSR can have different average elimination ratios. When NSR is applied to NB and AODE, elimination occurs on more than 40% of data sets. For 10% of data sets, over 50% of attribute values are eliminated. For more than 20% of data sets, more than 10% of attribute values are eliminated.

[Fig. 10: Average attribute elimination ratio of NSR. The data sets are in the number sequence of Table 5.]

The average elimination ratio on Covertype is 0.8145 for NBNSR and 0.8087 for AODENSR. The zero-one loss ratios of NBNSR to NB and AODENSR to AODE are 0.9604 and 0.9689 and the RMSE ratios are 0.9834 and 0.9935 respectively (see Figs. 7(b) and 8(b)). To observe the number of pairs of attribute values that have the near-generalization relationship, we apply NSR to the whole training data. As the value of r changes from one cross-validation run to another, we use the most frequently selected value, r = 0.75, from the 50-run 2-fold cross-validation for NBNSR. In total, 3078 pairs of attribute values are identified as having near-generalization relationships. An example of such a relationship is that if the aspect in degrees azimuth is between 78.5 and 205.5, then we can roughly infer that the hill shade index at 9 am is greater than 226.5.

Census-Income (KDD) has the second largest eNSR = 0.7976 for both NBNSR and AODENSR. The zero-one loss and RMSE of NB and AODE are substantially reduced by the addition of NSR. The zero-one loss ratios of NBNSR to NB and AODENSR to AODE are 0.7146 and 0.6680 and the RMSE ratios are 0.7851 and 0.7967 respectively. When NSR is applied to the full training data with r = 0.75, a value selected by most folds for NBNSR, 10235 attribute values are detected as near-generalizations. For example, if Class-of-worker = never-worked, we can infer that most such people are younger than 21.5. If Wage-per-hour > 800.5, we can approximately infer that Class-of-worker = private. Most people that work in the construction industry are male. Over 90% of people whose wage per hour is greater than 800.5 dollars were born in the United States.
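Such near-generalization pairs can be enumerated directly from the value and pair frequencies in the training data. The following sketch is an illustration only (not our code) and assumes the ratio criterion paraphrased above: xj is treated as a near-generalization of xi when at least a fraction r of the training cases containing xi also contain xj, subject to a minimum support l. The names value_count and pair_count denote assumed frequency tables.

    # A hedged sketch of enumerating near-generalization pairs over the training data.
    from itertools import combinations

    def near_generalizations(values, value_count, pair_count, r=0.75, l=100):
        pairs = []
        for xi, xj in combinations(values, 2):
            n_i, n_j = value_count.get(xi, 0), value_count.get(xj, 0)
            n_ij = pair_count.get(frozenset((xi, xj)), 0)
            if n_i >= l and n_ij / n_i >= r:
                pairs.append((xj, xi))   # xj is a near-generalization of xi
            if n_j >= l and n_ij / n_j >= r:
                pairs.append((xi, xj))   # xi is a near-generalization of xj
        return pairs

Setting r = 1 recovers the strict generalization test used by LSR.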
Figure 4 does not reveal the zero-one loss and RMSE differences between LSR and NSR as significant, since the number of algorithms compared is large and consequently the power of the Nemenyi test is low. When NSR is compared with LSR directly, the former has significant zero-one loss, bias and RMSE advantages relative to the latter on NB and AODE. Table 14 presents the win/draw/loss records for NSR against LSR and BSE on NB and AODE. The zero-one loss and RMSE differences between NSR and BSE are small when they are applied to NB, while NSR has a marginal zero-one loss advantage and a significant RMSE advantage relative to BSE when they are applied to AODE.

[Table 14: Win/draw/loss records for NBNSR vs. NBLSR, NBNSR vs. NBBSE, AODENSR vs. AODELSR and AODENSR vs. AODEBSE.]

In this section, we investigate the circumstances under which deleting near-generalizations proves advantageous, based on two exemplar data sets (Adult and Abalone), both deleting more than 10% of attribute values.

8.6.4 Adult

The classification task of Adult is to predict whether income exceeds fifty thousand US dollars a year. It has 14 attributes (6 continuous and 8 discrete) besides the class label. The 5th attribute Education-num recodes the 4th attribute Education from a descriptive to a numeric format, and hence Education-num without discretization is a substitution of Education and Education-num with discretization is a generalization of Education. Given Education, Education-num is redundant. This redundancy can be detected by LSR. The zero-one losses of AODE with all attributes and with all attributes except Education-num are 0.1598 and 0.1588 respectively. However, NB with all attributes has lower zero-one loss (0.1727) than NB with all attributes except Education-num (0.1851). The zero-one losses of NBLSR and AODELSR are 0.1802 and 0.1575 respectively. As NBLSR and AODELSR delete attribute values depending upon which values are instantiated in the object being classified, they may produce results that differ from those of NB and AODE with the complete attribute deleted. The generalization relationship is detected between another 6 pairs of attribute values.

NSR substantially improves upon NB and AODE on Adult. The zero-one losses of NB and AODE are reduced from 0.1727 and 0.1598 to 0.1550 and 0.1484 respectively, and the RMSEs of NB and AODE are reduced from 0.3550 and 0.3383 to 0.3309 and 0.3213 respectively. We investigate the attributes that are deleted by NSR employing r = 0.98, the most frequently selected value in the 50-run 2-fold cross-validation for NBNSR. Our experiments reveal that both LSR and NSR delete the generalizations discussed above for most test instances, and that NSR also deletes two other types of attributes: near-generalizations and attributes with noise.
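The value of r referred to above is selected by a cross-validation parameter search. The fragment below sketches the idea of picking the candidate with the lowest cross-validated zero-one loss; the candidate grid, the fold count and the evaluate_nsr callback are illustrative assumptions, not our exact procedure.

    # A hedged sketch of choosing the near-generalization threshold r by cross-validation.
    import numpy as np

    def select_r(train_data, evaluate_nsr, candidates=np.arange(0.99, 0.69, -0.01), folds=2):
        # evaluate_nsr(train_data, r, folds) is assumed to return the
        # cross-validated zero-one loss of the NSR-augmented classifier for this r.
        losses = [evaluate_nsr(train_data, r, folds) for r in candidates]
        return float(candidates[int(np.argmin(losses))])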
8.6.4.1 Near-generalizations

The 6th attribute Marital-status and the 8th attribute Relationship are closely associated. If a person is classified as a wife (or husband), she (or he) must be a married person. However, we cannot make the further judgement of whether a married person is a Married-civ-spouse or a Married-AF-spouse, as both types of marriage are listed in the data set. There are 22379 instances of Married-civ-spouse and 37 instances of Married-AF-spouse. It is obvious that civilian marriages account for the majority of marriages. Therefore, we can approximately infer that a married person belongs to a civilian marriage. That is, Marital-status = Married-civ-spouse is a near-generalization of Relationship = Husband and of Relationship = Wife. To evaluate the effect of deleting Marital-status = Married-civ-spouse using r = 0.98, we apply NSR to NB and AODE but restrict its application to deleting only values of Marital-status. The zero-one loss and RMSE of NBNSR are 0.1572 (< 0.1727) and 0.3402 (< 0.3550) respectively, and those of AODENSR are 0.1511 (< 0.1598) and 0.3280 (< 0.3383) respectively. These results suggest that eliminating a near-generalization that accounts for a large part of the population to which its near-specializations belong can be beneficial. In total, 10 pairs of attribute values are identified as having the near-generalization relationship.

8.6.4.2 Attributes with noise

The 10th attribute Sex has two values, female and male. These values have a clear generalization relationship with two values (Husband and Wife) of the attribute Relationship. That is, Sex = female is a generalization of Relationship = Wife and Sex = male is a generalization of Relationship = Husband. However, due to noise in the data, the relation cannot be detected by LSR. The values of Relationship and Sex of the 7110th instance in Adult are Husband and female respectively. Another two cases are the 576th and 27142nd instances, in which Relationship = Wife and Sex = male. When we apply NSR to NB and AODE but restrict its application to deleting only values of Sex, NBNSR and AODENSR have zero-one losses of 0.1659 (< 0.1727) and 0.1574 (< 0.1598) respectively and RMSEs of 0.3476 (< 0.3550) and 0.3349 (< 0.3383) respectively. These results indicate that the near-generalization technique can be useful in at least some cases of noise.

8.6.5 Abalone

In Abalone, the classification task is to predict the age of an abalone from its physical measurements, many of which are closely correlated with one another. Since NB and AODE cannot handle numeric classes, we select the only attribute (Sex) that has categorical values (M, F and I) as the class.

[Fig. 11: Close correlation of Length and Diameter.]

Figure 11(a) presents the scatter graph that plots the values of Length against the corresponding values of Diameter without discretization. The relationship between these two attributes is linear. Similar relationships are observed for other attributes (scatter graphs are not presented): specifically, Shucked-weight and Viscera-weight are linearly related, while Whole-weight and Length, and Diameter and Shell-weight, are roughly linearly related. As Diameter and Length are positively linearly related, it is logical to infer that an abalone with a small diameter is shorter. However, there are some exceptional cases where abalones with a small diameter measure longer than average. Figure 11(b) presents the relationship between Length and Diameter with discretized values.
Dark grey blocks are cases with Diameter ≤ 0.3775, light grey blocks are cases with Diameter > 0.4625 and white blocks are cases with other values. The Diameter of 1330 out of 1374 cases with Length ≤ 0.4825 is less than 0.3775, that of 1206 out of 1433 cases with 0.4825 < Length ≤ 0.5925 is between 0.3775 and 0.4625, and that of 1283 out of 1370 cases with Length > 0.5925 is larger than 0.4625. The near-generalization relationship between values of these two attributes is clear; for example, Diameter ≤ 0.3775 is a near-generalization of Length ≤ 0.4825. LSR cannot find these near-relationships most of the time. Note that when the relationship is detectable, a deletion will only occur for test cases that are not themselves outliers; for example, a long abalone with a small diameter will not be an instance of the detected near-generalization relationship.

To observe the effect of the elimination of near-generalizations, we apply NSR to NB and AODE using values of r in the range 0.99 to 0.70 with a decrement of 0.01.

[Fig. 12: Learning curves on Abalone.]

Figure 12(a) shows the learning curves on zero-one loss, in which each point represents the zero-one loss of NBNSR and AODENSR corresponding to the value of r on the x-axis. Figure 12(b) presents the learning curves on RMSE for NBNSR and AODENSR. The zero-one loss and RMSE of LSR are also included in the graphs. The two-decimal numbers on the x-axis are the lower bounds of the near-generalization relationship. NBNSR has a largely downward trend in zero-one loss over the range r = 0.99 to r = 0.76. The RMSE of NBNSR starts with a steady decline and stabilizes at 0.429 from r = 0.72. The RMSE of AODENSR decreases slightly from r = 0.99 to r = 0.85, stabilizes at 0.4227 from r = 0.85 to r = 0.79, and then increases slightly. These results suggest that selecting an appropriate r value for NSR can have a positive effect on RMSE. There is a largely upward trend for the zero-one loss of AODENSR with decreasing values of r. The biases of NBNSR and AODENSR have a clear downward trend and the variances of these two methods have a clear upward trend (graphs are not presented). One possible reason for the discrepant trends in zero-one loss for NB and AODE is the greater complexity of an AODE model compared to an NB model, resulting in greater variance. The increase in variance introduced by NSR outweighs the reduction in bias and results in an overall increase in zero-one loss for AODE, while NSR provides an appropriate bias-variance trade-off and results in an overall reduction in zero-one loss for NB.

9 Conclusions and future work

We have proposed novel techniques, LSR and NSR, to efficiently detect the generalization and near-generalization relationships, special forms of inter-dependency, and to delete generalizations and near-generalizations at classification time. We have also proposed ESR, which is a filter that transforms the training data to remove these relationships at training time. We investigate the effect of LSR, NSR and ESR on zero-one loss and RMSE by applying them to NB and AODE. Extensive experimental results (win/draw/loss records) show that LSR and NSR significantly improve upon NB's zero-one loss and RMSE. ESR also significantly improves upon NB's probability estimates, but its zero-one loss improvements are marginal. The zero-one loss and RMSE of AODE can be significantly enhanced by the addition of NSR and ESR. Whilst LSR improved the zero-one loss and RMSE of AODE more often than not, only the zero-one loss was improved significantly more often.
LSR, NSR and ESR are suited to probabilistic techniques, such as NB and AODE, but not to similarity-based techniques. SR is related to attribute elimination, although it only eliminates specific values and only in the context of other specific values. For this reason we compared SR to BSE. BSE has considerably higher training time overheads than LSR. In the context of AODE, NSR has marginal classification and probabilistic prediction advantages relative to BSE. LSR inherits NB's and AODE's capacity for incremental learning, while ESR, NSR and BSE do not support incremental learning. We believe that the appropriate conclusion to draw from our results is that LSR, NSR and ESR are effective at reducing error, rather than that they are necessarily superior to the BSE strategy in this respect in the AODE context. It is also possible that SR may be complementary to attribute elimination, with attribute elimination in the context of SR removing attributes that are problematic for reasons other than generalization-specialization relationships.

We explore the reasons for high percentages of generalizations on three data sets. We also investigate the circumstances in which NSR proves beneficial, based on two exemplar data sets. When a near-generalization accounts for the majority of the population to which the corresponding near-specialization belongs, elimination of the near-generalization is likely to be beneficial. It may also have an advantage when attributes are closely rather than perfectly associated. Furthermore, it may provide some tolerance for noise.

LSR and ESR provide computationally efficient techniques for reducing the dimensionality of the data. There are a number of avenues for extending these techniques. The near-generalization parameter r for NSR is currently chosen by performing a parameter search using cross-validation. A theoretical analysis to identify a more effective method of choosing r is an area of future work. Applying SR techniques to higher-order averaged n-dependence estimator algorithms such as A2DE and A3DE (see Webb et al. 2011 for details) is another area of future research. The order in which attributes are chosen for merging in ESR has a direct effect on the final outcome and the optimum order is likely to be different for NB and AODE. Exploration of effective methods for choosing the attribute merge order is a further area for future work.

We use the Friedman and Nemenyi tests to compare NB, AODE and their variants with LR and with LibSVM with a grid parameter search on categorical data. The results reveal the outstanding performance of AODENSR and AODEESR on our data sets. They enjoy a considerable advantage in zero-one loss and RMSE over NB, NBBSE, NBNSR, NBLSR and LR. They also have a better mean zero-one loss rank in comparison to LibSVM. AODELSR also achieves strong zero-one loss and RMSE performance with low training time and modest test time overheads. It is notable that all of the SR variants of AODE obtain zero-one loss comparable to SVM with a grid parameter search. This comparable performance is obtained with far less computation. It is not possible to provide meaningful compute time comparisons because the computational requirements of LibSVM on the large data sets required that it be run in a heterogeneous grid computing environment, from which useful timing comparisons cannot be obtained. Notably, the AODE variants are linear in the quantity of data and are capable of directly handling missing data.
In addition, NSR is the only variant that has been tested here using a parameter search. The only parameters used by the other variants are l, which has been fixed to 100, and the value of m used in m-estimation, which is fixed at 0.1. Finally, LSR supports incremental learning and learns in a single pass through the training data, making it possible to learn from data that are too large to reside in RAM.

Acknowledgements This research has been supported by the Australian Research Council under grant DP0772238. The authors are grateful to Janez Demšar for his kind help with the Nemenyi test.

