HITON: a novel Markov Blanket algorithm for optimal variable selection.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480117/pdf/

HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection

C.F. Aliferis M.D., Ph.D., I. Tsamardinos Ph.D., A. Statnikov M.S.
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN

ABSTRACT

We introduce a novel, sound, sample-efficient, and highly scalable algorithm for variable selection for classification, regression and prediction called HITON. The algorithm works by inducing the Markov Blanket of the variable to be classified or predicted. A wide variety of biomedical tasks with different characteristics were used for an empirical evaluation, namely: (i) bioactivity prediction for drug discovery, (ii) clinical diagnosis of arrhythmias, (iii) bibliographic text categorization, (iv) lung cancer diagnosis from gene expression array data, and (v) proteomics-based prostate cancer detection. State-of-the-art algorithms for each domain were selected for baseline comparison. Results: (1) HITON reduces the number of variables in the prediction models by three orders of magnitude relative to the original variable set while improving or maintaining accuracy. (2) HITON outperforms the baseline algorithms by selecting variable sets more than two orders of magnitude smaller than those of the baselines, in the selected tasks and datasets.

INTRODUCTION

The identification of relevant variables (also called features) is an essential component of the construction of decision support models and of computer-assisted discovery. In medical diagnosis, for example, elimination of redundant tests from consideration reduces risks to patients and lowers healthcare costs [1]. The problem of variable selection in biomedicine is more pressing than ever, due to the recent emergence of extremely large datasets, sometimes involving tens to hundreds of thousands of variables.
Such datasets are common in gene-expression array studies, proteomics, computational biology, text categorization, information retrieval, mining of electronic medical records, consumer profile analysis, temporal modelling, and other domains [1-6]. Most variable selection methods are heuristic in nature, and empirical evaluations have seldom covered domains with more than a hundred variables (see [7-9] and their references for reviews).

Several researchers [1, 10, 11] have suggested, intuitively, that the Markov Blanket of the target variable T, denoted as MB(T), is a key concept for solving the variable selection problem. MB(T) is defined as the set of variables conditioned on which all other variables are probabilistically independent of T. Thus, knowledge of the values of the Markov Blanket variables should render all other variables superfluous for classifying T. Technical details about the distributional assumptions underlying this intuition, the existence and uniqueness of MB(T), and its relations to loss functions and classifier-inducing algorithms were explored only recently, however, by the first two authors of the present paper [8].

From a practical perspective, identifying the Markov Blanket variables has proven to be a challenging task, as evidenced by the limitations of prior methods. Specifically, the approaches in [1,2] are unsound (i.e., provably do not always return the correct MB(T) even with infinite sample and time); the method of [10] is sound but relies on inducing the full Bayesian network and thus does not scale with the number of variables; the work in [11] is unsound and has poor average computational efficiency. Notably, two newer families of algorithms [8, 12] are sound and computationally efficient, but require a sample size exponential in the size of MB(T). In biomedical domains, however, sample sizes are typically limited (and sample-to-variable ratios are often very small).
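To make the definition of MB(T) concrete, the following minimal sketch assumes that the distribution is faithful to a known Bayesian network. Under that assumption MB(T) consists of the parents of T, the children of T, and the other parents of those children (spouses). The toy network and the function name `markov_blanket` are hypothetical illustrations, not part of the paper.

```python
def markov_blanket(parents, target):
    """Return parents, children, and spouses of `target`.

    `parents` maps every node of a DAG to the set of its parents.
    Assumes the data distribution is faithful to this DAG.
    """
    # Children: nodes that list the target among their parents.
    children = {v for v, ps in parents.items() if target in ps}
    # Spouses: other parents of the target's children.
    spouses = {p for c in children for p in parents[c]} - {target}
    return set(parents[target]) | children | spouses


# Hypothetical toy network: A -> T -> C <- B, C -> E, and an unrelated D.
toy_parents = {
    "A": set(), "B": set(), "D": set(),
    "T": {"A"}, "C": {"T", "B"}, "E": {"C"},
}

mb = markov_blanket(toy_parents, "T")
print(sorted(mb))  # ['A', 'B', 'C']
```

In this toy network E depends on T only through C, and D is unrelated to T, so neither belongs to MB(T). B is included even though it is marginally independent of T, because it becomes informative about T once the common child C is known.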
The contribution of the present paper is that it introduces HITON¹, a sound, sample-efficient, and highly scalable algorithm for variable selection for classification, based on inducing MB(T). HITON is sound provided that (i) the joint data distribution is faithful to a Bayesian network (BN), (ii) the training sample is sufficient to reliably perform the statistical tests required by the algorithm, and (iii) one uses powerful enough classifiers (i.e., classifiers that can learn any classification function given enough data). A distribution is faithful to a BN if all the dependencies in the distribution are strictly those entailed by the Markov Condition of the BN [8]. The vast majority of distributions are faithful in the sample limit [13]. The question that arises is whether the algorithm, and by extension its assumptions, perform well with biomedical data (which, in addition, often involve thousands of variables and limited sample) and with the typical classifiers used in practice. To empirically evaluate HITON, a wide variety of domains with different characteristics were selected. In addition, the best algorithms for each task were selected as baseline comparisons.

¹ Pronounced “hee-tόn”. From the Greek Χιτών, for “cover”, “cloak”, or “blanket”.

AMIA 2003 Symposium Proceedings, Page 21

A Novel Algorithm For Variable Selection

The new algorithm is presented in pseudo-code in Figure 1. V denotes the full set of variables and ⊥(T ; X | S) the conditional independence of T with variable set X given variable set S.
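As an illustration of the ⊥(T ; X | S) notation and of the HITON-PC step named in Figure 1, the sketch below runs the admit-then-eliminate loop on a hypothetical toy network. It is a sketch under stated assumptions, not the authors' implementation: the statistical independence tests are replaced by a d-separation oracle on a known graph (which is equivalent only under faithfulness), the association scores are invented for the example, and the names `d_separated` and `hiton_pc` are hypothetical.

```python
from itertools import combinations


def d_separated(parents, x, y, given):
    """Oracle for ⊥(x ; y | given) in a DAG (node -> set of parents)."""
    # 1. Keep only x, y, the conditioning set, and all their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents[n])
    # 2. Moralize: link each node to its parents and marry co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents[child])
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. x and y are separated if no path avoids the conditioning set.
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in given:
            continue
        seen.add(n)
        stack.extend(nbrs[n])
    return True


def hiton_pc(variables, target, assoc, indep):
    """Admit candidates by decreasing association, then eliminate any
    member that some subset of the other members renders independent."""
    candidates = sorted((v for v in variables if v != target),
                        key=lambda v: assoc(v, target), reverse=True)
    current_pc = []
    for v in candidates:  # each variable is considered exactly once
        current_pc.append(v)
        for x in list(current_pc):
            others = [u for u in current_pc if u != x]
            if any(indep(x, target, set(s))
                   for k in range(len(others) + 1)
                   for s in combinations(others, k)):
                current_pc.remove(x)  # never re-admitted
    return current_pc


# Hypothetical toy network: A -> T -> C <- B, C -> E, and an unrelated D.
toy_parents = {"A": set(), "B": set(), "D": set(),
               "T": {"A"}, "C": {"T", "B"}, "E": {"C"}}
# Invented association scores with T (assumption for the example only).
toy_assoc = {"C": 0.9, "A": 0.8, "E": 0.5, "B": 0.0, "D": 0.0}

pc = hiton_pc(list(toy_parents), "T",
              assoc=lambda v, t: toy_assoc[v],
              indep=lambda x, t, s: d_separated(toy_parents, x, t, s))
print(pc)  # ['C', 'A']
```

Here E is admitted because it is marginally associated with T, and is then removed once C is available as a conditioning variable. The spouse B is not a parent or child of T, so it is correctly absent from this set; recovering it is the job of the HITON-MB step. With real data, `indep` would be a statistical test, and the exhaustive subset enumeration would become expensive as the candidate set grows.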
HITON (Data D; Target T; Classifier-inducer A)
“returns a minimal set of variables required for optimal classification of T using algorithm A”
    MB(T) = HITON-MB(D, T)         // Identify Markov Blanket
    Vars = Wrapper(MB(T), T, A)    // Use heuristic search to remove unnecessary variables
    Return Vars

HITON-MB(Data D, Target T)
“returns the Markov Blanket of T”
    PC = parents and children of T returned by HITON-PC(D, T)
    PCPC = parents and children of the parents and children of T
    CurrentMB = PC ∪ PCPC
    // Retain only parents of common children and remove parents of parents, children of parents, and children of children
    ∀ potential spouse X ∈ CurrentMB and ∀ Y ∈ PC:
        if ¬∃ S ⊆ {Y} ∪ V - {T, X} so that ⊥(T ; X | S)
            then retain X in CurrentMB
            else remove it
    Return CurrentMB

HITON-PC(Data D, Target T)
“returns parents and children of T”
    CurrentPC = {}
    Repeat
        Find variable Vi ∉ CurrentPC that maximizes association(Vi, T) and admit Vi into CurrentPC
        If there is a variable X and a subset S of CurrentPC s.t. ⊥(X ; T | S)
            remove X from CurrentPC; do not consider X again for admission
    Until no more variables are left to consider
    Return CurrentPC

Wrapper(Vars, T, A)
“returns a minimal set among variables Vars for predicting T using classifier-inducer algorithm A and a wrapping (heuristic search) approach”
    Repeat
        Select and remove a variable from Vars.
        If internally cross-validated performance of A remains the same, permanently remove the variable.
    Until all variables are considered.
    Return Vars

Figure 1: Pseudo-code for algorithm HITON.

HITON-MB first identifies the parents and children of T by calling algorithm HITON-PC, then discovers the parents and children of the parents and children of T. This is a superset of MB(T). False positives are removed by a statisti (...truncated)