HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection
C.F. Aliferis M.D., Ph.D., I. Tsamardinos Ph.D., A. Statnikov M.S.
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
ABSTRACT
We introduce a novel, sound, sample-efficient, and
highly-scalable algorithm for variable selection for
classification, regression and prediction called
HITON. The algorithm works by inducing the
Markov Blanket of the variable to be classified or
predicted. A wide variety of biomedical tasks with
different characteristics were used for an empirical
evaluation, namely: (i) bioactivity prediction for
drug discovery, (ii) clinical diagnosis of arrhythmias,
(iii) bibliographic text categorization, (iv) lung
cancer diagnosis from gene expression array data,
and (v) proteomics-based prostate cancer detection.
State-of-the-art algorithms for each domain were
selected for baseline comparison. Results: (1) HITON
reduces the number of variables in the prediction
models by three orders of magnitude relative to the
original variable set while improving or maintaining
accuracy. (2) HITON outperforms the baseline
algorithms by selecting variable sets more than two
orders of magnitude smaller than those of the baselines,
in the selected tasks and datasets.
INTRODUCTION
The identification of relevant variables (also called
features) is an essential component of constructing
decision support models and of computer-assisted
discovery. In medical diagnosis, for example,
elimination of redundant tests from consideration
reduces risks to patients and lowers healthcare costs
[1]. The problem of variable selection in biomedicine
is more pressing than ever, due to the recent
emergence of extremely large datasets, sometimes
involving tens to hundreds of thousands of
variables. Such datasets are common in gene-expression
array studies, proteomics, computational
biology, text-categorization, information retrieval,
mining of electronic medical records, consumer
profile analysis, temporal modelling, and other
domains [1-6].
Most variable selection methods are heuristic in
nature, and empirical evaluations have seldom
covered domains with more than a hundred
variables (see [7-9] and their references for reviews).
Several researchers [1, 10, 11] have suggested,
intuitively, that the Markov Blanket of the target
variable T, denoted as MB(T), is a key concept for
solving the variable selection problem. MB(T) is
defined as the set of variables conditioned on which
all other variables are probabilistically independent of
T. Thus, knowledge of the values of the Markov
Blanket variables should render all other variables
superfluous for classifying T. Technical details about
the distributional assumptions underlying this
intuition, existence and uniqueness of MB(T), and
relations to loss functions and classifier-inducing
algorithms were explored only recently, however, by
the first two authors of the present paper [8]. From a
practical perspective, identifying the Markov Blanket
variables has proven to be a challenging task as
evidenced by the limitations of prior methods.
Specifically, the approaches in [1,2] are unsound (i.e.,
provably do not always return the correct MB(T) even
with infinite sample and time); the method of [10] is
sound but relies on inducing the full Bayesian
network and thus does not scale to large numbers of
variables; the work in [11] is unsound and has poor
average computational efficiency. Notably, two
newer families of algorithms [8, 12] are sound and
computationally efficient, but require sample size
exponential in the size of MB(T). In biomedical
domains, however, sample sizes are typically limited
(and sample-to-variable ratios are often very small).
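The definition of MB(T) given above can be written compactly. What follows is a sketch only: it borrows the ⊥(T ; X | S) notation and the symbol V (the full variable set) that the paper introduces later with Figure 1, and the name W for "all other variables" is introduced here purely for illustration. The second line is the standard probabilistic reading of conditional independence, not a formula taken from the source.

```latex
\[
  W = V \setminus \bigl(\mathrm{MB}(T) \cup \{T\}\bigr), \qquad
  \perp\bigl(T \,;\, W \mid \mathrm{MB}(T)\bigr)
\]
\[
  \text{i.e.}\quad
  P\bigl(T \mid \mathrm{MB}(T),\, W\bigr) = P\bigl(T \mid \mathrm{MB}(T)\bigr)
  \quad\text{whenever } P\bigl(\mathrm{MB}(T),\, W\bigr) > 0 .
\]
```

Read this way, once the values of the Markov Blanket variables are known, the remaining variables in W carry no further information about T.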
The contribution of the present paper is that it
introduces HITON¹, a sound, sample-efficient, and
highly scalable algorithm for variable selection for
classification, based on inducing MB(T). HITON is
sound provided that (i) the joint data distribution is
faithful to a Bayesian network (BN), (ii) the training
sample is large enough to perform reliably the
statistical tests required by the algorithm, and (iii) one
uses sufficiently powerful classifiers (i.e., classifiers that
can learn any classification function given enough
data). A distribution is faithful
to a BN if all the dependencies in the distribution are
strictly those entailed by the Markov Condition of the
BN [8]. The vast majority of distributions are faithful
in the sample limit [13].
The question that arises is whether the algorithm,
and by extension its assumptions, perform well on
biomedical data (which, in addition, often involve
thousands of variables and limited sample) and with
the typical classifiers used in practice. To empirically
evaluate HITON, a wide variety of domains with
different characteristics were selected. In addition, the
best algorithms for each task were selected as
baseline comparisons.
¹ Pronounced “hee-tόn”. From the Greek Χιτών, for
“cover”, “cloak”, or “blanket”.
AMIA 2003 Symposium Proceedings − Page 21
A Novel Algorithm For Variable Selection
The new algorithm is presented in pseudo-code in
Figure 1. V denotes the full set of variables and
⊥(T ; X | S) the conditional independence of T and
variable set X given variable set S.
HITON (Data D; Target T; Classifier-inducer A)
  “returns a minimal set of variables required for optimal
  classification of T using algorithm A”
  MB(T) = HITON-MB(D, T)       // Identify Markov Blanket
  Vars = Wrapper(MB(T), T, A)  // Use heuristic search to
                               // remove unnecessary variables
  Return Vars

HITON-MB(Data D, Target T)
  “returns the Markov Blanket of T”
  PC = parents and children of T returned by
       HITON-PC(D, T)
  PCPC = parents and children of the parents and
         children of T
  CurrentMB = PC ∪ PCPC
  // Retain only parents of common children and remove
  // parents of parents, children of parents, and children of
  // children
  ∀ potential spouse X ∈ CurrentMB and ∀ Y ∈ PC:
    if ¬∃ S ⊆ {Y} ∪ V - {T, X} so that ⊥(T ; X | S)
      then retain X in CurrentMB
      else remove it
  Return CurrentMB

HITON-PC(Data D, Target T)
  “returns parents and children of T”
  CurrentPC = {}
  Repeat
    Find variable Vi ∉ CurrentPC that maximizes
      association(Vi, T) and admit Vi into CurrentPC
    If there is a variable X in CurrentPC and a subset S
      of CurrentPC s.t. ⊥(T ; X | S)
      remove X from CurrentPC;
      do not consider X again for admission
  Until no more variables are left to consider
  Return CurrentPC

Wrapper(Vars, T, A)
  “returns a minimal set among variables Vars for
  predicting T using classifier-inducer algorithm A and a
  wrapping (heuristic search) approach”
  Repeat
    Select and remove a variable from Vars.
    If internally cross-validated performance of A remains
      the same, permanently remove the variable.
  Until all variables are considered.
  Return Vars
Figure 1: Pseudo-code for algorithm HITON.
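As an illustration of the HITON-PC and Wrapper steps in Figure 1, the following Python sketch may help. It is not the authors' implementation. The callables `assoc`, `indep` and `score` are hypothetical stand-ins for the association measure, the conditional independence test and the cross-validated classifier performance, none of which Figure 1 specifies. Two further assumptions: the conditioning subsets exclude X itself, and the Wrapper drops a variable whenever performance does not get worse, whereas Figure 1 says "remains the same".

```python
from itertools import chain, combinations


def subsets(items):
    """Return an iterator over every subset of items (as tuples), smallest first."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )


def hiton_pc(variables, assoc, indep):
    """Sketch of HITON-PC.

    assoc(v)    -> association strength between v and T (assumed helper)
    indep(x, s) -> True if x is judged independent of T given the set s
                   (assumed helper wrapping a statistical test)
    """
    current_pc = []
    # Visiting variables in decreasing order of association is equivalent to
    # repeatedly admitting the not-yet-considered variable that maximizes it.
    for v in sorted(variables, key=assoc, reverse=True):
        current_pc.append(v)
        for x in list(current_pc):
            # Assumption: condition only on the other members of CurrentPC.
            others = [y for y in current_pc if y != x]
            if any(indep(x, set(s)) for s in subsets(others)):
                current_pc.remove(x)  # x is never considered again
    return current_pc


def wrapper(variables, score):
    """Sketch of the Wrapper step as backward elimination.

    score(vars) -> cross-validated performance on vars (assumed helper,
                   higher is better)
    """
    kept = list(variables)
    for v in list(variables):
        trial = [u for u in kept if u != v]
        if score(trial) >= score(kept):  # assumption: "not worse" counts
            kept = trial
    return kept


# Toy oracle, made up for illustration: A and C are true neighbours of T,
# while B is associated with T only through A.
strength = {"A": 0.8, "B": 0.9, "C": 0.5}
pc = hiton_pc(strength, strength.get, lambda x, s: x == "B" and "A" in s)
print(pc)
```

In this toy run B is admitted first because it has the highest association, and is removed as soon as A enters and the oracle reports B independent of T given {A}. Note that the subset enumeration is exponential in the size of CurrentPC, so the sketch is only practical when that set stays small.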
HITON-MB first identifies the parents and children
of T by calling algorithm HITON-PC, then discovers
the parents and children of the parents and children of
T. This is a superset of the MB(T). False positives are
removed by a statisti (...truncated)