Learning noisy linear classifiers via adaptive and selective sampling
Giovanni Cavallanti
Nicolò Cesa-Bianchi
Claudio Gentile
Avrim Blum.
C. Gentile, DICOM, Università dell'Insubria, Varese, Italy
[...] $\sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the base selective sampler, which queries all labels. Moreover, for $\alpha \to \infty$ (hard margin condition) the gap between the semi- and fully-supervised rates becomes exponential.
[...] classification · Low noise
1 Introduction
In the standard online learning protocol for binary classification the learner receives a
sequence of instances generated by an unknown source. Each time a new instance is received
the learner predicts its binary label, which is then immediately disclosed before the next
instance is observed. This protocol is natural in many applications, for instance weather
forecasting or stock market prediction, because Nature (or the market) spontaneously
reveals the true label after each of the learner's guesses. However, in many other applications
obtaining labels may be an expensive process.
In order to address this problem, selective sampling has been proposed as a more realistic
variant of the basic online learning protocol. In this variant the true label of the current
instance is never revealed unless the learner decides to issue an explicit query. The learner's
performance is then measured with respect to both the number of mistakes (made on the
entire sequence of instances) and the number of queries.
A natural sampling strategy is one that tries to identify labels which are likely to be
useful to the algorithm, and then queries those labels only. This strategy needs to combine
a measure of utility of examples with a measure of confidence. In the case of learning with
linear functions a statistic that has often been used to quantify both utility and confidence is
the margin.
In this work we follow the margin-based approach and define a selective sampling rule
that queries the label whenever the margin of the corresponding instance, with respect to the
current linear hypothesis, is smaller (in absolute value) than an adaptive threshold. Margins
are computed using a linear learning algorithm based on a simple incremental version of
regularized linear least-squares (RLS) for classification. This choice is motivated by the
fact that RLS margins can be given a natural probabilistic interpretation, thus allowing a
principled approach for setting the adaptive threshold.
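The sampling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm analyzed in the paper: the class name `SelectiveSampler`, the scale parameter `c`, and in particular the threshold formula are hypothetical choices made only to show the mechanics (compute an RLS margin, compare its absolute value with a shrinking threshold, query and update only when the margin is small).

```python
import numpy as np


class SelectiveSampler:
    """Margin-based selective sampler on top of incremental RLS (illustrative)."""

    def __init__(self, dim, c=1.0):
        self.A = np.eye(dim)    # regularized correlation matrix: I + sum of x x^T
        self.b = np.zeros(dim)  # sum of y * x over the queried examples
        self.c = c              # hypothetical threshold scale (assumption)
        self.n_queries = 0      # number of labels requested so far
        self.t = 0              # number of instances seen so far

    def margin(self, x):
        # RLS weight vector computed from the stored (queried) examples only.
        w = np.linalg.solve(self.A, self.b)
        return float(w @ x)

    def threshold(self):
        # Hypothetical adaptive threshold (assumption): it shrinks as more
        # labels are stored and grows slowly with the number of instances.
        return self.c * np.sqrt(np.log(self.t + 1) / (self.n_queries + 1))

    def step(self, x, get_label):
        """Predict the label of x; call get_label() only if the margin is small."""
        self.t += 1
        m = self.margin(x)
        prediction = 1 if m >= 0 else -1
        queried = bool(abs(m) <= self.threshold())
        if queried:
            y = get_label()            # label in {-1, +1}
            self.A += np.outer(x, x)   # incremental RLS update
            self.b += y * x
            self.n_queries += 1
        return prediction, queried
```

In use, `step` is called once per incoming instance with a callback that fetches the true label; the callback is never invoked for instances whose margin exceeds the current threshold, which is where the label savings come from.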
We also investigate a slightly modified sampling criterion for solving online adaptive
filtering tasks. In adaptive filtering the true binary label of an instance is revealed only if
the learner makes a positive prediction. A natural application domain is document filtering,
where instances represent documents and a positive prediction corresponds to forwarding
the current document to a user. If a document is forwarded, then the user returns a binary
relevance feedback (whether the document was interesting or not), which is assumed to be
the document's true label. If the document is not forwarded, that is, the filter makes a negative
prediction, then its label remains unknown. Transforming our sampling rule into a filtering
rule is simple. Since querying corresponds to forwarding, which is in turn equivalent to a
positive prediction, the transformed rule forwards all instances with a positive margin, getting
their true labels as feedback. Moreover, the rule also forwards all instances whose negative
margin is smaller in absolute value than the same adaptive threshold used in selective sampling. By doing this,
all the labels of small margin instances are obtained at the price of making some mistakes
when forwarding instances with a negative margin.
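The transformation from sampling rule to filtering rule amounts to a one-sided test on the margin. The sketch below is only illustrative: the function name `should_forward` is hypothetical, and treating a zero margin as a positive prediction is an assumption.

```python
def should_forward(margin: float, threshold: float) -> bool:
    """Filtering decision derived from the selective sampling rule (sketch)."""
    if margin >= 0:
        # Positive prediction: forward, and receive the true label as feedback.
        return True
    # Negative margin: forward anyway if it is small in absolute value, paying
    # a possible mistake in exchange for the label of an uncertain instance.
    return abs(margin) <= threshold
```

For example, with a threshold of 0.1 an instance with margin -0.05 is forwarded (its label is worth acquiring), while one with margin -0.5 is discarded and its label stays unknown.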
1.1 Overview of results
The main goal of this research is the design of efficient algorithms with a good empirical
behavior in selective sampling and filtering tasks. The experiments on a real-world dataset
reported in Sect. 3 show that our algorithms compare favorably to other selective sampling
and filtering procedures proposed in the literature (Cesa-Bianchi et al. 2006a; Dasgupta et
al. 2005; Helmbold et al. 2000; Monteleoni and Kääriäinen 2007).
In order to complement these empirical results with theoretical performance guarantees,
we introduce in Sect. 4 a stochastic model defining the distribution of examples $(X, Y)$. In
this model the label conditional distribution $\eta(x) = P(Y = 1 \mid X = x)$ is a linear function
determined by the fixed target vector $u \in \mathbb{R}^d$. Following a standard approach in
statistical learning, we parametrize the instance distribution via the Mammen-Tsybakov condition
$P(|1/2 - \eta(X)| \le \varepsilon) = O(\varepsilon^{\alpha})$.
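To make the linear noise model concrete, the following sketch draws labels from one simple linear choice of the conditional distribution. The specific form $\eta(x) = (1 + u^\top x)/2$, the label set $\{-1, +1\}$, and the requirement that $|u^\top x| \le 1$ are assumptions made here for illustration; the function names `eta`, `noise_gap` and `sample_label` are hypothetical.

```python
import numpy as np


def eta(u, x):
    """P(Y = 1 | X = x) under the assumed linear form (1 + u.x) / 2."""
    return 0.5 * (1.0 + float(np.dot(u, x)))


def noise_gap(u, x):
    """|1/2 - eta(x)|, the quantity bounded by the low noise condition."""
    return abs(0.5 - eta(u, x))


def sample_label(u, x, rng):
    """Draw a label in {-1, +1} with P(Y = 1 | X = x) = eta(u, x)."""
    return 1 if rng.random() < eta(u, x) else -1
```

Under this assumed form, `noise_gap(u, x)` equals $|u^\top x|/2$, so instances close to the hyperplane defined by $u$ are exactly those whose labels are close to a fair coin flip; the Mammen-Tsybakov exponent controls how much probability mass such instances may carry.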
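In the standard online protocol, where the true label is revealed after each prediction, we
prove in Theorem 1 that the fully supervised RLS converges to the Bayes risk at rate
$\widetilde{O}\bigl(n^{-(1+\alpha)/(2+\alpha)}\bigr)$, where $n$ is the number of observed instances, $\alpha$ is the exponent in the low noise condition, and $\widetilde{O}$ hides logarithmic factors.
We then prove in Theorem 2 that an adaptive variant of our selective sampling algorithm
converges to the Bayes risk at rate
$\widetilde{O}\bigl(n^{-(1+\alpha)/(3+\alpha)}\bigr)$,
with labels being queried at rate
$\widetilde{O}\bigl(n^{-\alpha/(2+\alpha)}\bigr)$.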
When $P(|1/2 - \eta(X)| \le \varepsilon_0) = 0$ for a certain $\varepsilon_0 > 0$ (the hard margin case), we show that
our sampling procedure converges to the Bayes risk at a rate of order $(\ln n)/n$ with only a
logarithmic number of queries, a phenomenon first observed in Freund et al. (1997) and
also, under different and more general hypotheses, in Balcan et al. (2006, 2007), Castro and
Nowak (2008), Dasgupta et al. (2005), and Hanneke (2007).
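The comparison between the selective and the fully supervised rates, and the threshold value 0.73 quoted in the abstract, can be checked with a short calculation. The sketch below treats the two selective sampling rates as assumptions (risk of order $n^{-(1+\alpha)/(3+\alpha)}$ after $n$ instances, and a fraction of order $n^{-\alpha/(2+\alpha)}$ of labels queried), drops logarithmic factors, and writes $N$ for the number of queried labels.

```latex
% Assumed rates (logarithmic factors dropped):
%   risk after n instances:      n^{-(1+\alpha)/(3+\alpha)}
%   fraction of labels queried:  n^{-\alpha/(2+\alpha)}
\begin{align*}
N &\approx n \cdot n^{-\alpha/(2+\alpha)} = n^{2/(2+\alpha)}
  \quad\Longrightarrow\quad n \approx N^{(2+\alpha)/2}, \\
n^{-(1+\alpha)/(3+\alpha)} &\approx N^{-\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)}}, \\
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
  &\iff (2+\alpha)^2 > 2(3+\alpha)
  \iff \alpha^2 + 2\alpha - 2 > 0
  \iff \alpha > \sqrt{3} - 1 \approx 0.73 .
\end{align*}
```

Under these assumptions, measuring both rates in the number of labels actually used, the selective sampler is asymptotically faster than the fully supervised rate $N^{-(1+\alpha)/(2+\alpha)}$ exactly when $\alpha > \sqrt{3} - 1$. As $\alpha \to \infty$ the exponent $2/(2+\alpha)$ tends to zero, which is consistent with the logarithmic number of queries in the hard margin case.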
1.2 Related work
Problems related to selective sampling and, more generally, to active learning are well
represented in the statistical literature, in particular in the areas of adaptive sampling and
sequential hypothesis testing (see the detailed account in Castro and Nowak (2008)). In statistical
learning, the idea of selective sampling (sometimes also called uncertainty sampling) was
first introduced by Cohn et al. (1990, 1994); see also Lewis and Gale (1994) and Muslea
et al. (2000).
Castro and Nowak (2008) study a framework in which the learner has the freedom to
query arbitrary domain points whose labels are generated stochastically. They prove risk
bounds in terms of nonparametric characterizations of both the regularity of the Bayes
decision boundary and the behavior of the noise rate in its proximity.
The idea of querying small margin instances when learning linear classifiers has been
explored many times in different active learning contexts. Campbell et al. (2000), and also
Tong and Koller (2000), study a pool-based model of active learning, where the algorithm
is allowed to interactively choose which labels to obtain from an i.i.d. pool of unlabeled
instances. A landmark result in the selective sampling protocol is the query-by-committee
algorithm of Freund et al. (1997). In the realizable (noise-free) case, and under strong
distributional assumptions, this algorithm is shown to require exponentially fewer labels than
instances when learning linear classifiers (see also Gilad-Bachrach et al. (2005) for a more
practical implementation). An exponential advantage in the realiza (...truncated)