Learning noisy linear classifiers via adaptive and selective sampling

https://link.springer.com/content/pdf/10.1007%2Fs10994-010-5191-x.pdf

Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Editor: Avrim Blum.

C. Gentile, DICOM, Università dell'Insubria, Varese, Italy

[...] for $\alpha > \sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the base selective sampler, which queries all labels. Moreover, for $\alpha \to \infty$ (hard margin condition) the gap between the semi- and fully-supervised rates becomes exponential.

Keywords: [...] classification, Low noise

1 Introduction

In the standard online learning protocol for binary classification the learner receives a sequence of instances generated by an unknown source. Each time a new instance is received the learner predicts its binary label, which is then immediately disclosed before the next instance is observed. This protocol is natural in many applications, for instance weather forecasting or stock market prediction, because Nature (or the market) spontaneously reveals the true label after each of the learner's guesses. However, in many other applications obtaining labels may be an expensive process. In order to address this problem, selective sampling has been proposed as a more realistic variant of the basic online learning protocol. In this variant the true label of the current instance is never revealed unless the learner decides to issue an explicit query. The learner's performance is then measured with respect to both the number of mistakes (made on the entire sequence of instances) and the number of queries.

A natural sampling strategy is one that tries to identify labels which are likely to be useful to the algorithm, and then queries those labels only. This strategy needs to combine a measure of utility of examples with a measure of confidence. In the case of learning with linear functions, a statistic that has often been used to quantify both utility and confidence is the margin.

In this work we follow the margin-based approach and define a selective sampling rule that queries the label whenever the margin of the corresponding instance, with respect to the current linear hypothesis, is smaller (in absolute value) than an adaptive threshold. Margins are computed using a linear learning algorithm based on a simple incremental version of regularized linear least-squares (RLS) for classification. This choice is motivated by the fact that RLS margins can be given a natural probabilistic interpretation, thus allowing a principled approach for setting the adaptive threshold.

We also investigate a slightly modified sampling criterion for solving online adaptive filtering tasks. In adaptive filtering the true binary label of an instance is revealed only if the learner makes a positive prediction. A natural application domain is document filtering, where instances represent documents and a positive prediction corresponds to forwarding the current document to a user. If a document is forwarded, then the user returns a binary relevance feedback (whether the document was interesting or not), which is assumed to be the document's true label. If the document is not forwarded, that is, if the filter makes a negative prediction, then its label remains unknown.

Transforming our sampling rule into a filtering rule is simple. Since querying corresponds to forwarding, which is in turn equivalent to a positive prediction, the transformed rule forwards all instances with a positive margin and receives their true labels as feedback. Moreover, the rule also forwards all instances whose margin is negative but smaller in absolute value than the same adaptive threshold used in selective sampling. By doing this, all the labels of small margin instances are obtained at the price of making some mistakes when forwarding instances with a negative margin.

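The sampling and filtering rules described above can be sketched in a few lines of code. The following is a minimal illustration, not the algorithm analyzed in the paper: the class name `MarginSelectiveSampler`, the scale parameter `b`, the helper `run`, the target vector `u`, and in particular the threshold form `b * ||x|| * sqrt(log(1 + t) / (1 + N))` (with `N` the number of labels queried so far) are assumptions made for this example. The only ingredients taken from the text are the RLS margin, the "query when the absolute margin is below an adaptive threshold" rule, and the filtering variant that forwards every instance whose margin is positive or only slightly negative.

```python
import math
import random


class MarginSelectiveSampler:
    """Illustrative margin-based selective sampler on top of incremental RLS."""

    def __init__(self, dim, b=1.0):
        self.dim = dim
        self.b = b  # hypothetical threshold scale (assumption of this sketch)
        # Inverse of (I + sum of x x^T over queried instances).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.v = [0.0] * dim  # sum of y * x over queried instances
        self.t = 0            # instances seen so far
        self.n_queries = 0    # labels queried so far

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.A_inv]

    def margin(self, x):
        # RLS margin w.x with w = A_inv v; A_inv is symmetric, so w.x = (A_inv x).v
        return sum(a * vi for a, vi in zip(self._matvec(x), self.v))

    def threshold(self, x):
        # Assumed threshold: grows slowly with t, shrinks as labels accumulate.
        norm = math.sqrt(sum(xi * xi for xi in x))
        return self.b * norm * math.sqrt(math.log(1 + self.t) / (1 + self.n_queries))

    def _update(self, x, y):
        # Sherman-Morrison rank-one update of the inverse after adding x x^T.
        ax = self._matvec(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, ax))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= ax[i] * ax[j] / denom
        self.v = [vi + y * xi for vi, xi in zip(self.v, x)]
        self.n_queries += 1

    def step(self, x, get_label, filtering=False):
        """Process one instance; get_label() is called only if the label is queried."""
        self.t += 1
        m = self.margin(x)
        thr = self.threshold(x)
        if filtering:
            # Forward (predict +1) on positive margins and on small negative margins.
            query = m >= -thr
            prediction = 1 if query else -1
        else:
            prediction = 1 if m >= 0 else -1
            query = abs(m) <= thr
        if query:
            self._update(x, get_label())
        return prediction, query


def run(n_steps=300, filtering=False, seed=0):
    """Simulate a stream with linear label noise, P(y = 1 | x) = (1 + u.x) / 2."""
    rng = random.Random(seed)
    u = [0.6, -0.3, 0.2]  # hypothetical target vector, norm 0.7
    sampler = MarginSelectiveSampler(dim=len(u))
    log = []
    for _ in range(n_steps):
        x = [rng.uniform(-1, 1) / math.sqrt(len(u)) for _ in u]  # norm at most 1
        p_pos = (1 + sum(ui * xi for ui, xi in zip(u, x))) / 2
        get_label = lambda: 1 if rng.random() < p_pos else -1
        log.append(sampler.step(x, get_label, filtering))
    return sampler, log
```

Passing the label as a callable makes the protocol explicit: a label is consumed only when the rule decides to pay for it. In filtering mode the query decision and the positive prediction coincide, which is exactly why forwarding instances with small negative margins costs some mistakes.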
1.1 Overview of results

The main goal of this research is the design of efficient algorithms with good empirical behavior in selective sampling and filtering tasks. The experiments on a real-world dataset reported in Sect. 3 show that our algorithms compare favorably to other selective sampling and filtering procedures proposed in the literature (Cesa-Bianchi et al. 2006a; Dasgupta et al. 2005; Helmbold et al. 2000; Monteleoni and Kääriäinen 2007).

In order to complement these empirical results with theoretical performance guarantees, we introduce in Sect. 4 a stochastic model defining the distribution of examples $(X, Y)$. In this model the label conditional distribution $\eta(x) = P(Y = 1 \mid X = x)$ is a linear function determined by the fixed target vector $u \in \mathbb{R}^d$. Following a standard approach in statistical learning, we parametrize the instance distribution via the Mammen-Tsybakov condition $P(|1/2 - \eta(X)| \le \varepsilon) = O(\varepsilon^{\alpha})$. In the standard online protocol, where the true label is revealed after each prediction, we prove in Theorem 1 that the fully supervised RLS converges to the Bayes risk at rate $\widetilde{O}(n^{-(1+\alpha)/(2+\alpha)})$. We then prove in Theorem 2 that an adaptive variant of our selective sampling algorithm converges to the Bayes risk at rate $\widetilde{O}(n^{-(1+\alpha)/(3+\alpha)})$, with labels being queried at rate $\widetilde{O}(n^{-\alpha/(2+\alpha)})$. When $P(|1/2 - \eta(X)| \le \varepsilon_0) = 0$ for a certain $\varepsilon_0 > 0$ (the hard margin case), we show that our sampling procedure converges to the Bayes risk at rate of order $(\ln n)/n$ with only a logarithmic number of queries, a phenomenon first observed in Freund et al. (1997) and also, under different and more general hypotheses, in Balcan et al. (2006, 2007), Castro and Nowak (2008), Dasgupta et al. (2005), Hanneke (2007).

1.2 Related work

Problems related to selective sampling and, more generally, to active learning are well represented in the statistical literature, in particular in the areas of adaptive sampling and sequential hypothesis testing (see the detailed account in Castro and Nowak (2008)).

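As a side calculation on the rates of Sect. 1.1, the following sketch relates a rate expressed in the number of instances $n$ to a rate expressed in the number of queried labels $N$. It assumes, ignoring logarithmic factors, a risk rate $n^{-(1+\alpha)/(3+\alpha)}$ for the selective sampler and a per-instance query rate $n^{-\alpha/(2+\alpha)}$, and it treats the query count as a deterministic power of $n$, which is a simplification made here for illustration only.

```latex
\begin{align*}
N &\approx n \cdot n^{-\alpha/(2+\alpha)} = n^{2/(2+\alpha)}
  \quad\Longrightarrow\quad n \approx N^{(2+\alpha)/2}, \\[4pt]
n^{-(1+\alpha)/(3+\alpha)} &\approx N^{-\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)}}, \\[4pt]
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
  &\iff (2+\alpha)^2 > 2(3+\alpha)
  \iff \alpha^2 + 2\alpha - 2 > 0
  \iff \alpha > \sqrt{3} - 1 \approx 0.73 .
\end{align*}
```

Under these assumptions, the selective sampler's rate in $N$ beats the fully supervised rate $N^{-(1+\alpha)/(2+\alpha)}$ exactly when $\alpha > \sqrt{3} - 1$. As $\alpha \to \infty$ the exponent $2/(2+\alpha)$ in $N \approx n^{2/(2+\alpha)}$ tends to zero, which is consistent with the logarithmic number of queries in the hard margin case.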
In statistical learning, the idea of selective sampling (sometimes also called uncertainty sampling) was first introduced by Cohn et al. (1990, 1994); see also Lewis and Gale (1994), Muslea et al. (2000). Castro and Nowak (2008) study a framework in which the learner has the freedom to query arbitrary domain points whose labels are generated stochastically. They prove risk bounds in terms of nonparametric characterizations of both the regularity of the Bayes decision boundary and the behavior of the noise rate in its proximity.

The idea of querying small margin instances when learning linear classifiers has been explored many times in different active learning contexts. Campbell et al. (2000), and also Tong and Koller (2000), study a pool-based model of active learning, where the algorithm is allowed to interactively choose which labels to obtain from an i.i.d. pool of unlabeled instances. A landmark result in the selective sampling protocol is the query-by-committee algorithm of Freund et al. (1997). In the realizable (noise-free) case, and under strong distributional assumptions, this algorithm is shown to require exponentially fewer labels than instances when learning linear classifiers (see also Gilad-Bachrach et al. (2005) for a more practical implementation). An exponential advantage in the realiza (...truncated)