Machine Learning Theory: Binary Classification

PAC Learning Theory

Overview of results in PAC learning theory (from the introduction of "The sample complexity of agnostic learning under deterministic labels"):

The analysis of the sample complexity in the PAC model is divided into two main cases, according to whether the class H contains a zero-error classifier or not. In the first case, usually referred to as the realizable setup, the sample complexity of learning H is (roughly) Θ(VCdim(H)/ε). In the second case, the agnostic setup, no such assumption is made. In particular, the labeling function of the underlying data-generating distribution is not necessarily deterministic. In this case, the sample complexity is (roughly) Θ(VCdim(H)/ε²). Proving the lower bound in the agnostic case usually involves taking advantage of the stochasticity in the labeling function. Non-realizability, the lack of a zero-error classifier in the learner's search space, can be caused by stochasticity, or "noise", of the labeling rule (with respect to the features considered) or by the limitations of the learner's search space. The first case has attracted a lot of attention in the past decade. Mammen and Tsybakov (1999) initiated some finer-grained analysis. Assuming that the Bayes optimal classifier is a member of the class, they analyze the sample complexity of ERM classifiers and prove a bound on their error rates that interpolates between the realizable and the agnostic case. These bounds involve a parameter, Tsybakov's noise exponent, usually denoted by α, that restricts the amount of noise in the labeling. Table 1 illustrates the relationship between these three cases (realizable, fully agnostic, and realizable with controlled noise).
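Written out with the confidence parameter δ made explicit (the quoted passage suppresses the log(1/δ) term for brevity), the two standard sample complexity bounds read:

```latex
\[
m_{\mathrm{realizable}}(\epsilon,\delta)
  = \Theta\!\left(\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{\epsilon}\right),
\qquad
m_{\mathrm{agnostic}}(\epsilon,\delta)
  = \Theta\!\left(\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{\epsilon^{2}}\right).
\]
```

The quadratic dependence on 1/ε in the agnostic case is what the lower-bound arguments mentioned above exploit: distinguishing a label bias of 1/2 + ε from 1/2 − ε requires on the order of 1/ε² samples.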

And later from the related work:

The PAC framework for binary classification learning was first introduced by Valiant (1984). Blumer et al. (1989) characterize learnability of a binary hypothesis class in terms of its VC dimension. Essentially, this characterization goes back to Vapnik and Chervonenkis (1971). The agnostic PAC model was introduced by Haussler (1992). In both cases, the sample complexity of any empirical risk minimization (ERM) learner is equal to that of the best possible learning algorithm. The gap between the error rates in the realizable and the agnostic case has attracted quite a lot of attention. These efforts can be viewed along two directions. An assumption of a small approximation error by a class allows leveraging the Bennett and Bernstein inequalities and yields bounds that imply fast rates in the range of generalization error that is higher than the approximation error of the class. This analysis goes back to Vapnik and Chervonenkis (1971). More recently, Mammen and Tsybakov (1999) considered the setting in which the Bayes classifier belongs to the class H, and therefore the stochasticity of the labeling (or its "noise") is the only source of the approximation error. They introduce the Tsybakov noise condition, a bound on the stochasticity (or noisiness) of the labels, and prove convergence rates under that condition (and the additional assumption that the Bayes optimal classifier is a member of the hypothesis class). Tsybakov (2004) generalizes these results to the case where the Bayes classifier is only assumed to be a member of some collection of known hypothesis classes. Boucheron et al. (2005) provide an analysis (under the Tsybakov noise condition) that does not impose restrictions on the Bayes classifier. However, the obtained convergence rates depend on the approximation error of the class (they become weaker as the approximation error grows).
Our results indicate that the relationship between noise and fast rates, as captured by these conditions, is indeed restricted to the case where the Bayes classifier is in the class (or is very well approximated by a member of the class).
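The ERM learner referenced throughout can be made concrete with a small sketch. The hypothesis class (1-D threshold classifiers), the data distribution, and the 10% label-flip rate below are illustrative assumptions, not taken from the paper; they set up an agnostic scenario in which the Bayes classifier lies in the class but the labeling is stochastic, so zero error is unattainable.

```python
# Minimal ERM sketch (illustrative assumptions, not the paper's setup):
# hypothesis class H = thresholds on a grid, labels flipped with
# probability 0.1, so the Bayes error is 0.1 and the setup is agnostic.
import random

random.seed(0)

def threshold_classifier(t):
    """h_t(x) = 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0

# Finite hypothesis class: 21 thresholds on a grid in [0, 1].
H = [threshold_classifier(t / 20) for t in range(21)]

def sample(m, noise=0.1):
    """Draw m points: x uniform on [0,1], true label 1[x >= 0.5],
    flipped independently with probability `noise`."""
    data = []
    for _ in range(m):
        x = random.random()
        y = 1 if x >= 0.5 else 0
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def empirical_error(h, data):
    """Fraction of points in `data` that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def erm(H, data):
    """Empirical risk minimization: pick the hypothesis in H
    with the smallest empirical error on `data`."""
    return min(H, key=lambda h: empirical_error(h, data))

train = sample(500)
h_hat = erm(H, train)
test = sample(10_000)
print("empirical error:", empirical_error(h_hat, train))
print("test error:", empirical_error(h_hat, test))
```

Because the labels are noisy, the test error of the ERM hypothesis converges to the Bayes error (0.1 here) rather than to zero, which is exactly the regime the agnostic bounds above address.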

