ERM with no assumptions. Last post covered the concept of Empirical Risk Minimization (ERM) for finding a model function \( h:\mathcal X \to \mathcal Y\), with a focus on binary classification. Given a training set \( \mathcal X^{train} = S\), the empirical risk \( L_S(h)\) is the fraction of points from \( S\) that \( h\) incorrectly labels. The ERM rule is to use a function, a classifier, that minimizes the empirical risk (training error). However, the example in that post (classifying points in the upper-left square) provided an important lesson:
( small training error ) \( \;\not\!\!\!\implies\) ( small generalization error ).
At least, this is so when \( h\) is permitted to be any possible function — when no assumption is made on \( h\).
To rephrase this setback: though only the training set is accessible, the goal is to learn (approximately) the true distribution \( \mathcal D\). Thinking of nothing else beyond making the training error zero will be insufficient. This may seem insurmountable, but remember that the training set is sampled from \( \mathcal D\) and so it is to be expected (on average) that \( S\) reflects something about \( \mathcal D\), provided the size of the training set is sufficiently large.
ERM with Bias (or, use Science). I’ll begin with a clarification, hopefully one that you assumed. The word bias discussed here is not meant with the negative connotation that it has in a cultural context. The two uses do, however, have similar semantic origin — some prior assumption(s) is(are) being made. In the case of ERM, the assumptions are about the data and \( \mathcal D\). Essentially, to introduce bias, you make the set of possible model functions smaller.
It is, perhaps at first, a bit counter-intuitive, but restricting the set of available classifying functions gives better generalization results. If I were a statistician, I would say that the reason is noise in the data1 2. I am not a statistician, though.
I would like to take an alternate discussion route (one that dovetails well with the idea of signal vs. noise, I think). From centuries of scientific thought came a hard-won lesson: deciding on the hypothesis after the fact, after examining the data, makes for terrible inferences. The results won’t generalize (in scientific language, won’t be repeatable). That is the essence of the scientific method; and ML is supposed to be, after all, part of data science.
The scientific method: Make a hypothesis before looking at the observations (data). Careful analysis afterwards either supports or rejects the hypothesis.
In the ML setting, deciding that you will only work with certain types of functions is a hypothesis about the true distribution of the data (with labels), \( \mathcal D\). Call the set of possible functions that you might get from your algorithm a hypothesis class.
For example, in the binary classification setting, if your hypothesis class (the set of functions that are possible through the algorithm you use) are half-space models, then you are making a hypothesis that the data is linearly separable. Good sample data and a large error should make you reach the conclusion that the population data is not linearly separable.
I will write \( \mathcal H\) for the hypothesis class of functions. For the purposes of the sample \( S\), and the task under consideration, the empirical risk is minimized within \( \mathcal H\), so the function \(h\) that you use satisfies \(h\in\mathcal H\). The testing of the hypothesis involves taking the \(h\) obtained from ERM within \(\mathcal H\) and computing the loss on a test set of data (which was not used to determine \(h\)).
Already, this represents one aspect of learning. If the ERM rule within \( \mathcal H\) produces bad results, then you learn that \( \mathcal H\) is not a good reflection of the true picture, not a good reflection of \( \mathcal D\). Another aspect of learning will come up in the future, when \( L_S(h)\) (something related to it, rather) will be used to “update” the current \( h\) in an attempt to minimize the loss.
1. (Sorry if I just put words into statisticians’ mouths.)
2. The nature of noise is that it does not have predictive value, but it occurs on top of a phenomenon/relationship that can be used for prediction (a signal). The training data (which are observations) have that noise “baked-in.” The goal is to model the signal, not the noise also.
A class \( \mathcal H\) where ERM gives low generalization error, probably.
The goal of this section is to prove that if there are only finitely many functions in \( \mathcal H\), then the ERM rule will give a model that approximately knows the distribution \( \mathcal D\) (i.e. has low true loss on the distribution), with high probability3. This will be done under some assumptions on \( \mathcal H\) and on \( \mathcal D\) and \( S\). It will also concern binary classification problems. Some of the assumptions could be loosened, but are there for simplicity.
To be more precise, let \( h_S \in \mathcal H\) be a model given by the ERM rule for the training set \( S\): \( h_S\) minimizes empirical risk among functions in \( \mathcal H\). The theorem to prove is:
Theorem. Given \( 0 < \delta < 1\) and \( \varepsilon > 0\), then for any \( \mathcal D\) and \( \mathcal H\) (within the assumptions), there is a sufficiently large size of \( S\) such that \( L_{\mathcal D, h^\star}(h_S) \le \varepsilon\), with probability \( 1 – \delta\).
The assumptions.
⁍ Finitely many hypotheses: The set \( \mathcal H\) is finite. Denote its size by \( |\mathcal H|\).
⁍ Realizability: There is a true labeling function in \( \mathcal H\). IOW, there is an \( h^\star \in \mathcal H\) with true loss equal to zero. (Note: the algorithm, which uses \( S\) to determine \(h_S\), might find an \( h_S \ne h^\star\).)
⁍ Independently, identically distributed (i.i.d.) sampling: Each point in \( S\) is sampled only according to the fixed distribution \( \mathcal D\), regardless (independent) of the sampling of the other points.
🐉🐉 Let \( m\) be the number of points in a training set. The i.i.d. sampling assumption implies that the probability of choosing \( S\) is the same as the probability of choosing a length \( m\) list \( [x_1,x_2,\ldots, x_m]\), with each \( x_i\) sampled according to \( \mathcal D\). For example, if \( \mathcal X\) is finite (with \( N\) points, say) and \( \mathcal D\) is the uniform distribution, then each \( x_i\) has probability \( 1/N\), so the probability of each list is \( 1/N^{m}\). (It is possible that \( x_i\) repeats; this is sampling with replacement.) The notation \( S \sim \mathcal D^m\) is sometimes used. For the entire proof, we assume that \( S\sim\mathcal D^m\) every time we write \( S\).
Proof: The proof hinges on identifying a revealing pair of subsets: a subset of “bad” hypothesis functions in \( \mathcal H\) and a subset of “misleading” training samples. The bad set of hypothesis functions is
\( \mathcal H_B = \{h \in \mathcal H\ |\ L_{\mathcal D, h^\star}(h) > \varepsilon\}.\)
Note that a function in this set will have high loss (higher than we want), no matter what training set gets used. The subset of misleading training sets is
\( M = \{S\ |\ \text{there is }h \in \mathcal H_B \text{ where }L_S(h) = 0\}.\)
If you think about the set definition for \( M\), it contains all the training sets that could produce an \( h_S \in\mathcal H_B\) under the ERM rule4. So, for a training set in \( M\) there is a bad hypothesis that looks like a good one, according to \( S\).
We want \( \delta\) to be an upper bound on the probability that a training set comes from \( M\), and it produces an \( h_S \in \mathcal H_B\). That is, we are looking to bound the probability of
\( \{S\ |\ L_{\mathcal D, h^\star}(h_S) > \varepsilon\} = \{S\ |\ S\in M\text{ and }h_S\in\mathcal H_B\} \subset M.\)
Since the probability of a sample being in a union of sets is not more than the sum of the probabilities of being in any one of them, and since
\( M = {\displaystyle \bigcup_{h\in\mathcal H_B} \{S\ |\ L_S(h) = 0\} },\)
we have shown that
\( {\displaystyle {\bf P}(S\ |\ L_{\mathcal D,h^\star}(h_S) > \varepsilon) \le {\bf P}(S\ |\ S\in M) \le \sum_{h\in\mathcal H_B}{\bf P}(S\ |\ L_S(h)=0)}.\)
Since points in \( S\) are sampled i.i.d. it is the case that for each \( h\),
\( {\displaystyle {\bf P}(S\ |\ L_S(h)=0) = \prod_{i=1}^m {\bf P}(x_i\sim\mathcal D\ |\ h(x_i)=h^{\star}(x_i))}.\)
For any particular \( h\in\mathcal H_B\), the probability that it correctly labels a single \( x_i\) is equal to \( 1 – L_{\mathcal D,h^\star}(h)\), which is less than \( 1 – \varepsilon\) by the definition of \( \mathcal H_B\). Notice that \( 1-\varepsilon \le e^{-\varepsilon}\) (by the convexity of the exponential function, and its derivative being equal to -1 at 0: Figure! \( \to\)). This means the probability that one \( h\in\mathcal H_B\) will correctly label every point in a size \( m\) sample is less than \( e^{-\varepsilon m}\). Therefore, we have shown that
\( {\displaystyle {\bf P}(S\ |\ L_{\mathcal D, h^\star}(h_S) > \varepsilon) < \sum_{h\in\mathcal H_B}e^{-\varepsilon m} \le |\mathcal H|e^{-\varepsilon m}}.\)
If we let \( {\displaystyle m \ge \frac{\ln(|\mathcal H|/\delta)}{\varepsilon}}\), then this implies that the probability that \( L_{\mathcal D,h^\star}(h_S) \le \varepsilon\), which equals \( 1 – {\bf P}(S\ |\ L_{\mathcal D, h^\star}(h_S) > \varepsilon)\), is larger than
\( 1 – |\mathcal H|\text{exp}(-\varepsilon m) \ge 1 – |\mathcal H|\text{exp}(\ln(\delta/|\mathcal H|)) = 1 – \delta. \quad\quad\square\)
⏚
3. The theorem in this section, and its proof, is found in Chapter 2 in the Shalev-Shwartz and Ben-David book (see reference books).
4. All, because of the Realizability assumption — since \( h^{\star}\in\mathcal H\) has \( L_S(h^\star)=0\) for all \( S\), the only way to get a bad model is for it to also have zero training error on \( S\).
Hypothesis class sizes, in theory and practice. Consider the setting of linear regression on points in the plane. The hypothesis class is the set of functions
\( \mathcal H = \{x\mapsto wx + b\ |\ b, w \in\mathbb R\}.\)
Since two lines cannot agree if their ordered pairs \( (b, w)\) do not agree, this hypothesis class is in bijective correspondence with \( \mathbb R^2\), an infinite set. For linear regression with instances \( {\bf x_i} \in \mathbb R^d\), the model function takes the form \( {\bf w}\cdot{\bf x} + b\), where \( {\bf w} \in \mathbb R^d\). Thus, in that more general case, the hypothesis class for linear regression corresponds to points in \( \mathbb R^{d+1}\).
The hypothesis class for the half-space model, with instances in \( \mathbb R^d\), is in correspondence with points a space that is “close to” \( \mathbb R^{d}\). What about logistic regression models?
Point of order: while this discussion of infinite hypothesis classes is true in theory, computers use floating-point numbers, and so a computer system only has access to finitely many numbers for each coordinate. The implication is that, in practice, these hypothesis classes are actually finite.
Importantly, finite does not mean small. A typical computer (with 64-bit processor) has a precision of approximately 16 decimal digits (53 bits) and an exponent up to about 3 decimal digits (11 bits). This means that for each exponent there are a little less than \( 10^{16}\) numbers (there is a bit of overlap as you vary exponents, but it is small). So, roughly, there are \( 10^{16}\) distinct numbers for each exponent of 10 and there are approximately 1000 exponents. This makes the number of available numbers about \( 10^{19}\).
Thus, when doing half-space models (in practice) with instances having dimension \( d\), the set \( \mathcal H\) has size on the order of \( 3(10^{19d})\) — the model is determined by a \( (d+1)\)-tuple \( ({\bf w}, b)\), each coordinate has about \( 10^{19}\) possibilities, and some of these tuples determine the same function5. If considering the above theorem, recall the realizability assumption. If the “true” distribution is linearly separable with a separating hyperplane in \( \mathcal H\), then the theorem applies.
For example, under the assumptions and in dimension two, the theorem guarantees the resulting half-space model is accurate 99% of the time (\( \varepsilon = 0.01\)) with probability 99% (\( \delta=0.01\)) if the sample \( S\) has at least \( 100\cdot\ln(300\cdot 10^{38}) \approx 9,300\) points6. (The bound is pretty rough; in fact, you don’t need nearly so many points for the highly probable, high accuracy.)
OTOH, recall the model function, discussed near the end of the last post, for deciding that a point in \( [0,1]\times [0,1]\) is not in the upper-left square:
\( h_S(x) = \begin{cases}y_i,\ \text{if }x = x_i \\ -1, otherwise.\end{cases}\)
Allowing for \( S\) to be any possible training set, this also defines a finite hypothesis class, in practice (since floating point numbers only allow for finitely many points in the space). However, the class is huge. Does a large \( S\) guarantee 99% accuracy with probability over 99%?
Note: there are approximately \( 10^{16}\) numbers for each coordinate, so there are \( 10^{32}\) points in the instance space. Even if \( |S| = 10^{31}\) (an absurdly large training set), the function \( h_S\) will still only be accurate on \( 10^{31} + (9\cdot 10^{31})/4 = (1/10 + 9/40)\cdot 10^{32}\) points, which is a loss of \( 0.675 = 1 – 0.325\) (it is wrong 67.5% of the time).
5. Multiplying \( b\) and all the coordinates of \( {\bf w}\) by the same positive number does not change the half-space model function. That means pairs \( ({\bf w}, b)\) and \( ({\bf w’}, b’)\) determine the same element of \( \mathcal H\) if there is a number \( c>0\) so that \( {\bf w} = c{\bf w’}\) and \( b = cb’\). Thus every pair \( ({\bf w}, b)\) determines the same model as \( ({\bf w’}, 1)\), \( ({\bf w’}, 0)\) or \( ({\bf w’}, -1),\) where \( |b|{\bf w’} = {\bf w}\) if \( |b|\ne 0\) and \( {\bf w’} = {\bf w}\) otherwise. The count then follows, there being \( 10^{19d}\) possible vectors \( {\bf w’}\).
6. Note that size of \( S\) needed grows linearly in the dimension, with a large slope if \( \varepsilon\) is small.