We have under our belt two basic machine learning models and their related ideas – for classification problems, the half-space model; for modeling problems, linear regression. Upon considering how those models are determined, in particular how linear regression was determined, there is a very sensible question to ask: How is this learning? In this post1, we’ll take a step back to reconsider the overarching goal for ML model functions — minimizing the mistakes the model will make on yet unseen data. The probability of a mistake is often called the risk of the model, the generalization error, or the (true) loss.
The perspective that considers a model’s loss will make it possible to generate more powerful predictors, and it will provide the setting to make what we are doing seem more like learning.
Contents
Introducing Probability
In the Binary Classification post, I stated that a machine learning task roughly attempts to solve one of two types of problems: a classification problem (given \( {\bf x}\), what type of thing is it?) or a modeling/regression problem (what relationship does \( y\) have to \( {\bf x}\)?). In that discussion, I described the classification problem as an attempt to approximate a “true” labeling function \( h^\star: \mathcal X \to \mathcal Y\) with another function \( h\). For classification, \( \mathcal Y\) is a finite set containing elements that represent each possible type. The modeling problem can be described similarly, with \( \mathcal Y\) being a continuum (for example, \( \mathbb R\) or an interval of \( \mathbb R\)).
Here I describe machine learning tasks from a slightly different, but not incompatible, perspective. In the tasks of ML, uncertainty (arising from incomplete information, from necessary simplifications, or from physical nature itself) must not be assumed away, not completely anyway.
Instead of looking for a function so that \( h({\bf x}) \in\mathcal Y\) for each \( {\bf x}\in\mathcal X\), perhaps we make a random variable for each \(\bf x\). That is, we consider a sample space of observations of getting the point \(\bf x\) (from the data setting), rather than just \({\bf x}\in\mathbb R^n\). The target space is the labels \(\mathcal Y\) (or a way of placing the labels in \(\mathbb R\)), and the probability mass function (pmf)2 defined on \(\mathcal Y\) is often thought of as being the conditional probability \(p(y\ |\ {\bf x})\). (See this post for a refresher on random variables.)
The pmf (or pdf, possibly) determined by \({\bf x}\) is sometimes called the distribution, given \({\bf x}\), so I will use the notation \( D_{\bf x}: \mathcal Y \to \mathbb R\) for this pmf. If it is a pmf, then we understand its properties by summing over labels in \(\mathcal Y\); if it is a pdf, then we use integrals over \(\mathcal Y\) instead. Merely to simplify notation, from now on I will write statements in terms of integrals over \(\mathcal Y\), or a part of \(\mathcal Y\). If, in an example, the random variable is discrete, then I ask you, the reader, to make it the appropriate sum in your head. So, we know that
\( {\displaystyle\int_{y\in\mathcal Y} D_{\bf x}(y)\ dy = 1}.\)
I already discussed one model in this way, in the topic of the logistic regression model. The sigmoid function that was used, \( f({\bf x}) = \sigma(k({\bf w}\cdot{\bf x} + b))\), did not have values in \( \mathcal Y = \{+1,-1\}\). Instead, you think of the model assigning to \( {\bf x}\) the distribution \( D_{\bf x}(1) = f({\bf x})\), \( D_{\bf x}(-1) = 1 – f({\bf x})\). 3 Knowing this distribution, one could get a deterministic function so that \(h({\bf x})=1\) when \(f({\bf x}) > 0.5\) and \(h({\bf x})=-1\) otherwise (an instance of using MLE, I suppose). However, the value of \(f({\bf x})\) gives additional information, about your certainty of the label.
Returning to the ML task, assume still that there is a true labeling function. So for each \( {\bf x}\in\mathcal X\) there is a label \( h^\star({\bf x}) \in\mathcal Y\). However, this function is unknown and, for interesting tasks, unknowable. There is a joint probability distribution \( \mathcal D\) on \( \mathcal X \times\mathcal Y\) — a random variable for observations and a random variable for labels — so that observation, label pairs are drawn from \(\mathcal D\), so \(({\bf x}, h^\star({\bf x})) \sim\mathcal D\). Notice that \( D_{\bf x}\) is the conditional distribution, for \( \mathcal X={\bf x}\).
One can only access a training set \( \mathcal X^{train}\) and its labels \( \mathcal Y^{train}\); these were sampled from the distribution \( \mathcal D\), which is also unknown (if we knew \( \mathcal D\) and \( h^\star\) already, then there is no need for ML!).
The goal of an ML algorithm is to produce a model that approximates \( h^\star\) well, with the only information being \( \mathcal X^{train}\) and \( \mathcal Y^{train}\). To make “approximates well” more concrete, a model \( h\) will have an error or loss. The loss can be thought of as follows: sample \( ({\bf x}, y)\) from \( \mathcal D\), compute the model’s label \( h({\bf x})\). What is the probability that it was wrong? It is the measure, with respect to \( \mathcal D\), of those \( {\bf x}\) where \( h({\bf x}) \ne h^\star({\bf x})\). This probability of being wrong is the risk, true loss, or generalization error of \( h\) and is equal to
\( L_{\mathcal D, h^\star}(h)\) ≝ \( {\bf P}( h({\bf x}) \ne h^\star({\bf x}) )\), when \(({\bf x}, h^\star({\bf x})) \sim \mathcal D\)
Empirical Risk Minimization
It is impossible to compute \( L_{\mathcal D, h^\star}(h)\). Is there something that can be computed, and would represent a best guess to the loss? Since only the training set is accessible, you could compute the number of training labels the model gets wrong, divided by the number of training points:
\( L_{S}(h)\) ≝ \( {\displaystyle\frac{|\{i : h({\bf x_i}) \ne y_i\}|}{m}}.\)
The \( S\) is for notational simplicity, but is just the training set, \( S = \mathcal X^{train}\), with the letter \( S\) standing for “sampled”; the size of \(S\) is \(m\). Each \(\bf x_i\) in the equation above is one of the points in \(S\). The function \( L_{S}(h)\) is often called the empirical risk or training error of \( h\).
With a measure for the training error, it seems intuitive to find an \( h\) so that \( L_{S}(h)\) is minimized. This is the paradigm of Empirical Risk Minimization (ERM).
ERM alone is not enough
Consider the binary classification problem, \( \mathcal Y = \{1, -1\}\). Given training data \( S=\mathcal X^{train} = [{\bf x_1},\ldots,{\bf x_m}]^\top\) and \( \mathcal Y^{train}=(y_1,\ldots,y_m)\), you could try using the following function to classify a point \(\bf x\):
\( h^{\times}({\bf x}) = \begin{cases}y_i,\ \text{if }{\bf x} = {\bf x_i} \\ -1,\ \text{otherwise}\end{cases}\).
Notice that \( L_{S}(h^{\times})\) is always equal to 0, for any sampled training set coming from a distribution \( \mathcal D\). That seemed too easy; you should be skeptical.
- Does \( h^{\times}\) actually give a bad classifying function, even though its training error is zero?
Consider an example situation. It will hopefully make the absurdity of the model \( h^{\times}\) apparent.
Suppose that the instance space, the set of potential data points, is \( \mathcal X = [0,1]\times[0,1]\) (a data point could potentially be any point in the plane which has both of its coordinates between 0 and 1). The task is to decide if a point is outside the red square depicted below: if a point is in the square, its correct label is -1, and a point outside the square has label +1.
Suppose that \( \mathcal X^{train}\) is the set of 50 points that are displayed. There are 37 points outside the upper-left, red square and 13 points inside it. Then the definition of \( h^{\times}\) causes it to assign +1 to the 37 points that are blue and -1 to those which are red.
Spot check. By its definition, what is \(h^{\times}({\bf x})\) for some point \(\bf x\in\mathcal X\) that is not in the set of 50 training points?
The points in \(\mathcal X\) are in the plane and uniformly distributed. Since a set of 50 points has area zero in the plane, the probability of picking one of them in a random draw from \(\mathcal X\) is zero. So, being correct on those points is immaterial to the generalization error. It is necessarily the case that \( L_{\mathcal D, h^\star}(h^\times) = 3/4\): with probability one, points in the square and out of the square get label -1, making \( h^\times\) wrong 3/4 of the time.
While the example may have seemed somewhat artificial, it does clearly illustrate the issue with the classifier \( h^{\times}\). So, the approach that simply attempts to minimize empirical risk (with no requirement on the function used, at least) produces bad results.
Moving forward
In a way, one that must be overcome, the function \( h^{\times}\) focused too much on the training set, and on getting 0 for the training error. What might be done to improve how the model works on non-training data?
The model function \( h^{\times}\) is an example of overfitting. Another example can occur in the polynomial generalization of linear regression. Say you use a degree \( m-1\) polynomial with a training set of size \( m\). Such a polynomial would fit the training data perfectly (for essentially every training set), so it would have training error 0. But this happens simply because you sampled \(m\) points from \(\mathcal D\), not due to some relation between points in \(\mathcal X\) and those in \(\mathcal Y\) (arising from what the distribution \( \mathcal D\) looks like). Such a high degree polynomial tells you essentially nothing about \( \mathcal D\), and so the regression will have high generalization error.
What can be done to avoid overfitting since only the training data can be accessed? A valuable approach, somewhat surprisingly, is to restrict the types of model functions allowed. This is called introducing bias. By saying the model must be of a certain type, you only allow for a selected kind of relationship between \( {\bf x}\) and \( \hat y\). The next post will discuss introducing bias.
- This post follows the first part of Chapter 2 in the Shalev-Shwartz and Ben-David book (see reference books). ↩
- Or the probability density function (pdf), in the event of a continuous random variable. ↩
- So, \(\bf x\) determines a Bernoulli distribution. ↩