The previous post introduced the idea of a hypothesis class \( \mathcal H\) and of a model \( h_S\) determined by a training set \( S = \mathcal X^{train}\). In practice, think of choosing an algorithm that works on a training set and determines a model (for example, the linear algebra procedure that determines the LSR line from data). 1 If \( \mathcal H\) is more restricted (has more bias), then stronger assumptions are being made about how the data is really distributed. Of course, introducing too much bias will hurt the predictive potential of the resulting model as well.
In this post, we will see a decomposition of the expected value of a loss function (the MSE in regression) into the Bias and the Complexity of \( \mathcal H\), and we will see the need to balance them in order to get high quality results from machine learning algorithms.
Not complex enough.
As a spot check, consider an extreme case where \( \mathcal H\) contains only constant functions.
Binary classification: would mean the model can only output +1 for every instance, or output -1 for every instance (so \( |\mathcal H| = 2\)). That is bad news if the true distribution of the data, \( \mathcal D\), has both labels +1 and -1 on significantly many points in \( \mathcal X\).
Linear regression: would mean you are restricting to horizontal regression lines (since having constant \( y\)-values on the line gives a constant model output). For simplicity, suppose that the source of the data requires the \( y\)-values to be in some interval \( [a,b]\), and the distribution of \( y\)-values is uniform in that range. 2
Among all possibilities for \( S\), the observed \( y\)-values in \( S\) are more likely to have an average close to the overall average \( \bar y = \frac{a+b}2\) than one far from it. 3 In fact, the expected horizontal regression line is \( y = \bar y\). Since the data is distributed uniformly throughout the range, there is a large variance between the observations and this expected model's predictions. That is,
\( {\displaystyle\frac{1}{m}\sum_{i=1}^m (\hat y_i - y_i)^2 = \frac{1}{m}\sum_{i=1}^m (\bar y - y_i)^2}\)
will be large. This is a simple example of the Bias being large. It is somewhat like a gambler noticing that the average outcome of a (single-zero) roulette wheel is 18, and deciding to always bet on 18.
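A quick numerical sketch of that high Bias (the interval \( [a,b] = [0,10]\) and the sample sizes below are my own choices, only for illustration): no matter how many observations the constant model gets to see, its MSE stays near the variance of the uniform distribution, \( (b-a)^2/12\).

import numpy as np

rng = np.random.default_rng(0)
a, b = 0.0, 10.0                          # hypothetical range for the y-values

for m in (10, 100, 10000):                # increasing training-set sizes
    y_train = rng.uniform(a, b, size=m)
    h = y_train.mean()                    # best constant model: predict the training mean
    y_test = rng.uniform(a, b, size=200000)
    mse = np.mean((h - y_test)**2)
    print(m, round(mse, 2))               # hovers near (b - a)**2 / 12 = 8.33...

More data only pins the horizontal line closer to \( y = \bar y\); it cannot reduce the spread of the data around that line.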
Complexity and the Bias-Complexity decomposition.
Suppose that the MSE is being used to measure the error of the model (it is the loss function). 4 To use the MSE to understand the expected generalization error, rather than the error of just one model, it is necessary to consider the variance between \( h_S({\bf x})\) and \( y\) not only over the instance space \( \mathcal X\), but also over the set of possible training sets (each of which gives rise to a particular model \( h_S\)).
Notationally,5 you want to understand the average \( \mathbb E_{S, {\bf x}, y}[(h_S({\bf x}) - y)^2]\). We will break this expression up in a couple of ways in order to express it in pieces that involve how you expect the algorithm and its resulting model to perform on average.
Aside: Let \( \bar h\) be the expected model function as you vary the training set. This can be difficult to wrap your head around, since it is a weighted average of functions. The mathematics behind this boils down to having a “measure” on the set of functions (a way of giving some subsets a size), and that allows you to integrate something that produces a function as its output.
In practice, you can approximate \( \bar h\) as follows. Take a “sufficiently large” number of training samples \(S_1,S_2,\ldots,S_N\), with each containing the same number of points, drawn i.i.d. from the data distribution \( \mathcal D\). If \( h_i\) is the model determined by \( S_i\), then for each \( {\bf x}\)
\( {\displaystyle \bar h({\bf x}) \approx \frac{1}{N}\sum_{i=1}^Nh_i({\bf x})}.\)
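Here is a minimal sketch of that approximation in code. The distribution behind sample_training_set and the choice of learner (LSR lines, via np.polyfit) are stand-ins I made up just to have something concrete; any learner fits the same pattern.

import numpy as np

rng = np.random.default_rng(1)

def sample_training_set(m=10):
    """Draw a size-m training set from a made-up distribution D."""
    x = rng.uniform(0, 4, size=m)
    y = x**2 + rng.normal(scale=0.5, size=m)
    return x, y

def fit_lsr_line(x_train, y_train):
    """Return the LSR line for this training set as a callable model h_i."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return lambda x: slope*x + intercept

x_grid = np.linspace(0, 4, 50)        # the x-values at which h-bar is approximated
N = 2000                              # the number of training sets S_1, ..., S_N
h_bar = sum(fit_lsr_line(*sample_training_set())(x_grid) for _ in range(N)) / N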
🐉🐉 Now, back to the expected MSE. By expanding what is being squared we get
\( \mathbb E_{S, {\bf x}, y}[(h_S({\bf x}) - y)^2] = \mathbb E_{S,{\bf x},y}[(h_S({\bf x}) - \bar h({\bf x}) + \bar h({\bf x}) - y)^2]\)
\( =\mathbb E_{S,{\bf x}}[(h_S({\bf x}) - \bar h({\bf x}))^2] + \mathbb E_{{\bf x},y}[(\bar h({\bf x}) - y)^2] + 2\mathbb E_{S,{\bf x},y}[(h_S({\bf x}) - \bar h({\bf x}))(\bar h({\bf x}) - y)].\)
By definition \( \bar h = \mathbb E_S[h_S]\), and also the factor \( (\bar h({\bf x}) - y)\) is independent of changing \( S\). So the mixed term equals 0:
\( \mathbb E_{S,{\bf x},y}[(h_S({\bf x}) - \bar h({\bf x}))(\bar h({\bf x}) - y)] = \mathbb E_{{\bf x}, y}\left[\mathbb E_S[h_S({\bf x}) - \bar h({\bf x})](\bar h({\bf x}) - y)\right] = 0.\)
And so we have shown, \(\mathbb E_{S,{\bf x},y}[(h_S({\bf x}) - y)^2] = \mathbb E_{S,{\bf x}}[(h_S({\bf x}) - \bar h({\bf x}))^2] + \mathbb E_{{\bf x},y}[(\bar h({\bf x}) - y)^2]\).
Letting \( \bar y({\bf x}) \overset{def}{=} \int_{\mathcal Y}y D_{\bf x}(y)\ dy\), the expected value of \( y\) given \( {\bf x}\), we note that
\( \mathbb E_{{\bf x},y}[(\bar h({\bf x}) - y)^2] = \mathbb E_{{\bf x},y}[(\bar h({\bf x}) - \bar y({\bf x}) + \bar y({\bf x}) - y)^2].\)
Using the definition of \( \bar y({\bf x})\), and the fact that \( \bar h({\bf x}) - \bar y({\bf x})\) does not depend on \( y\), the mixed term also vanishes when you expand. The end result is that the expected generalization error \( \mathbb E_{S,{\bf x},y}[(h_S({\bf x})-y)^2]\) is equal to
\( \underbrace{\mathbb E_{S,{\bf x}}[(h_S({\bf x})-\bar h({\bf x}))^2]}_{\text{Complexity}}+ \underbrace{\mathbb E_{\bf x}[(\bar h({\bf x})-\bar y({\bf x}))^2]}_{\text{Bias}^2}+ \underbrace{\mathbb E_{{\bf x},y}[(\bar y({\bf x})-y)^2]}_{\text{Noise}}.\)
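To spell out the step used there: for each fixed \( {\bf x}\), the definition of \( \bar y({\bf x})\) gives \( \mathbb E_y[\bar y({\bf x}) - y] = 0\), so taking the expectation over \( y\) first shows
\( \mathbb E_{{\bf x},y}[(\bar h({\bf x}) - \bar y({\bf x}))(\bar y({\bf x}) - y)] = \mathbb E_{{\bf x}}\left[(\bar h({\bf x}) - \bar y({\bf x}))\,\mathbb E_{y}[\bar y({\bf x}) - y]\right] = 0.\)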
⏚
Takeaway
The three terms of the decomposition are the Complexity (Variance), the Bias, and the Noise. Before discussing each of them, notice that none of them are ever negative and they sum to the expected error. This means that, if one wishes to keep expected error constant, an increase in Complexity requires a decrease in Bias and vice-versa. Moreover, since none of them can be less than zero, you want to avoid an overly high Complexity, regardless of Bias (and likewise avoid an overly high Bias).
The Noise measures the average variance of the functions \( D_{\bf x}\). It is not affected at all by \( \mathcal H\) and is intrinsic to the data being studied. As such, it represents an unavoidable lower bound on generalization error.
The Complexity measures how much variation in \( h_S\) can be caused by varying the training data. The more specialized that \( h_S\) is to a particular \( S\), the larger the Complexity. Having high Complexity means you are overfitting — the functions in \(\mathcal H\) that are being selected, based on sampled training data, have more degrees of freedom than the data distribution warrants. 6
The Bias measures how much the expected \( y\)-values differ from the average prediction made by models in \( \mathcal H\). That is, after averaging out the variation resulting from different training sets, it is the variance from the expected labels. This part of the error will be larger when \(\mathcal H\) does not contain enough candidate functions (or sufficiently complicated ones) to fit the data well.
Example.
Let’s consider using linear regression in a simple example. I will assume that there is no Noise — the observed \( y\)-values are equal to \( \bar y({\bf x})\), and there will only be 12 points total in \( \mathcal X\), all with equal weight. Each expected value in the decomposition is, therefore, a simple average.
There are \( 2^{12} - 13 = 4083\) possible training sets, of all sizes,7 and the expected model function \( \bar h\) is given by the line that is the average of the LSR lines for these sample sets.8
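(As a quick check of that count in Python: the subsets of sizes 2 through 12 of a 12-element set number \( 2^{12}\) minus the empty set and the 12 singletons.)

import math
print(sum(math.comb(12, k) for k in range(2, 13)))   # 4083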
I generated 12 points for the set \(\mathcal X\), which appear in the figure to the right. They were generated with the random package in Numpy:9 the array of \( x\)-coordinates was sampled uniformly in the range \([0,4)\), and the \( y\)-coordinates were given by a function that also used the random package. More specifically, the coordinates were generated by calling the following:
x = 4 * np.random.random(size=12)
y = (1.6 + np.exp(-(x-2)**2))*x + 1.1 + 0.6*np.random.normal()
Taking advantage of the itertools package in Python, you can obtain all size m+2 subsets of the 12 indices of the points with

tuples = list(itertools.combinations(set(range(12)), m+2))
Calculate and store the slopes, intercepts, and mean squared errors of the LSR lines for all \( \binom{12}{m+2}\) training sets of size m+2 in a for loop, and do so for each m in range(11) (so m+2 will go between 2 and 12).
for t in tuples:
    x_list = [x[t[i]] for i in range(m+2)]
    y_list = [y[t[i]] for i in range(m+2)]
    opt = np.polyfit(x_list, y_list, 1)
    yhat = opt[0]*np.array(x) + opt[1]
    slopes.append(opt[0])
    intercepts.append(opt[1])
    mses.append(metrics.mean_squared_error(yhat, y))
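Putting the snippets together, a self-contained version of the whole computation might look like the following sketch (the seed and the exact assembly are my additions; only the individual pieces appear above):

import itertools
import numpy as np
from sklearn import metrics

np.random.seed(0)   # hypothetical seed, only so the sketch is reproducible
x = 4 * np.random.random(size=12)
y = (1.6 + np.exp(-(x-2)**2))*x + 1.1 + 0.6*np.random.normal()

slopes, intercepts, mses = [], [], []           # accumulated over all 4083 training sets
for m in range(11):                             # training-set sizes m+2 = 2, ..., 12
    tuples = list(itertools.combinations(set(range(12)), m+2))
    for t in tuples:
        x_list = [x[t[i]] for i in range(m+2)]
        y_list = [y[t[i]] for i in range(m+2)]
        opt = np.polyfit(x_list, y_list, 1)     # LSR line for this training set
        yhat = opt[0]*np.array(x) + opt[1]      # its predictions at all 12 points
        slopes.append(opt[0])
        intercepts.append(opt[1])
        mses.append(metrics.mean_squared_error(yhat, y))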
Since each single \( S\) is equally likely, the expected model function is the line with slope np.mean(slopes) and with intercept np.mean(intercepts). This line is drawn in black below, the LSR line of all 12 points is in green, and a plot of the LSR lines for 1/8 of the subsets is to the right.
The Complexity is found by calculating the average variance (over x) between the models \( h_S\) and the expected model \( \bar h\). In this example, the Complexity is around 1.1. The MSE of the expected model (which is the Bias here, as we assumed no Noise) is approximately 0.902. Adding these together you get the expected MSE of the various lines (up to rounding error). That is, defining yhat to be the \( y\)-values predicted by the expected model function, and s to be the sum of the variances between \( h_S\) and \( \bar h\), the output of the following code, on these points, was \( \approx 8.88\cdot 10^{-16}\):
metrics.mean_squared_error(yhat, y) + s/len(slopes) - np.mean(mses)
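For reference, continuing from the assembled sketch above, here is one way the quantities yhat and s in that check could be computed from the stored slopes and intercepts (a reconstruction on my part; the post does not show this code):

slope_bar, intercept_bar = np.mean(slopes), np.mean(intercepts)
yhat = slope_bar*np.array(x) + intercept_bar      # predictions of the expected model

# For each training set, average (over the 12 x-values) the squared difference
# between that set's LSR line and the expected model, then sum over training sets.
s = sum(np.mean((sl*np.array(x) + ic - yhat)**2)
        for sl, ic in zip(slopes, intercepts))

complexity = s / len(slopes)                      # about 1.1 for these points
bias = metrics.mean_squared_error(yhat, y)        # about 0.902 (no Noise here)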
One can repeat the calculation, but for higher degree polynomials: you must keep track of more coefficients (\( d+1\) coefficients for a degree \( d\) polynomial), and change the degree in the line above where polyfit was used.
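Concretely, only the fitting lines in the inner loop change; a sketch for degree d (the list coeffs is a hypothetical replacement for slopes and intercepts):

opt = np.polyfit(x_list, y_list, d)       # d+1 coefficients, highest degree first
yhat = np.polyval(opt, np.array(x))       # evaluate the degree-d fit at all 12 points
coeffs.append(opt)                        # store all d+1 coefficients
mses.append(metrics.mean_squared_error(yhat, y))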
With this example set of points, using quadratic polynomials, the Complexity came to approximately 117.4, the Bias to about 0.5. For cubic polynomials, the Complexity was approximately 12893 and the Bias was 2.353. The expected model functions are shown in red above.
- From one point of view, the choice of algorithm determines a hypothesis set \(\mathcal H\) implicitly. ↩
- A uniform distribution on the \(y\)-values could make perfect sense here. Remember we have said that the hypothesis class \(\mathcal H\) is made of constant functions, so how the position of \(y\) might depend on \(x\) is being ignored anyway. ↩
- Think about what \( S\) would need to look like for the average of the observed \(y\)-values to be barely below \( b\). While any one such sample set is just as likely as any other sample set (of the same size), the lack of room for the points makes the number of such sets much smaller than the number of sets whose \( y\)-values are spread over the interval. ↩
- This can be done for other loss functions too; the MSE simply provides an example that is convenient and mathematically simple. ↩
- This section will be littered with notation of the form \( \mathbb E_*\lbrack\ \underline{\ }\ \rbrack\), the asterisk being a few subscripts. The notation stands for the expected value of what appears (the function) in the brackets. That is, it is the average value of that function, where the meaning of average is determined by the subscripts: what you average over.
Computationally, each is a center of mass calculation using the appropriate probability density function. Recall the pdf \( D_{\bf x}\) introduced in a previous post.
As an example, given \(\bf x\) and a function that predicts a \(y\)-value of \(h_S({\bf x})\), the expected value \(\mathbb E_{y}\lbrack(h_S({\bf x}) - y)^2\rbrack\) is equal to \(\int_{\mathcal Y}(h_S({\bf x}) - y)^2 D_{\bf x}(y)\,dy\). ↩
- The definition of Complexity may make it appear to be unrelated to the accuracy of the model on the data. However, the decomposition that was derived implies differently. ↩
- Since the system is under-determined when there is only one point (or none), subsets of those sizes are ignored. ↩
- In general practice you, of course, cannot consider all possible training sets (even those near a given size). ↩
- As in previous posts, the abbreviation np has been made for the package numpy. ↩