Guest author Rachel Gorenstein is writing for this post on k-Nearest Neighbors. She did an awesome job! 👏
k-Nearest Neighbors, abbreviated kNN, is an example of a supervised learning algorithm used for classification and regression. Stripped down to its most basic assumptions, the algorithm is akin to “majority rules” decision making. It operates by first memorizing a given training set of observations and their labels, and subsequently predicting the labels of new, unseen instances based on those of their “nearest neighbors.”
kNN Classification. Let’s say that we are given a set of observations \((x_1, y_1), \ldots, (x_m, y_m)\), where the \(x_i\)’s denote features (also called predictors or attributes) and the \(y_i\)’s denote their respective labels (also called the target). The goal of kNN is to capture the relationship between the features and their labels in order to learn some function \(h\) that can predict the corresponding \(y\) given an unseen observation, \(x\). kNN tells us that the label of \(x\) will be determined by the majority vote of the \(k\) most similar instances to \(x\). Here, similarity is defined according to a distance metric.
Using random and shuffle from the NumPy library, you can generate a set of labeled observations. One such set is shown below:
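For concreteness, here is one way such a set might be generated. The cluster centers, spread, and color labels below are illustrative assumptions, not the exact values behind the figures; the resulting x, y, and colors arrays are what the classifier code below expects.

import numpy as np

# Hypothetical cluster centers and color labels (illustrative only)
centers = [(1, 1, 'red'), (4, 4, 'blue'), (1, 4, 'green')]

x, y, colors = [], [], []
for cx, cy, label in centers:
    x.extend(cx + 0.6 * np.random.randn(20))   # 20 points scattered around (cx, cy)
    y.extend(cy + 0.6 * np.random.randn(20))
    colors.extend([label] * 20)

# Shuffle the observations while keeping features and labels aligned
order = np.arange(len(x))
np.random.shuffle(order)
x = np.array(x)[order]
y = np.array(y)[order]
colors = np.array(colors)[order]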
In the figures the points in the plane are the features and each color is a label. The unseen point (plotted in black) and its nearest three neighbors are circled. kNN concludes that it is safe to assume that the unseen point would be labeled green. This fact is confirmed using the KNeighborsClassifier from the scikit-learn library in Python as below:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
X = list(zip(x, y))   # pair each x-coordinate with its y-coordinate
c = list(colors)      # the corresponding color labels
knn.fit(X, c)
pred = knn.predict([[1, 3.25]])   # predict the label of the unseen point
The process of kNN can be more formally described in a few steps. Given a set of training observations \((x_i, y_i)\), a positive integer \(k\), an unseen observation \(x\), and a distance metric \(d\):
Compute \(d\) between \(x\) and each \(x_i\) to determine the \(k\) nearest points to \(x\), and call this set \(A\)
For each possible label, compute the fraction of points in \(A\) with that label
The input \(x\) should be assigned the label which comprises the highest fraction of \(A\)
Note: choosing an odd \(k\) will help to avoid ties.
When \(k=1\). Using Voronoi and voronoi_plot_2d from the SciPy library, you can plot the Voronoi tessellation of the plane1 such that each training point is the generating point in its own Voronoi region2 as in the figure.
Each region in the figure is labeled according to its center/generating point. Therefore, in the case of \(k=1\), the label of an unseen observation will be determined according to which Voronoi region it falls in.
1.The partitioning of a plane with n points into regions such that each region contains exactly one generating point and every point in a given region is closer to its generating point than to any other. 2.What Are Voronoi Diagrams?
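A hedged sketch of producing such a plot, reusing the x, y, and colors arrays from the earlier sketch; the exact styling of the figure is not reproduced.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Each training point is the generating point of its own Voronoi region
points = np.column_stack((x, y))
vor = Voronoi(points)

fig = voronoi_plot_2d(vor, show_vertices=False)
plt.scatter(x, y, c=colors)   # color each generating point by its label
plt.show()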
kNN Regression. In addition to the KNeighborsClassifier, the scikit-learn library has the KNeighborsRegressor, which can be used when the target variable (the label) is continuous in nature. The process by which the kNN regressor works is nearly identical to that of the kNN classifier (see section kNN Classification), except that the input \(x\) will be assigned the label which is either the average or median of all \(y_i\)’s in \(A\), the set of training points that are the \(k\) nearest points to \(x\).
We will use the two figures below to illustrate how the kNeighborsRegressor works.
The figure to the right is the plot of \(y\) for 40 random values of \(x\). The regressor can be fit to the data using varying values for \(k\) (aka n_neighbors) as below:
from sklearn.neighbors import KNeighborsRegressor
nnregress = KNeighborsRegressor(n_neighbors=k)
nnregress.fit(X, y)
Using this regressor, \(y\) is predicted for 500 random values of \(x\) via nnregress.predict and plotted in orange along with the original function (blue) in the figures below.
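The following is a minimal end-to-end sketch of this experiment. The target function (a sine curve), the range of the inputs, and the value of k are illustrative assumptions, since the post's exact data are not reproduced here.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)   # 40 random values of x
y = np.sin(X).ravel()                            # assumed target function

k = 5                                            # n_neighbors; the figures vary this
nnregress = KNeighborsRegressor(n_neighbors=k)
nnregress.fit(X, y)

T = np.sort(5 * np.random.rand(500, 1), axis=0)  # 500 random values of x
y_pred = nnregress.predict(T)                    # predictions plotted in orange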
Applications. kNN is useful for a variety of classification problems in a variety of fields. The algorithm can be used to help determine how a potential voter may vote or even whether or not they will vote at all. The algorithm can be used by financial institutions to predict the credit rating of customers or by potential Airbnb hosts to determine how much to charge renters per night. Some more advanced applications include text classification, handwriting recognition, and even image or video detection.
Drawbacks of kNN. While the kNN algorithm certainly has its advantages, it also suffers as a result of the curse of dimensionality3. With a small number of input variables, kNN operates with excellent efficiency. However, as the number of features and the size of the data set grow, the efficiency and speed of the algorithm decline. The computational cost is high because the algorithm requires that the distance between each unseen observation and each training observation be calculated.
Once you have decided what architecture your neural network will have (number of layers, vertices and edges within each layer, and the type of activation function, among numerous other possible choices), and you have a data set for training, how do you actually train the network? That is, how do you determine the weights on the various edges? The short answer is that you use Stochastic Gradient Descent. However, for a neural network there is much more to SGD than in a single logistic regression model.
The backpropagation algorithm is a procedure for carrying out SGD on a neural network. I’ll explain it here. After that I will give an example of implementing a neural network.
The general idea of the backpropagation algorithm is clean and straightforward, but a complication arises in the implementation that muddies the picture once the details are filled in.
The General Idea. We want to compute partial derivatives of the loss function with respect to weights in the network for the purpose of stochastic gradient descent. First, compute the partials of the loss function with respect to the weights in the last layer. Then, since (mathematically speaking) the layers in neural networks are functions being composed together, the chain rule from calculus should allow us to reuse the partials that were already computed in order to more quickly find the partials with respect to weights in the second-to-last layer. After that, we can use the chain rule again to quickly find partials with respect to weights in the third-to-last layer, and so on in that manner, going backward through the network.
Implementing this gets a bit complicated because the weights in the network are not themselves functions of the input data that gets passed through the network; the act of passing data through ("forward propagation") is where the composition occurs. However, the weights do get used in forward propagation, so the idea still works, just with some more complicated bookkeeping.
Recall the notation that was set up in the introductory post on Neural Networks. For simplicity, assume that the output layer \( V_L\) has one vertex. We’ll allow the activation to be any function \(a\), not only the sigmoid function. For the time being, consider just one sample point \( ({\bf x}, y)\) from the data set, \( y\) being the label.
Furthermore, say there is a current set of weights on the network (when initializing the network, use some random process to generate initial weights). Write \( {\bf W}\) for the current weights, thought of as a list of matrices \({\bf W}=\lbrack{\bf W}^{0}, {\bf W}^{1}, \ldots, {\bf W}^{L-1}\rbrack\), one for every two consecutive layers. The weight \( w^t_{i,j}\) in row \( i\) and column \( j\) of the matrix \( {\bf W^t}\) is the weight on the edge from \( v^t_j\), vertex \( j\) in layer \( V_t\), to \( v^{t+1}_i\), which is vertex \( i\) in layer \( V_{t+1}\).
Now, write \( {\bf O^t}\) for the vector of values taken by vertices in layer \( V_t\) when \( {\bf x}\) (a single sample datum) is the input to the network. That is, \({\bf O^t}\) is the forward propagated representation of \({\bf x}\). For example, \( {\bf O^0} = {\bf x}\). Using \({\bf W}_1^0\) for the first row of the matrix \({\bf W}^0\), we set \(z_1^1 = {\bf W}_1^0\cdot{\bf x}\), and then \(o^1_1 = a(z_1^1)\) is the first component of the vector \( {\bf O^1}\); the entire vector being given by \({\bf O}^1 = (a(z_1^1), a(z_2^1), \ldots) \).
We will simplify notation by allowing for vector input to the activation function \(a\), meaning that \(a\) is applied to each component of the vector. Then, the values in layer \(V_t\) of the network would be written \( {\bf O^t} = a({\bf z}^t) = a({\bf W^{t-1}}{\bf O^{t-1}})\). 1
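To make these conventions concrete, here is a small NumPy sketch of forward propagation. The sigmoid activation and the function names are placeholders for illustration, not the post's actual code.

import numpy as np

def a(z):
    """Activation applied componentwise; a sigmoid is used as a placeholder."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W):
    """Forward propagate input x through weight matrices W = [W^0, ..., W^{L-1}].

    Returns the layer values [O^0, O^1, ..., O^L] and the pre-activation
    vectors [z^1, ..., z^L], which backpropagation reuses.
    """
    O = [x]          # O^0 = x
    zs = []
    for Wt in W:
        z = Wt @ O[-1]       # z^{t+1} = W^t O^t
        zs.append(z)
        O.append(a(z))       # O^{t+1} = a(z^{t+1})
    return O, zs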
The backpropagation computation occurs a layer at a time. Recall that the sample datum \({\bf x}\) has been fixed. Assume that we have computed all its forward propagated representations \({\bf O^t}, t=1,\ldots,L\). By restricting to partial derivatives of \(\ell\) with respect to weights before a given layer, we can think of having a loss function for each layer (except the input layer). That is, for each \(t=0,1,\ldots,L-1\) let \(\ell_{t+1}\) be the loss (on \({\bf x}\)) of the network, which has input variables being the weights \( w^t_{i,j}\), with \(i=1,\ldots,k_{t+1}\) and \(j=1,\ldots,k_t\).
Now, we have that \( \ell_t\) can be understood from \( \ell_{t+1}\) by thinking of function composition. More precisely, \( \ell_t({\bf O^t}) = \ell_{t+1}({\bf O^{t+1}}) \), which means that values of \(\ell_t\) are given by \(\ell_{t+1}(a({\bf W^{t}}{\bf O^{t}})) \), and \({\bf O^t} = a({\bf W^{t-1}}{\bf O^{t-1}})\). For a given \(i\) with \(1\le i\le k_t\), write \(o^t_i\) for component \(i\) of \({\bf O^t}\). Note that \(o^t_i = a(z^t_i) \) and that \( z^t_i = {\bf W}^{t-1}_i{\bf O^{t-1}} \).
By the chain rule, \(\dfrac{\partial\ell_t}{\partial w^{t-1}_{i,j}} = \dfrac{\partial\ell_t}{\partial o^t_i}\dfrac{\partial o^t_i}{\partial z^t_i}\dfrac{\partial z^t_i}{\partial w^{t-1}_{i,j}} = \dfrac{\partial\ell_t}{\partial o^t_i} a'(z^t_i) o^{t-1}_j \). This means that we can quickly get this partial derivative if we have also already calculated \(\dfrac{\partial\ell_t}{\partial o^t_i}\).
Now, say that \(t < L\). Then note that for each \(k = 1,2,\ldots,k_{t+1}\), we have that \(z^{t+1}_k = {\bf W}^t_k{\bf O^t} = \sum_{i'} w^t_{k,i'}o^t_{i'}\). Hence, the partial of \(z^{t+1}_k\) with respect to \(o^t_i\) is going to be \(w^t_{k,i}\). Recall that \(\ell_t({\bf O^t}) = \ell_{t+1}(a({\bf W^{t}}{\bf O^{t}})) \), which gives us
\(\displaystyle \dfrac{\partial\ell_t}{\partial o^t_i} = \sum_{k=1}^{k_{t+1}} \dfrac{\partial\ell_{t+1}}{\partial o^{t+1}_k}\, a'(z^{t+1}_k)\, w^t_{k,i}.\)
This reduces the calculation of \(\dfrac{\partial\ell_t}{\partial o^t_i}\) to having calculated \(\dfrac{\partial\ell_{t+1}}{\partial o^{t+1}_k}\) for each \(k\), which was done previously as we backpropagate through the network. In the interest of implementing the calculation in a “vectorized” way: let \(\nabla_{\bf o}\ell_{t+1}\) be the row vector of partials of \(\ell_{t+1}\) with respect to components of \({\bf O}^{t+1}\), let \(\Delta(a'(z^{t+1}))\) be the diagonal matrix with values \(a'(z^{t+1}_k)\) on the diagonal, and let \({\bf W}^t_{\ ,i}\) be column \(i\) of the matrix \({\bf W^t}\) (as a column vector). Then the above equation tells us that \(\dfrac{\partial\ell_t}{\partial o^t_i} = \nabla_{\bf o}\ell_{t+1}\Delta(a'(z^{t+1})){\bf W}^t_{\ ,i}\).
The first computed partial derivative \(\dfrac{\partial\ell_L}{\partial o^L_1}\), from the output layer, will have a known form that depends on the loss function being used. For example, if the activation function in the output layer is \(\sigma\) and the negative log-loss function is being used, then in our current notation we would have \(\ell_L(o^L_1) = -\left(\overset{\circ}{y}\log(o^L_1) + (1 - \overset{\circ}{y})\log(1 - o^L_1)\right)\). As a result, we would have that
\(\displaystyle \dfrac{\partial\ell_L}{\partial o^L_1} = -\dfrac{\overset{\circ}{y}}{o^L_1} + \dfrac{1 - \overset{\circ}{y}}{1 - o^L_1}.\)
These computations were all done on a single data point and label \( ({\bf x}, y)\). You work with a subset of the training set (called a mini-batch in neural networks) and average the partial derivatives you get for each point in the mini-batch. You then subtract \( \eta\) times that average from the current \( {\bf W}\).
In addition, you often repeat this for a set of mini-batches that cover the entire training set. After you go through the whole training set, this is referred to as an epoch of training.
There is code below that implements backpropagation, followed by some code that trains a neural network.
Footnote: the log-loss is the loss function being used; it is the per-sample log-loss from the single logistic regression post, written in terms of \( \overset{\circ}{y}\) and with the network output \( o^L_1\) in place of \( \sigma(\tilde{\bf w}\cdot\tilde{\bf x})\).
Footnote: the subscript of \( \nabla\) means: evaluate the gradient at that point.
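Since the implementation itself is not reproduced in this excerpt, here is a minimal sketch of backpropagation that follows the formulas above. It reuses the forward function from the earlier sketch and assumes a sigmoid activation, a single output vertex, and the log-loss; it is illustrative rather than the post's actual code.

import numpy as np

def a_prime(z):
    """Derivative of the sigmoid placeholder activation."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop(x, y, W):
    """One backpropagation pass on a single sample (x, y), walking backward
    through the layers as described above.  Returns one gradient matrix per W^t.
    """
    O, zs = forward(x, W)                 # forward propagated representations
    grads = [np.zeros_like(Wt) for Wt in W]

    # partial of the loss with respect to the output o^L_1 (log-loss form)
    o_L = O[-1][0]
    y_circ = (y + 1) / 2                  # labels +-1 mapped to 0/1
    dl_do = np.array([-(y_circ / o_L) + (1 - y_circ) / (1 - o_L)])

    for t in reversed(range(len(W))):
        delta = dl_do * a_prime(zs[t])    # dl/dz^{t+1}, componentwise
        grads[t] = np.outer(delta, O[t])  # dl/dw^t_{i,j} = delta_i * o^t_j
        dl_do = W[t].T @ delta            # dl/do^t, used one layer further back
    return grads

A training step would then average these gradient matrices over a mini-batch and subtract \( \eta\) times the average from the corresponding \( {\bf W}^t\).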
To test the code on a basic example, I begin by generating the labeled data set in the plane shown at the right. This is the training set; it contains 200 points. The red points (and likewise the blue points) in the data set are grouped quite closely together. Clearly this set is linearly separable; one should be able to get high accuracy from a single neuron. However, I would like to try something with a little more structure, so I set up a neural network with layers \( V_0, V_1,\) and \( V_2\), where \( V_0\) and \( V_1\) have 3 vertices each, and the output layer \( V_2\) has a single vertex. As described in the initial discussion on neural networks, the first two vertices in \( V_0\) take as values the first and second coordinates of the input point, and the third vertex has value 1. The sigmoid activation function uses \( k=2\).
The initial 12 weights (\( 3\times 3\) between layers \( V_0\) and \( V_1\) and 3 between \( V_1\) and \( V_2\)) are each obtained by sampling from a uniform distribution on the interval \( (-1.5, 1.5)\). I ran the backpropagation, using mini-batches of size 10, for 46 epochs. The learning rate of each epoch was either 1 or 2 (alternating roughly every 10 epochs at first, but remaining 2 for the last 20 epochs). After 46 training epochs, the network had 100% accuracy on the training set (it correctly labeled each training point). The figure on the left above shows how this network labels points near the origin.
I should note that the resulting network (and how it labels points in the plane) is quite dependent on what the initial weights were (it can also be affected somewhat by which points end up in each mini-batch). At a certain point during the training, the weight vector \( {\bf W}\) is very likely to be near where a local minimum of the loss occurs, and from then on it stays near that local minimum. However, there are almost certainly many local minima of the loss as a function of the weights. The particular one you settle near can dramatically affect how many epochs are required to obtain high accuracy. For example, on another training session, using the same training data, and where every other choice made was identical to the first, I obtained a network with 100% accuracy after only 26 epochs. Its prediction on points in the plane is shown on the right above.
In a second example, I will train a neural network on the training data shown to the right. This data set is not linearly separable. The intention is that positively labeled points (blue) should be contained in a bounded set around the origin (including points other than the sampled training points). First, a neural network that approximates this intended task must have more vertices in a hidden layer than in the input layer. In fact, by considering the bounded positive region to be the points that are positively labeled in all of four different logistic regression models, one can know how to explicitly construct a neural network (with weights) that will achieve what is desired. Make the hidden layer have 4+1 neurons (one for each of the four logistic regressions that the central points are positive for, and one other whose value is needed in the final layer); make one output neuron, its 5 weights being such that you approximate the AND function on the first four neurons. The weights to achieve this are below, and a network with weights near those will classify points as in the figure to the right.
But this is certainly cheating. In real applications you won’t know how to set the weights yourself, but will need to use backpropagation to find good weights. Because of the tendency of neural networks to fall into wells around local minima, the initial values of the weights can have significant influence on the final network. So what is a way to let the network find the right spot through training?
Roughly speaking, don’t always use the last set of weights from gradient descent for the weights in the next epoch. Instead, find the “best” \( {\bf W}\) among those encountered during that epoch and sometimes use that for the start \( {\bf W}\) of the next epoch. By “best,” I mean select the \( {\bf W}\) that has minimum training loss. Since during a “best” search you want to be bouncing around to somewhat distant locations, use a large learning rate.
In the example, I began with 1 epoch, with \( \eta = 5\), and I took the best \( {\bf W}\) after the epoch. I followed this by 10 epochs, with \( \eta = 2\), where the last \( {\bf W}\) was selected; then 30 epochs with \( \eta = 1\), where the best was selected (the hope here being that the network has made it near a good local minimum (a deep “well”), so the gradient might be rather short, but that well could still have other local minima within it — so I am trying to avoid those); then 5 epochs where I average the \( {\bf W}\) vectors; then 10 epochs, with learning rate \( \eta = 0.5\), and taking the last vector. The resulting network labels points as shown. It only mislabels 1 of the 160 training points. Its approximate weights are displayed below.
Note: the expression \({\bf W^{t-1}}{\bf O^{t-1}}\) is the result of multiplying a \(k_t\times k_{t-1}\) matrix by a vector with \(k_{t-1}\) components, giving the \(k_t\) component vector \({\bf z}^t\). ↩
As the title implies, the goal of this post is to describe how to use Stochastic Gradient Descent (SGD) for a logistic regression model. Logistic regression models are central to (artificial) neural networks, our next topic (central to those that use “sigmoid neurons” at least). With SGD for logistic regression, you can find a hyperplane that does a reasonably good job (not perfect, of course) on the binary classification problem, as long as the data is “close enough” to being linearly separable.
Differentiable loss function for binary classification. When using SGD to fit a model function to data, the empirical risk does not work as a loss function (see the Empirical Risk Minimization section in this post). A differentiable function is needed.1 Such a function was introduced in the Loss function section of the logistic regression post; it is called the log-loss function. Recall that correctly labeled points, especially those far from the hyperplane, contribute close to zero to the log-loss.
A hyperplane for a logistic regression model is determined by parameters \( \tilde{\bf w}=(w_1,\ldots,w_d,w_{d+1})\) (think of \(w_{d+1}\) as the shift parameter \( b\)). Set \( \tilde{\bf x}\) to be the \((d+1)\)-vector whose first \(d\) components are those of a data point \(\bf x\), and so the last component is the number 1. With these definitions, \(\tilde{\bf w}\cdot\tilde{\bf x} = {\bf w}\cdot{\bf x} + b\), where \({\bf w} = (w_1,\ldots,w_d)\) is the normal vector to the hyperplane.
Recall that \( \sigma_k\) is the sigmoid function \( \sigma_k(t) = (1+e^{-kt})^{-1}\), which has values between 0 and 1. Given a constant \( k>0\) and parameter vector \(\tilde{\bf w}\), the logistic regression model is defined by \(h({\bf x}) = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x})\). Recall that \(h({\bf x})\) was interpreted as the probability that the label for \( {\bf x}\) is +1.
Given a training data set \(S = [{\bf x}_1,\ldots,{\bf x}_m]^\top\), say that each \({\bf x}_i\) in the training set has correct label \(y_i\). Finally, recall the log-loss function \( \ell_S\) which measures the loss (error) of \(h\) by:
\(\displaystyle \ell_S(h) = \frac1m\sum_{i=1}^m \ln\left(1 + e^{-ky_i\,\tilde{\bf w}\cdot\tilde{\bf x}_i}\right).\)
Note that the input to \( \ell_S\) is the function \(h\), not a point \( {\bf x}\). As \(h\) is determined by the parameters \(\tilde{\bf w} = (w_1,\ldots,w_d,w_{d+1})\), we may view \(\ell_S\) as a function of variables \( (w_1,w_2,\ldots,w_d, w_{d+1})\).2 These are the variables with respect to which the gradient needs to be calculated.
Now, let’s discuss one step in SGD using the log-loss function on a small set \(S\), having five labeled points in \(\mathbb R^2\). A plot of the labeled points appears to the right, with the dashed line corresponding to an initial vector \( \tilde{\bf w}_0 = (0, 1, -3)\).3 Use \( k=1\) for this example.
Recall that, on step \(t\), to update the vector \(\tilde{\bf w}_t\), you subtract \(\eta\) times the gradient of the loss function (in this case, \(\eta\nabla\ell_S\)), evaluated at \(\tilde{\bf w}_t\). After the discussion below, you will be able to check that the gradient \( \nabla \ell_S\), evaluated at \( \tilde{\bf w}_0 = (0, 1, -3)\), is approximately \( (0.60, 1.04, 0.30)\). Using a learning rate \( \eta = 0.2\), one update to the \( \tilde{\bf w}_0\) vector gives a vector corresponding to the solid line:
\(\tilde{\bf w}_1 = \tilde{\bf w}_0 - 0.2\,(0.60, 1.04, 0.30) \approx (-0.12, 0.79, -3.06).\)
If you alter the data set \(S\), say by including an additional positive point at (2, 3.5) as shown on the right, then the function itself \( \ell_S\) is different than before, as there is an additional point over which you sum. For this new \(S\) (and the same initial \( \tilde{\bf w}_0\)) the gradient \( \nabla \ell_S\) evaluated at \(\tilde{\bf w}_0 = (0,1,-3)\) is approximately \( (0.37, 0.65, 0.19)\). Using the same learning rate, the updated vector \(\tilde{\bf w}_1\) corresponds to the solid line on the right. Note that the extra positive point shortened the gradient vector. Even though the two negative points were in the same (wrong) position on the positive side of the initial line, if the updated line were to move across them the new blue point would cause an increase in the loss that did not exist before we added a point to \(S\). The update is trying to balance that, settling on something in between.
The gradient of \( \ell_S\). 🐉🐉 (Derivatives, and the chain rule in this section.) Calculating the gradient requires finding each partial derivative \( \frac{\partial \ell_S}{\partial w_j}\), for \( j=1,2,\ldots,d+1\). I’ll begin by computing the partial derivative of just \( \ln\left(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\right)\), where \( \tilde{\bf x}_i\) is one of the points in \(S\) (augmented with 1 in the \(d+1\) spot). Here is that computation:
First, suppose that \(\tilde{\bf x} = (x_1,\ldots,x_d,1)\). Then
\(-k\tilde{\bf w}\cdot\tilde{\bf x} = -k\left(w_1x_1 + w_2x_2 + \cdots + w_dx_d + w_{d+1}\right),\)
so we have \( \frac{\partial}{\partial w_j}(-k\tilde{\bf w}\cdot\tilde{\bf x}) = -kx_j\), the \( j^{th}\) coordinate of \( \tilde{\bf x}\) times \( -k\).
Now, for \(\tilde{\bf x}_i\in S\), write \( x_{i,j}\) for the \( j^{th}\) coordinate of \(\tilde{\bf x}_i\). By using the chain rule multiple times we have
\(\displaystyle \frac{\partial}{\partial w_j}\ln\left(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\right) = \frac{-kx_{i,j}\,e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}}{1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}}. \qquad(\ast)\)
Using a bit of algebra gives \( 1 - \sigma_k(A) = \frac{e^{-kA}}{1+e^{-kA}}\), where \( A\) could be replaced by any quantity. So the above computation implies that
\(\displaystyle \frac{\partial}{\partial w_j}\ln\left(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\right) = -kx_{i,j}\left(1 - \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)\right). \qquad(\dagger)\)
Looking at the definition of \( \ell_S\), the label of \( {\bf x}\) is also in play: the exponent is \( -ky_i\tilde{\bf w}\cdot\tilde{\bf x}_i\) not just \( -k\tilde{\bf w}\cdot\tilde{\bf x}_i\). Of course, there is no difference when \( y_i = 1\).
I’ll use some notation that appeared in the post on logistic regression to write \((\dagger)\) concisely, regardless of \( y_i\). Set \( \overset{\circ}{y_i} = (y_i+1)/2\) and note that \( y_i = 1\) makes \( \overset{\circ}{y_i} = 1.\) With that in mind, if \( y_i = 1\) equation \( (\dagger)\) can be restated as
\(\displaystyle \frac{\partial}{\partial w_j}\ln\left(1+e^{-ky_i\tilde{\bf w}\cdot\tilde{\bf x}_i}\right) = -kx_{i,j}\left(\overset{\circ}{y_i} - \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)\right). \qquad(\ddagger)\)
The same equation holds when \( y_i = -1\). To verify this you need a bit more algebra. In particular, notice that by both multiplying and dividing by \( e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\) you get that
\(\displaystyle \frac{e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i}}{1+e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i}} = \frac{1}{1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}} = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i).\)
Since having \( y_i = -1\) essentially “removes negative signs in front of \( k\)” in the computations of (\( \ast\)), this last bit of algebra tells us that
\(\displaystyle \frac{\partial}{\partial w_j}\ln\left(1+e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i}\right) = kx_{i,j}\,\sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i),\)
and that equation says the same thing as (\( \ddagger\)) when \( y_i = -1\) since, in that case, \( \overset{\circ}{y_i} =0\).
Using the properties of derivatives (that is, how they work with sums and multiples of functions), the \( j^{th}\) component of the gradient has been computed: The \( j^{th}\) component of \( \nabla \ell_S\) is
\(\displaystyle \frac{\partial \ell_S}{\partial w_j} = -\frac km\sum_{i=1}^m x_{i,j}\left(\overset{\circ}{y_i} - \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)\right)\)
for any \( j=1,\ldots,d+1\) (when \( j=d+1\) then \( x_{i,j} = 1\)). For \( j=1\), this is the first entry in the vector \( \nabla\ell_S\), for \( j=2\) it is the second entry, etc.
⏚
Using the gradient in SGD. Having determined how to calculate the gradient vector of \( \ell_S\) for an arbitrary set of points \( S\), how does one carry out SGD for logistic regression on a model \(h({\bf x}) = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x})\), and with some set of training data? Initialize \( \tilde{\bf w} = \tilde{\bf w}_{(0)}\) as some vector (it really can be any vector, though if you have an approximate guess for a good one it is useful to start there). Now, I will use \(S\) to denote the set of all training data, as before. However, note that during SGD we will not be calculating the gradient of \(\ell_S\), but of \(\ell_{S_t}\) where \(S_t\) is a subset of \(S\).4 The formula above still holds, except that you simply take the sum over points that are in \(S_t\), and you replace \(m\) with the size of \(S_t\). In the setting of training neural networks, the subset \(S_t\) is called a minibatch.
Decide to go through some number of rounds, say \( T\) rounds. On the \( t^{th}\) round, \( 0\le t < T\), a process selects a subset \( S_t \subset S\) from the available data and \( \tilde{\bf w}_{(t)}\) is updated by5
\(\tilde{\bf w}_{(t+1)} = \tilde{\bf w}_{(t)} - \eta\,\nabla\ell_{S_t}(\tilde{\bf w}_{(t)}).\)
If choices that work well have been made along the way, including how many rounds to go through, then the final vector \( \tilde{\bf w}_{(T)}\) has coefficients of a hyperplane that give a logistic regression model with close to minimal loss. However, this is not always assured, especially if you extend to some combination of logistic regression models (as will be done in neural networks). Sometimes, because of that uncertainty, a different vector than \( \tilde{\bf w}_{(T)}\) is used. For example, the “best performing” (lowest loss) vector among \( \tilde{\bf w}_{(0)}, \tilde{\bf w}_{(1)}, \ldots, \tilde{\bf w}_{(T)}\) or, alternatively, the average of the \( \tilde{\bf w}\)-vectors.
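A minimal sketch of this procedure, using the boxed formula for the gradient; the function names, the mini-batch selection, and the default values of \( \eta\), \( k\), and the batch size are assumptions for illustration.

import numpy as np

def sigma(t, k=1.0):
    return 1.0 / (1.0 + np.exp(-k * t))

def grad_log_loss(W, X_tilde, y, k=1.0):
    """Gradient of the log-loss over a (mini-)batch.

    X_tilde : array of shape (m, d+1), each row an x augmented with a final 1
    y       : array of +-1 labels of length m
    W       : current parameter vector (w_1, ..., w_d, w_{d+1})
    """
    m = len(y)
    y_circ = (y + 1) / 2                  # map labels +-1 to 0/1
    preds = sigma(X_tilde @ W, k)         # sigma_k(w . x_i) for each i
    return -(k / m) * X_tilde.T @ (y_circ - preds)

def sgd_logistic(X_tilde, y, T=100, eta=0.2, batch_size=5, k=1.0):
    """An illustrative SGD loop for logistic regression (not tuned)."""
    W = np.zeros(X_tilde.shape[1])        # any initial vector works
    for t in range(T):
        idx = np.random.choice(len(y), size=batch_size)   # mini-batch S_t
        W = W - eta * grad_log_loss(W, X_tilde[idx], y[idx], k)
    return W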
Also, it is often desirable (though not necessarily possible) to have a convex loss function, as it helps with convergence (helps SGD give “good” answers). ↩
I will keep the value of \( k\) fixed. So it will not change during the gradient descent. ↩
Meaning, it corresponds to the line \( w_1x_1+w_2x_2+b=0\), which in this case is \( x_2 – 3 = 0\). ↩
Recall (from the earlier post) that we select a subset of the training set at each step in the SGD procedure. ↩
When computing \( \nabla(\ell_{S_t})\) in this equation, \( \tilde{\bf w}_{(t)}\) should be used for the \( \tilde{\bf w}\) vector appearing in the boxed formula of the previous section. ↩
In this post I will introduce a new loss minimization approach that is iterative and lets the algorithm “learn” the parameters of a good model. The starting place is the gradient of a function from calculus.
The gradient. Recall that if \( f:\mathbb R^{d+1} \to \mathbb R\) is a function, then the gradient of \( f\) is the vector of its partial derivatives,1
\(\displaystyle \nabla f = \left(\frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_{d+1}}\right).\)
One of the two most important facts about the gradient that you learned in calculus was the following.
🐉🐉 (Some multi-variable calculus appears in this part. You may want to review it!) Theorem.2 At a point \( {\bf a}=(a_1,\ldots,a_d, a_{d+1})\) in the domain of \( \nabla f\), the values of \( f\) decrease most rapidly in the direction of \( -\nabla f({\bf a})\).
Proof. To find the direction in \( \mathbb R^{d+1}\) so that changing the input in that direction decreases the values of \( f\) most rapidly is the same as finding the length 1 vector \( {\bf u}\) such that the directional derivative \( D_{\bf u}f({\bf a})\) has its lowest value. Recall that
\( D_{\bf u}f({\bf a}) = {\bf u}\cdot\nabla f({\bf a}),\)
which means that, for any \( {\bf u}\), if \( \theta\) is the angle between \( {\bf u}\) and \( \nabla f({\bf a})\) then \( D_{\bf u}f({\bf a}) = |\nabla f({\bf a})|\cos\theta\ge -|\nabla f({\bf a})|\), since \( \cos\theta\ge -1\). If you set \( {\bf v} = -\frac{\nabla f({\bf a})}{|\nabla f({\bf a})|}\) then by calculating \( {\bf v} \cdot \nabla f({\bf a})\) you see that \( D_{\bf v}f({\bf a}) = -|\nabla f({\bf a})|\). So the direction of \( -\nabla f({\bf a})\) gives the minimal value of the directional derivative. \( \square\)
⏚
Gradient descent. Say you have a model function for a data set and you would like to change the model to improve it, in the sense of decreasing its error. If you can understand the gradient of the loss function, then the above theorem presents a natural path to take:
The direction of the negative gradient is the direction of greatest decrease, so alter the model function a little bit in the direction of the negative gradient.
The amount that is “a little bit” must be chosen. Notice that I said to alter the model function in the negative gradient direction. Since the partial derivatives are with respect to model parameters, the gradient direction is in the space of parameters, and you are changing the model’s parameters a bit (as directed by the gradient). In gradient descent, you iterate this multiple times, getting closer to a local minimum each time.3
Note that the gradient is the gradient of the loss function, not of the function that is determining the predictions. It is the model function’s parameters that you alter, based on this gradient.
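As a toy illustration of the iteration (the loss function, starting point, learning rate, and number of iterations here are arbitrary choices):

import numpy as np

def grad_f(w):
    """Gradient of a toy loss f(w) = (w[0] - 3)**2 + (w[1] + 1)**2."""
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])     # initial parameters
eta = 0.1                    # "a little bit": the step size
for _ in range(50):
    w = w - eta * grad_f(w)  # step in the direction of the negative gradient
# w is now close to the minimizer (3, -1)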
The following two examples illustrate why you must be careful to take small steps in the direction of the negative gradient. One does this by using a multiple \( \eta\) of the gradient. In the first example, the toy example of a loss function has only one local minimum. Even in that setting, changing \( \eta\) by only 0.1 causes the sequence of values to diverge from the minimum instead of converging to it. (The numbers 1–8 are indicating which iteration, i.e. how many times the gradient was found, and then \(\eta\) times that gradient was subtracted from the input.)
In the second example, the function has more than one local minimum. If the steps in the direction of \( -\nabla f\) are too large, as in the figure below on the right, then the descent can “escape” from one of the locations near a local minimum and end up at a different one, one that could have much larger nearby values. (Images can be enlarged by clicking on them.)
In machine learning, particularly when this is used in neural networks, the subtracted multiple of the gradient, \( \eta\), is called the learning rate. Choosing what value \( \eta\) should have is somewhat elusive. As demonstrated in the above examples, it can have a dramatic effect on the algorithm’s convergence to a model that approximately minimizes loss.4
(Example) Gradient descent for linear regression.5 Suppose that \( \mathcal D\) is the underlying distribution (in the world) that governs two quantities that are being modeled by simple linear regression — for example, say that you are using linear regression to model the price of office space compared to its square footage. The linear models are determined by a slope \( w\) and intercept \( b\), and the squared error of a model \( (w, b)\) has an expected value \( \mathbb E_{\mathcal D}[(wx+b – y)^2]\), where \( (x,y)\) is drawn from \( \mathcal D\), \( x\) being square footage and \( y\) being the observed price. Write \({\bf w} = (w, b)\) and use \( mse_{\mathcal D}({\bf w})\) for this loss function (the generalization error6). Write \( \nabla mse_{\mathcal D}({\bf w})\) for the gradient. In step \( t\) of the gradient descent process, if \( {\bf w}_t\) is (the parameters of) the current model, the next model is
\( {\bf w}_{t+1} = {\bf w}_t - \eta\,\nabla mse_{\mathcal D}({\bf w}_t). \qquad (\dagger)\)
However, direct knowledge of \( {\mathcal D}\) is not available, so an alternate approach must be used as \( \nabla mse_{\mathcal D}\) cannot be computed. Stochastic Gradient Descent is the alternate method. It provides a process to (approximately) carry out gradient descent in practice, given a training set.
Stochastic Gradient Descent (SGD). The idea of stochastic gradient descent (SGD), in the setting of data analysis/machine learning, is to replace (in step \( t\)) the gradient of the true loss function with the gradient of an empirical loss function (on some of the available data). In other words, use a random process to select a subset \( S^t \subset \mathcal X^{train}\) of the training set. Then, defining \( {\bf v^t} = \nabla \text{MSE}_{S^t}({\bf w}_t)\) (or, equal to the gradient of whichever loss function on \(S^t\) that you are using), replace (\( \dagger\)) with
\( {\bf w}_{t+1} = {\bf w}_t - \eta\,{\bf v^t}.\)
Note that \( {\bf v^t}\) is a random variable, depending on \(S^t\). The process used to find \( S^t\) should be such that the expected value of \( {\bf v^t}\) is close to \( \nabla mse_{\mathcal D}({\bf w}_t)\), provided \( S\) is a sufficiently representative sample of data drawn from \( \mathcal D\).
Using random, from the NumPy library (as usual, numpy is abbreviated as np), you can generate the desired subset \( S^t\) in the following way.7
import numpy as np

def find_St(data_points, labels, St_size):
    m = len(data_points)
    # selects St_size indices for points, in range(0,m)
    index_list = np.random.choice(m, size=St_size)
    return data_points[index_list], labels[index_list]
Using this, SGD can easily be implemented for linear regression with the mean squared error on \( S^t\) as the loss function (call that function \( \text{MSE}_{S^t}\)). The goal is to determine an appropriate slope and intercept \( {\bf w} = (w, b)\), so view these as the input to the function. This means that the gradient uses partial derivatives with respect to \( w\) and \( b\). These partial derivatives are
\(\displaystyle \frac{\partial}{\partial w}\text{MSE}_{S^t}(w, b) = \frac2{m_t}\sum_{i=1}^{m_t} (wx_i + b – y_i)x_i\)
\(\displaystyle \frac{\partial}{\partial b}\text{MSE}_{S^t}(w, b) = \frac2{m_t}\sum_{i=1}^{m_t} (wx_i + b – y_i)\),
where each \( x_i\), with label \( y_i\), is in \( S^t\) and there are \( m_t\) points in \( S^t\). Put these together to make the vector \({\bf v^t} = \nabla \text{MSE}_{S^t}({\bf w}_t)\), where you substitute \(w\) and \(b\) in the expressions with their values at step \(t\). Starting with any old line, with slope \( w_0\) and intercept \( b_0\), a (small enough) step in the direction of \( -{\bf v^t}\) will tend to (that is, on average) take the (slope, intercept) parameters closer to \( (\hat w, \hat b)\), the slope and intercept of the LSR line.
For an example of implementing this: the set of 150 points displayed below-left are taken from the line \( y = 4x + 2\), with some noise. I proceeded through only 15 steps of SGD, starting with \( (w_0,b_0) = (0, 0)\) and using the find_St function with the size of \( S_t\) equal to 30 on each step. With the \( S_t\) found, I determined \( {\bf v^t}\) as described above, and used learning rate \( \eta = 0.1\). The lines for all 16 (slope, intercept) pairs (including the initial one) are plotted on the right below. The color of the line goes from blue to red as \(t\) increases. It appears to be “settling” towards the line \( y = 4x + 2\). The line in the last step is approximately \( y = 4.1x + 1.58\).
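Here is a hedged sketch of that experiment, reusing the find_St function from above; the noise level, the range of the x values, and the random seed are assumptions, so the final slope and intercept will differ slightly from the ones reported.

import numpy as np

# 150 noisy points from the line y = 4x + 2 (noise level is illustrative)
np.random.seed(1)
data_points = np.random.uniform(0, 2, size=150)
labels = 4 * data_points + 2 + np.random.normal(0, 1, size=150)

w, b = 0.0, 0.0          # initial slope and intercept
eta = 0.1                # learning rate

for t in range(15):
    xs, ys = find_St(data_points, labels, St_size=30)   # the random subset S_t
    m_t = len(xs)
    resid = w * xs + b - ys
    grad_w = (2 / m_t) * np.sum(resid * xs)   # dMSE/dw on S_t
    grad_b = (2 / m_t) * np.sum(resid)        # dMSE/db on S_t
    w, b = w - eta * grad_w, b - eta * grad_b

print(w, b)   # should settle near slope 4 and intercept 2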
For the end goal here, I will have functions that depend on a set of parameters (which, in turn, determine a ML model function), and I will need to take partial derivatives with respect to those parameters. In order to highlight this, I will use \({\bf w} = (w_1,\ldots,w_d, w_{d+1})\) for the input of the function. You could think of \(w_{d+1}\) as the “shift” parameter \(b\). ↩
This result requires that the partial derivatives of \(f\) be continuous at \({\bf w} = {\bf a}\). ↩
As you get closer to a local minimum, the gradient gets closer to zero, and you move by less and less each step. ↩
For many loss functions, one can prove that the gradient descent process converges for an appropriate choice of \( \eta\). For example, this is known for differentiable convex functions (and even something a bit more general), but we won’t get into that. Importantly, there are cases in ML where, on the face, it seems the loss function is one where convergence is guaranteed, but it is not! Great care is needed… ↩
Gradient descent is useful in many settings, linear regression is only an example. In the next post, we will discuss using a variant of it for logistic regression when the data is not linearly separable. This will lay the foundation of a neural network. ↩
The generalization MSE of \( wx + b\) is the average value of \( (wx + b – y)^2\) over all \( (x,y)\) (all office space sales), not the average over just the data that is available. ↩
The inputs data_points and labels should both be NumPy arrays. The generation of index_list uses sampling with replacement, which can be changed easily (see the NumPy documentation). ↩
Decision stumps are very basic: one decision stump can only separate data into +1 and -1 labels based on a single coordinate being larger or smaller than one number.1 The general idea of boosting is to take some basic models, given by a weighted version of the ERM rule in their class, and create a new model function from them with a linear combination — a sum of the basic models, with each one multiplied by a number (the number being chosen in a certain helpful way). The linear combination (the new model) is often a better model for the data. AdaBoost2 is a particular implementation of this idea.
1.Recall from the last post how a decision stump determines \( \pm 1\) labels, always splitting the instance(data) points by a hyperplane that is perpendicular to the \( x_j\)-axis. We used the notation \( \mathcal B^d\) for decision stumps that worked on data points with \( d\) coordinates. 2.(Schapire and Freund 1995). Short for “Adaptive Boosting,” the algorithm adapts to how well it is performing (learns) as it goes along.
AdaBoost. As input the AdaBoost algorithm takes a number of rounds \( T\) and a training set \( S\) (consisting of data points \( {\bf x_i}\) and labels \( y_i = \pm1\), for \( i=1,\ldots,m\)). When the algorithm ends, the number of basic models used in the linear combination will be the number of rounds. For \( t\) between 1 and \( T\), if \( h_t\) is the basic model for that round, it will have a coefficient \( w_t\) and the new model will be given by
\(\displaystyle h({\bf x}) = \text{sign}\left(\sum_{t=1}^T w_t\,h_t({\bf x})\right).\)
The algorithm. I will explain the algorithm using decision stumps for the basic models. However, any hypothesis class could be used in its place. In round 1 the algorithm has uniform weights in \( {\bf p}^{(1)}\): so the vector’s coordinates are \( p^{(1)}_i = 1/m\), for all \( i\).
Round \( t\) of the AdaBoost algorithm:
⁍ Obtain basic model \( h_t\) from the ERM rule on \( \mathcal B^d\), using weights \( {\bf p}^{(t)}\).
⁍ Compute the weighted empirical risk3: \( \varepsilon_t = L_S(h_t, {\bf p}^{(t)})\)
⁍ Set \( w_t = \ln\left(\frac1{\varepsilon_t} – 1\right)\). 4
⁍ Update \( {\bf p}^{(t)}\) as follows: \( {\displaystyle p^{(t+1)}_i = \frac{p^{(t)}_i \text{exp}({-w_ty_ih_t({\bf x_i})})}{\sum_{k=1}^m p^{(t)}_k \text{exp}({-w_ty_kh_t({\bf x_k})})} }.\)
These are the coordinates of \( {\bf p}^{(t+1)}\) for round \( {\bf t+1}\).
Return: the list of weights and models, \( (w_t, h_t),\ t = 1,\ldots, T\).
To note a few things, the denominator in the update to \( {\bf p}\) equals the sum of the numerators of different \( p_i\)’s. This makes \( {\bf p}^{(t+1)}\) a probability vector. Also, if \( h_t({\bf x_i}) = y_i\) then (since \( w_t > 0\)) the power of e is negative, making \( p_i\) smaller, for that \( i\). If \( h_t({\bf x_i}) = -y_i\) then the opposite occurs and \( p_i\) becomes larger. So the new probability vector has its weight shifted towards the data points that were previously misclassified.
3.For us, \( h_t\) is the decision stump, with some \( \theta, j,\) and \( b\), that minimizes the expression \( F_j(b)\) from the last post, using the entries of \( {\bf p}^{(t)}\) for the weights. The risk \( \varepsilon_t\) is the value of that minimal \( F_j(b)\). 4.Note that \( w_t\) is larger when \( \varepsilon_t\) is smaller. Also, there is an assumption that \( 0 < \varepsilon_t < 1/2\). If it were that the error \( \varepsilon_t = 0\), then simply use the stump \( h_t\) for the model. The assumption \( \varepsilon_t < 1/2\) makes it so that \( w_t > 0\), and comes from assuming that a weak learner is giving the basic model functions.
AdaBoost in Python. Below we provide a block of Python code that will carry out the AdaBoost algorithm.
Some auxiliary functions are needed.
First, erm_ds is the efficient implementation of the ERM rule in \( \mathcal B^d\), with given weights \( {\bf p}\), that was discussed last post. Next, a function stump evaluates a decision stump at a point; I wrote the sign function it uses so that it returns -1 if its input is negative, and returns 1 otherwise, including when the input is 0. Finally, a function L_S is defined to compute the weighted loss of a decision stump. There are three inputs to L_S: a triple of parameters f that determines a decision stump; a list S which has two entries, S[0] being the list of data points for training and S[1] the corresponding list of labels; and a list P that is the probability vector. Since the labels are \( \pm1\) and the stump gives the model’s answer at a point, the for loop in L_S goes through the data points and adds 0 to loss if the stump is right, and adds P[i] to loss otherwise. Sketches of these helpers are given below.
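Since those code blocks are not reproduced in this excerpt, here are hedged sketches of stump and L_S matching the descriptions above. The parameter order (theta, j, b) for a stump is an assumption, and erm_ds is assumed to exist and return such a triple.

def sign(t):
    """Returns -1 for negative input and 1 otherwise (including 0)."""
    return -1 if t < 0 else 1

def stump(f, x):
    """Evaluate the decision stump with parameters f = (theta, j, b) at point x."""
    theta, j, b = f
    return sign(theta * (b - x[j]))

def L_S(f, S, P):
    """Weighted loss of the stump f on S = [data points, labels], with weights P."""
    loss = 0.0
    for i in range(len(S[0])):
        # labels are +-1: add P[i] to the loss only when the stump is wrong
        if stump(f, S[0][i]) != S[1][i]:
            loss += P[i]
    return loss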
With these preliminaries, the following code implements AdaBoost:
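The following sketch follows the rounds described above; it relies on the stump and L_S helpers and on the assumed erm_ds routine, so it is illustrative rather than the post's exact code.

import numpy as np

def adaboost(S, T):
    """A sketch of AdaBoost with decision stumps, following the rounds above.

    S : a list [data points, labels], with labels equal to +1 or -1
    T : the number of rounds
    Returns the list of pairs (w_t, h_t).
    """
    m = len(S[0])
    y = np.asarray(S[1])
    P = np.ones(m) / m                      # round 1: uniform weights
    output = []
    for t in range(T):
        h = erm_ds(S, P)                    # weighted ERM over the stumps (assumed helper)
        eps = L_S(h, S, P)                  # weighted empirical risk, assumed 0 < eps < 1/2
        w = np.log(1 / eps - 1)             # coefficient for this round's stump
        preds = np.array([stump(h, x) for x in S[0]])
        P = P * np.exp(-w * y * preds)      # shift weight toward misclassified points
        P = P / P.sum()
        output.append((w, h))
    return output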
Example. Consider the artificially generated training set of points in the plane, displayed on the right5. This data set cannot be separated with one decision stump. The one with minimal loss, using uniform weights, has \( j=2, b\approx 0.78\); it mislabels 8 blue points that have 2nd coordinate less than 0.78. In this section, boosting stumps will give a model for the data set.
To begin, if a person had been given this data, and were asked to draw splitting hyperplanes for a set of stumps that would separate the data (since we are in \( \mathbb R^2\), that means vert. or horiz. lines), they might draw something like the following.
At first glance there appear to be 4 decision stumps involved (for the four linear parts of the green boundary), but the boosted model doesn’t work out quite so simply6. The first 4 decision stumps give a decision boundary that has only three linear parts. After 6 rounds we get five parts to the decision boundary. A depiction of how that boosted model with 6 stumps labels points is shown in the next figure.
To be clear, I estimated the gray and green lines myself (I actually drew it on my iPad), with no reference to the algorithm’s output. The coloring of the points on the right comes from the model that AdaBoost produces after 6 rounds. The training error of that model is approximately 0.097 (meaning it is right on more than 90% of the training data – note there are just 3 blue points on the top-left being mislabeled).
If we wanted to try to capture those extra points, we could do an extra round in AdaBoost. The result is shown in the next figure.
While the result after 7 rounds looks qualitatively more like our initial guess, a more careful look shows that it is mislabeling more training points than the previous one; the training error here is approximately 0.129. Recall, the training set was fairly small and so AdaBoost doesn’t have much to work with at this point. However, one lesson we can take is that it is possible for training error to increase with more rounds of AdaBoost.
5.Disclaimer: The number of data points for this example is admittedly low for the complexity of the boosting procedure. 6.In fact, assuming different weights, you need at least 3 decision stumps in the linear combination just to get a result that is different than one decision stump alone. Such is the reality of the sign function.
Second example — finding a high contrast edge. This example will use AdaBoost on decision stumps to complete a task on images. The images in the data set are grayscale images – each pixel has a value from 0 to 255 (inclusive), the pixel being brighter when closer to 255.
Given an image, the task will be to determine if (somewhere in the image) there is a roughly horizontal edge with very bright pixels above it and dark pixels below it. Call this a skyline curve. To the right is an image with a very strong skyline curve appearing at the top of the dog’s head. So the goal is to decide if there is a skyline curve that stands out.7
The image files come from an old Dogs vs. Cats challenge at the website kaggle.com. The original competition was to make an algorithm that identifies whether each picture is of a cat or dog.
The original training set has 25,000 images, each labeled ‘cat’ or ‘dog.’ Since our task is different than the original challenge, we will have to determine our own labels. Due to time constraints, this will require the use of a smaller training set.8
Making vectors out of the images. There are many ways to make a vector from a grayscale image. A common way is to treat each pixel value as a coordinate and order these in some way. For example, if the image is \( m\) pixels by \( n\) pixels, then you could make the first \( n\) coordinates be the values from the first row of pixels, then go to the \( n\) pixels in the next row, etc. In the end you would have a vector in \( \mathbb R^{mn}\).
The above process for getting a vector is not well-suited to this task, for a few reasons.9, 10, 11 Instead, I’ll adapt an idea for image recognition tasks that uses rectangles, and certain sums of the pixel values in those rectangles, to create coordinates for the vector.12
More precisely, divide the height (in pixels) of an image by 15 and call the result \( \alpha\). Divide the width (in pixels) of an image by 75 and call the result \( \beta\).13 For an \( \alpha\times \beta\) rectangle, let’s call it r, calculate the sum of the pixel values of the top-half of r and then subtract the sum of the pixel values of the bottom-half of r. This gives a number, call it tb(r), the top-bottom score of r.
For example, if r is the \( 6\times 3\) rectangle with pixel values shown to the right, then tb(r) = 1739. Remember, the task is to find a skyline curve. If a skyline curve passes through the middle of r, then tb(r) should be very large relative to the scores of other rectangles (high values on top, minus close-to-zero values on the bottom).
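A small sketch of computing the top-bottom score with NumPy; the helper name tb and the slicing convention for the top half are the only assumptions.

import numpy as np

def tb(rect):
    """Top-bottom score of a rectangle of pixel values (a 2-D NumPy array):
    the sum of the top half minus the sum of the bottom half."""
    half = rect.shape[0] // 2
    return rect[:half].sum() - rect[half:].sum()

# e.g. for an alpha-by-beta rectangle cut out of a grayscale image `img`,
# where (row, col) is the rectangle's top-left corner:
# score = tb(img[row:row + alpha, col:col + beta])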
The idea is to focus on when tb(r) is large; however, the particular tb(r) scores don’t matter a great deal. Rather it matters where the rectangles with highest score are in relation to each other. If a “roughly horizontal” edge has decent length, there would be adjacent rectangles with high tb(r) along that edge.
The vector: Take the 10 rectangles14 that have the largest value of tb(r); call them r1, r2, …, r10. For each ri, measure how far it is from the others (among those 10) in some way.15 Call that measure m(ri). The vector is the list of these measures sorted in ascending order, so each coordinate is some m(ri) and the coordinates do not decrease from left to right.16 This produces a vector in \( \mathbb R^{10}\) for each image. To the right is one of the images in the data set, with the 10 highest-scoring rectangles drawn in red. Its vector: approximately (1.7, 1.9, 1.9, 2.4, 2.9, 3.1, 3.1, 3.8, 7.2, 33.5).
With a labeled subset of the images (a person having decided if there is a skyline curve), use part of the labeled data for training, and the other part to test the resulting model. The training is done on the images’ vectors in \( \mathbb R^{10}\), by applying AdaBoost with decision stumps in \( \mathcal B^{10}\) as the base models.
I generated training and test sets for the Dogs vs. Cats images by using numpy.random.choice to randomly select a file, storing which one it was. Then I displayed it using matplotlib.pyplot, visually decided whether it had a strong skyline curve and, accordingly, added a 1 or -1 to a list of labels. Having to do this semi-manually necessitated a rather small training set (80 images) and test set (20 images).
The results were okay, considering the training data! I iterated through 25 rounds, recording the training error of the AdaBoost output for each (honest empirical risk, not weighted by \( {\bf p}\)). To the right is the graph of the training errors, the horizontal axis being the number of stumps (rounds) in AdaBoost.
That the error is larger when the number of rounds is even is probably an artifact of the method; but note that the error for small odd numbers of rounds is close to decreasing. Something dramatic happens between 17 and 21 rounds. Due to the small amount of training data, I would guess that the model is overfitting at 21 rounds and that the best model is the one with 17 stumps (rounds), and training error around 0.15.
Using the test data, the model with 17 stumps had a test error of about 0.25. The model with 21 stumps had a test error of about 0.35. To visualize the results, I selected 3 images at random from the 25,000 and have displayed them and the rectangles for their vectors below. The captions give the 17-stump model’s output.
7.Disclaimers #1 & #2: The task of finding a particular type of curve or shape in an image, while quite applicable, might be called an ill-defined problem, since it does not have an objectively correct answer. The question of spam email was also ill-defined. Such questions are an important part of data science.
Also, there are rather good algorithms for finding curves in images (though particular kinds of curves can be more of a challenge). These algorithms use different techniques. We might discuss one of them when we get to cluster analysis.
8.Disclaimer #3: With a somewhat small training set there is potential for bad results. This is especially true considering that I am personally labeling the training and test data.
9.The images in the data have differing numbers of pixels. However, to use decision stumps, it is necessary that the vectors all have the same length.
10.The method of shuffling the pixels into coordinates simply doesn’t make sense for how decision stumps work, nor for what the end goal of the task is.
11.The images in the data set are hundreds of pixels by hundreds of pixels. So \( mn\) is in the tens of thousands — vectors obtained this way would have tens of thousands of coordinates, as much (or more) than the number of data points. Overfitting would be almost inevitable.
12.An idea like this was proposed by Viola and Jones for the task of determining if a photo was of a human face.
13.The numbers 15 and 75 were a choice. For most images this choice should make it so that an \( \alpha\times\beta\) rectangle is at least twice as tall as it is wide. If dividing does not give an integer, then use the floor function (simply using int will work in Python) to get the closest integer below it.
14.The number 10 was a choice; a different choice could well improve results.
15.A measure derived from the taxicab metric on the grid of rectangles was used. The measure penalized vertical taxicab distance a bit more than horizontal; rather than finding the closest ri, it found the four closest and averaged their taxicab distances to the given r, and it was multiplied by a constant that served to normalize the values between images.
16.We are more concerned with the set of rectangles, rather than their order according to the tb score.
The last post discussed the tradeoff that arises from the decomposition of Mean Squared Error into a Bias and a Complexity (and a Noise) term. In situations where other measures than MSE are being used for the loss of a model function (in classification problems, for example), a similar decomposition can be seen.
This and the next post will address an idea for how to take a hypothesis class that has strong Bias (the set of model functions is not robust enough for one of them to be a good classifier on the data) and combine the functions, with weights, to get a good classifier. In essence, taking weighted sums of functions from the hypothesis class with strong Bias adds needed Complexity, enough to reduce the generalization error.
Notions of learnability. Returning to the setting of binary classification, the post Risk Minimization with Bias contains a theorem about getting small error (not more than a given \( \varepsilon\)) with high probability (at least \( 1-\delta\), for a given \( \delta\)) when the training set is sufficiently large and the hypothesis class has only finitely many functions. When a similar statement can be made about any given hypothesis class \( \mathcal H\) (not necessarily finite), then \( \mathcal H\) is called PAC learnable.
A class of functions \( \mathcal H\) which satisfies the Realizability assumption is called PAC learnable if for any \( \delta, \varepsilon \in (0,1)\) there is a positive integer \( m\) so that if \( S\) is sampled i.i.d. from \( \mathcal D\) and \( |S| \ge m\) then
\( L_{\mathcal D, h^\star}(h_S) \le \varepsilon\)
with probability at least \( 1-\delta\).
To reflect the terminology of \( \mathcal H\) being “learnable,” I will use the term “learner” or “learning algorithm” for the rule or algorithm that finds a model \( h_S\in\mathcal H\) with small error, given the training set \( S\).
It may be that fully implementing the ERM rule on \( \mathcal H\) is computationally undesirable — it requires finding a function that achieves a global minimum over \( \mathcal H\). It can be beneficial to consider a notion of learnability that is weaker than PAC learnable.
A learner for a class of functions \( \mathcal H\) is called a \( \gamma\)-weak learner if for any \( \delta \in (0,1)\), and any distribution \( \mathcal D\) on \( \mathcal X\), there is a positive integer \( m\) so that if \( S\) is sampled i.i.d. from \( \mathcal D\) and if \( \mathcal H\) is realizable1, then whenever \( |S| \ge m\) the learner will determine \( g_S\) so that
\( L_{\mathcal D, h^\star}(g_S) \le \frac12 - \gamma\)
with probability at least \( 1-\delta\). The hypothesis class \( \mathcal H\) is called \( \gamma\)-weak learnable if there is a \( \gamma\)-weak learner for \( \mathcal H\).
How can \( \gamma\)-weak learners be understood? First, note that the learned function \( g_S\) need not come from \( \mathcal H\). So there is another hypothesis class (possibly not containing the true label function) from which \( g_S\) is drawn.
Secondly, there is an easily described model (strictly speaking, it is not a function) that will have loss equal to 1/2: Predict the label for \( {\bf x}\in\mathcal X\) to be +1 or -1 at random, each with equal likelihood (in other words, flip a coin to guess the label). Since each point in \( \mathcal X\) has a correct label in \( \{\pm1\}\), the process has a 50% chance of being incorrect on \( {\bf x}\).
Remember, the loss of a model is the probability that it will be wrong. So a \( \gamma\)-weak learner will produce a model that is a bit better than the flip of a coin2. That doesn’t sound all that great. However, a procedure to find such a thing might be much less expensive computationally. Furthermore, there is a way to combine multiple models coming from a weak learning algorithm in order to improve the accuracy.
1.Recall that the realizable assumption on \( \mathcal H\) means there is an \( h^\star \in \mathcal H\) that has zero true loss — correctly labels all data. 2.The number \( \gamma\) is positive, so \( 1/2 – \gamma\) means lower than 50% chance of being wrong.
Getting a weak learner from a more basic hypothesis class. For the rest of this post, and the next post, I will work with a particular class of functions as a basic hypothesis class. Following the ERM rule in this basic class will be an example of a weak learner for a different hypothesis class. Given data vectors \( {\bf x} \), where \(\mathcal X \subseteq \mathbb R^d\) for some dimension \( d\), the class of functions consists of all functions \(f:\mathbb R^d \to \{1, -1\}\) where, for some \(b\in\mathbb R\), some \(\theta\in\{1,-1\}\), and some \(j\in\{1,2,\ldots,d\}\), you have
\( f({\bf x})=\text{sign}(\theta(b – x_j)).\)
Call this class of functions \(\mathcal B^d\). So, a function from \( \mathcal B^d\) is determined by \( b,\theta,\text{ and }j\), and it acts by taking a \( d\)-dimensional vector to the sign of \( \theta(b-x_j)\), with \( x_j\) being the \( j^{th}\) coordinate of the vector. Just to determine a convention for what to do if \(b = x_j\), let’s say that \(\text{sign}(0) = 1\).
The functions in \( \mathcal B^d\) are called decision stumps. Considering the post Binary Classification, decision stumps are special types of half-space models3. Their special type makes them very quick to compute from a training set.
3.The normal vector is \( {\bf w} = (0,\ldots,0,-\theta,0,\ldots,0)\) (where \( -\theta\) appears in position \( j\)) and the shift/bias of the hyperplane is \( \theta\cdot b\).
A \( \gamma\)-weak learnable hypothesis class using decision stumps. In this section, consider a class of functions \( \mathcal H^d\) that can have value 1 on a bounded interval of one coordinate and -1 elsewhere (or vice-versa). That is, for numbers \( a,b\) with \( a<b\), one coordinate, say the \( j^{th}\) coordinate, of vectors in \( \mathbb R^d\), and \( \theta=\pm1\), define \( w_{a,b,\theta,j}({\bf x}) = \begin{cases}\theta,\ \text{if }a\le x_j\le b \\ -\theta,\ \text{otherwise}\end{cases}\).
Then,
Theorem. The ERM rule on \( \mathcal B^d\) is a \( \gamma\)-weak learner for \( \mathcal H^d\), for any \( 0<\gamma<\frac16\).
Idea of Proof. Let \( \mathcal X\) be a data set with correct label function \( h^\star\in \mathcal H^d\) and with probability distribution \( \mathcal D\). Let \( h^\star = w_{a^\star, b^\star, \theta^\star, j^\star}\). Then, regardless of \( \mathcal D\), at least one of the following three regions of \( \mathbb R^d\) does not contain more than 1/3 of \( \mathcal X\) (according to the measure determined by \( \mathcal D\)): \( \{x_j < a^\star\}\), \( \{a^\star \le x_j \le b^\star\}\), and \( \{b^\star < x_j\}\), where \( j = j^\star\).
For ease of discussion, say it is the third region, \( \{b^\star < x_j\}\). Then, for any \( \varepsilon>0\), a sufficiently large sample \( S\) will determine a decision stump \( h\) from \( \mathcal B^d\) that has probability less than \( \varepsilon\) of being wrong on the other two regions4. Thus, \( L_{\mathcal D,h^\star}(h) \le 1/3 + \varepsilon\).
Choosing \( \varepsilon < 1/6\) makes \( \gamma = 1/6 – \varepsilon\) be such that \( L_{\mathcal D,h^\star}(h) \le 1/3 + \varepsilon = 1/2 – \gamma\). \( \square\)
4.There are some details missing here. In particular, this guarantee comes from \( \mathcal B^d\) being PAC learnable and we have yet to explain why that is true. It is a consequence of the VC dimension (which equals 2 in this case), a subject of a future post.
Implementation of ERM on stumps. Given a training set, or sample set of points \( S \subset \mathbb R^d\), how do you carry out ERM on \( \mathcal B^d\)? And can you efficiently find the \( h_S\in\mathcal B^d\) with minimal empirical loss? In this section, I describe a procedure to achieve an even more general goal, in anticipation of the next post where AdaBoost is described.
As an example, consider the figure to the right. The red and blue points are the sample points in \(S\), and the blue points have (correct) label \(1\), while red points have label \(-1\). If I use the second coordinate for decision stumps (meaning \(j=2\)), then decision stumps would give one label in \(\{1,-1\}\) to the points above the horizontal line \( y=b\), and the other label to points below it (which label depends on \( \theta\)). For this example, there will be errors on some points, no matter where the horizontal line is (no matter what \(b\) is; some possible horizontal lines have been drawn). If vertical lines are used (\(j=1\)), then there will also be some error, no matter where the vertical line is. The ERM rule is to find, among all vertical and horizontal lines (since \(d=2\)), the one with minimal empirical risk (that is wrong the least often)7.
I would like to express the empirical risk in a way that will be useful later, for the purpose of combining different decision stumps into one model function. To do so I will use the notation 𝟙\(_{[{\bf x}_{i,j}>b]}\) as shorthand for6
𝟙\(_{[{\bf x}_{i,j}>b]} = \begin{cases}1, {\bf x}_{i,j}>b\text{ is true;} \\ 0, {\bf x}_{i,j}>b\text{ is false.}\end{cases}\)
As an example of using this notation, \(\displaystyle\sum_{{\bf x_i}\in S}\)𝟙\(_{[{\bf x}_{i,j}>b]}\) would be equal to the number of points \({\bf x}_i\) in \(S\) for which the \(j^{th}\) coordinate, \({\bf x}_{i,j}\), is bigger than \(b\).
If \(\theta=1\), and \(S = \{{\bf x}_1, \ldots, {\bf x}_m\}\), then the decision stump determined by \(\theta, j\), and \(b\) will evaluate to \(-1\) on \({\bf x}_i\) if and only if \({\bf x}_{i,j} > b\). (Recall how the functions in \(\mathcal B^d\) are defined above.) Therefore, the empirical risk of this decision stump is equal to
\(\displaystyle L_S(h_{\theta=1,j,b}) = \sum_{i : y_i=1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}>b]} + \sum_{i : y_i=-1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}\le b]}.\)8
If \(\theta=-1\), then the values of the decision stump switch: it gives \(1\) on \({\bf x}_i\) if and only if \({\bf x}_{i,j} \geq b\). And so, when \(\theta=-1\), the empirical risk is
\(\displaystyle L_S(h_{\theta=-1,j,b}) = \sum_{i : y_i=-1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}\geq b]} + \sum_{i : y_i=1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}< b]}.\)
The example training set at the beginning of the section has 26 points, and a decision stump with minimum empirical risk is given by \(\theta=-1, j=2,\) and \(b \approx 0.45\). There are four -1 labeled points (red) that are above this horizontal line. This means that \(\sum_{i : y_i=-1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}\geq b]} = 4/26\). There is just one +1 labeled point (blue) that is below the line, so \(\sum_{i : y_i=1}\frac{1}{m}\)𝟙\(_{[{\bf x}_{i,j}<b]} = 1/26\). And so, the empirical risk of this decision stump is \(5/26\).
Now I will introduce a method to put different weights on the training points. The goal is to make it so that getting an error on a point that has a larger weight will contribute more to the empirical risk. Finding a decision stump that minimizes the weighted empirical risk will then be more likely to give a correct label to the points that have larger weight.
Let \( {\bf p} = (p_1,\ldots,p_m)\) be a probability vector9. A vector entry \(p_i\) will be used as the weight of the contribution made to the loss when the model is incorrect at point \({\bf x}_i\). If \( \theta=1\), and \(j\) is given \((1\le j\le d)\), then define
\(\displaystyle F_j(b) = \sum_{i : y_i=1}p_i\,\)𝟙\(_{[{\bf x}_{i,j}>b]} + \sum_{i : y_i=-1}p_i\,\)𝟙\(_{[{\bf x}_{i,j}\le b]}.\)
If the weights are equal, with each \( p_i = 1/m\), then \( F_j(b)\) is the regular empirical risk from before, \(L_S(h_{\theta=1,j,b})\), where \( h_{\theta=1,j,b}\) is the decision stump. A simple adjustment can be made for the case when \( \theta=-1\) by switching the inequalities that compare to \( b\), analogous to what I did with the regular empirical risk above. So to follow the ERM rule it is necessary to find \( \text{argmin}_{\theta,j,b}(F_j(b))\). (That is, find the \( \theta, j,\) and \( b\) that minimize \( F_j(b)\).)
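As a small illustration, the weighted risk can be computed directly from this definition. The following sketch (the helper name and the data are my own) handles both values of \( \theta\) by evaluating the stump itself rather than the \( \theta=1\) formula specifically:

import numpy as np

def weighted_risk(X, y, p, theta, j, b):
    # F_j(b): total weight of the points that the stump (theta, j, b) mislabels
    pred = np.where(theta * (b - X[:, j]) >= 0, 1, -1)   # sign, with sign(0) = 1
    return np.sum(p[pred != y])

X = np.array([[0.2, 0.1], [0.6, 0.8], [0.9, 0.3]])
y = np.array([1, -1, 1])
p = np.full(3, 1/3)                                   # uniform weights: the ordinary empirical risk
print(weighted_risk(X, y, p, theta=1, j=0, b=0.5))    # 1/3: only the third point is mislabeled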
While there are infinitely many possibilities for \(b\), they do not all need to be considered to find \( \text{argmin}_{\theta,j,b}(F_j(b))\). First, sort the \( {\bf x}_i\) in ascending order according to their \( j^{th}\) coordinate; since it does not hurt to change which point in \(S\) is called \({\bf x}_1\), which one is \({\bf x}_2\), and so on, adjust the indices so that \({\bf x}_1, \ldots, {\bf x}_m\) satisfy \({\bf x}_{i,j} < {\bf x}_{i+1,j}\) for every \(1\le i\le m-1\) (assume, for simplicity, that the \( j^{th}\) coordinates are distinct). Now, define
\( B_j = (\text{set of averages of consecutive }{\bf x}_{i,j}) \cup \{{\bf x}_{1,j} - 1\} \cup \{{\bf x}_{m,j} + 1\}\)
In other words, \(B_j\) contains numbers of the form \(\frac{{\bf x}_{i,j} + {\bf x}_{i+1,j}}{2}\), and it also contains \({\bf x}_{1,j}-1\) and \({\bf x}_{m,j}+1\). Every value that \( F_j\) can take is already achieved at some \( b \in B_j\). There are \( m+1\) numbers in \( B_j\) so, for each \(\theta\) and each \(j\), only \( m+1\) values of \( F_j\) need to be checked.
For each \( b\), there are \( m\) terms to be added, and you do this for each \( j = 1,\ldots,d\), and for \( \theta=1,-1\). Then you take the \( \theta,j,\) and \( b\) that give the smallest \( F_j(b)\). The runtime for this description of ERM on decision stumps is \( \mathcal O(dm^2)\).
More efficient implementation. Suppose that \( b\in B_j\), and \( b'\) is the next larger element in \( B_j\). Say \( {\bf x}_k\) is the one element of \( S\) between \( b\) and \( b'\) (so \( b < {\bf x}_{k,j} < b'\)). Then, if \( y_k=1\) (and \( \theta=1\)), the weighted loss using \( b\) is \( p_k\) greater than when using \( b'\). This is because \( {\bf x}_k\) is given label \(-1\), by the decision stump, when using \( b\), and label \(+1\) when using \( b'\). The roles reverse if \( y_k=-1\). So, when \(\theta=1\), we get \( F_j(b') = F_j(b) - y_kp_k\).
If \(\theta=-1\) instead, then \(F_j(b') = F_j(b) + y_kp_k\). So, in general, we have
\(F_j(b') = F_j(b) - \theta y_kp_k\)
when \(b\) and \(b'\) are consecutive numbers in \(B_j\). Once \( F_j(b)\) is calculated for the smallest \(b\), a recursive calculation gives each other \( F_j(b')\), for that \( j\), each in constant time by using the equation. If implemented this way, then the process of minimizing the values \( F_j(b)\) runs in linear time. Good sorting algorithms run in at most \( \mathcal O(m\log m)\) time10, so implementing ERM on decision stumps can be done in \( \mathcal O(dm\log m)\) time.
I have placed below some code for this implementation of decision stumps in Python. The figure to the right shows the same example dataset as above. When a uniform weight is used (\( {\bf p} = (1/26,\ldots,1/26)\)) the decision stump with minimal loss has \( \theta=-1, j = 2\) and \( b \approx 0.447\) (the dashed line). When two blue points that are farthest to the right are weighted very heavily, the minimal loss decision stump has \( \theta=-1, j=1\) and \( b\approx 0.788\) (the solid line).
import numpy as np

def erm_ds(S, P):
    # S: two lists, one of X vectors and one of Y labels
    x = np.array(S[0])
    y = np.array(S[1])
    d = len(x[0])     # dimension
    min_params = []   # parameters for min loss stump, one per (theta, j)
    for t in range(2):
        theta = 2*t-1 # theta=-1 when t==0, theta=1 when t==1
        for j in range(d):
            b = np.nan
            Flist = [] # list of losses for stumps
            # next 3 lines: sort data in coord j; make labels, P match
            sorted_i = x[:,j].argsort()
            xsort, ysort = x[sorted_i], y[sorted_i]
            Psort = np.array(P)[sorted_i]
            # stump with lowest b mislabels any theta-labeled pt.
            F = sum([Psort[i] for i in range(len(P)) if ysort[i]==theta])
            Flist.append(F)
            for i in range(len(x)):
                # efficient implementation: compute this F from last F
                Flist.append(Flist[i] - theta*ysort[i]*Psort[i])
            min_i = np.argmin(np.array(Flist))
            if (min_i == 0):
                b = xsort[min_i][j] - 1
            elif (min_i == len(x)):
                b = xsort[min_i-1][j] + 1
            else:
                b = 0.5*(xsort[min_i-1][j] + xsort[min_i][j])
            # get one of following for each theta and each j
            # store a 4-tuple
            min_params.append((theta, j+1, b, Flist[min_i]))
    # minimum according to F values
    lowest_one = np.argmin(np.array(min_params)[:,3])
    # return (theta, j+1, b, loss) for the min loss stump
    # j+1 is used since first coordinate usually called x_1 (not x_0)
    return min_params[lowest_one]
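A quick way to exercise erm_ds on a toy data set (my own made-up points; it assumes numpy has been imported as np, as in the listing above):

X = [[0.1, 0.9], [0.3, 0.2], [0.7, 0.8], [0.9, 0.1]]
Y = [1, -1, 1, -1]                     # label is +1 exactly when the second coordinate is large
P = [0.25, 0.25, 0.25, 0.25]           # uniform weights
theta, j, b, loss = erm_ds([X, Y], P)
print(theta, j, b, loss)               # expect j == 2 and loss == 0.0 for this data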
6.The notation \( {\bf x}_{i,j}\) means the \( j^{th}\) coordinate of the training vector \( {\bf x_i}\). This function is sometimes called the characteristic function, for the set where the statement is true. 7.If \( {\bf p}\) consists only of fractions \( 1/m\), then least wrong simply means the one that has the least number of points mislabeled. For other probability vectors, the decision stump is penalized more for mislabeling certain points, the ones with a probability larger than the others. 8.The notation \(i : y_i = 1\) below the summation sign means that you are summing only over those \({\bf x}_i\) which have \(y_i=1\) as their label. It is similar for \(i : y_i = -1\) on the second summation. 9.A vector with non-negative entries that sum to 1. 10.In some cases, sorting can be done in linear time. This could be such a case (I haven’t thought carefully about it). If so, then ERM on decision stumps can be done in linear time.
The previous post introduced having a hypothesis class \( \mathcal H\), and the model \( h_S\) determined from a training set \( S = \mathcal X^{train}\). In practice, think of choosing an algorithm that works on a training set and determines a model (for example, the linear algebra procedure to determine the LSR line from data). 1 If \( \mathcal H\) is more restricted (has more bias), then the assumptions are stronger about how the data is really distributed. Of course, introducing too much bias will hurt the predictive potential of the resulting model, as well.
In this post, we will see a decomposition of the expected value of a loss function (the MSE in regression) into the Bias and the Complexity of \( \mathcal H\), and we will see the need to balance them in order to get high quality results from machine learning algorithms.
Not complex enough.
As a spot check, consider an extreme case where \( \mathcal H\) contains only constant functions.
Binary classification: would mean the model can only output +1 for every instance, or output -1 for every instance (so \( |\mathcal H| = 2\)). That is bad news if the true distribution of the data, \( \mathcal D\), has both labels +1 and -1 on significantly many points in \( \mathcal X\).
Linear regression: would mean you are restricting to horizontal regression lines (since having constant \( y\)-values on the line gives a constant model output). For simplicity, suppose that the source of the data requires the \( y\)-values to be in some interval \( [a,b]\), and the distribution of \( y\)-values is uniform in that range. 2
Among all possibilities for \( S\), it is more likely that the observed \( y\)-values in \( S\) will have an average that is close to the overall average \( \bar y = \frac{a+b}2\) than it is that their average will be far from \( \bar y\). 3 In fact, the expected horizontal regression line would be \( y = \bar y\). Since the data is distributed uniformly throughout the range, this expected model will have high variance between the observations and the model’s prediction. That is,
\( \mathbb E_{{\bf x}, y}\left[(\bar y - y)^2\right]\)
will be large. This is a simple example of the Bias being large. It is somewhat like a gambler noticing that the average outcome of a (single-zero) roulette wheel is 18, and deciding to always bet on 18.
Complexity and the Bias-Complexity decomposition.
Suppose that the MSE is being used to measure error (it is the loss function) of the model. 4 To use MSE in order to understand the expected generalization error, rather than understanding it for just one model, it is necessary to consider the variance between \( h_S({\bf x})\) and \( y\) not only over the instance space \( \mathcal X\), but also over the set of possible training sets (each of which gives rise to a particular model \( h_S\)).
Notationally,5 you want to understand the average \( \mathbb E_{S, {\bf x}, y}[(h_S({\bf x}) - y)^2]\). We will break this expression up in a couple of ways in order to express it in pieces that involve how you expect the algorithm and its resulting model to perform on average.
Aside: Let \( \bar h\) be the expected model function as you vary the training set. This can be difficult to wrap your head around, since it is a weighted average of functions. The mathematics behind this boils down to having a “measure” on the set of functions (a way of giving some subsets a size), and that allows you to integrate something that produces a function as its output.
In practice, you can approximate \( \bar h\) as follows. Take a “sufficiently large” number of training samples \(S_1,S_2,\ldots,S_N\), with each containing the same number of points, drawn i.i.d. from the data distribution \( \mathcal D\). If \( h_i\) is the model determined by \( S_i\), then for each \( {\bf x}\),
\( {\displaystyle \bar h({\bf x}) \approx \frac{1}{N}\sum_{i=1}^{N} h_i({\bf x})}.\)
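In code, that approximation is nothing more than averaging predictions over many fitted models. Here is a sketch with synthetic data and linear models (all of the names and numbers are mine):

import numpy as np

rng = np.random.default_rng(0)
N = 200                                   # number of training samples S_1, ..., S_N
xs = np.linspace(0, 4, 50)                # where to evaluate the averaged model

preds = np.zeros((N, len(xs)))
for i in range(N):
    x_i = 4 * rng.random(10)                          # a size-10 training sample S_i
    y_i = np.sin(x_i) + 0.3 * rng.normal(size=10)     # noisy labels
    w, b = np.polyfit(x_i, y_i, 1)                    # the model h_i determined by S_i
    preds[i] = w * xs + b

h_bar = preds.mean(axis=0)                # pointwise average approximating the expected model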
Returning to the expected generalization error, add and subtract \( \bar h({\bf x})\) inside the square and expand:
\( \mathbb E_{S,{\bf x},y}[(h_S({\bf x})-y)^2] = \mathbb E_{S,{\bf x}}[(h_S({\bf x})-\bar h({\bf x}))^2] + 2\,\mathbb E_{S,{\bf x},y}[(h_S({\bf x})-\bar h({\bf x}))(\bar h({\bf x})-y)] + \mathbb E_{{\bf x},y}[(\bar h({\bf x})-y)^2].\)
By definition \( \bar h = \mathbb E_S[h_S]\), and also the factor \( (\bar h({\bf x}) - y)\) is independent of changing \( S\). So the mixed term equals 0:
\( \mathbb E_{S,{\bf x},y}[(h_S({\bf x})-\bar h({\bf x}))(\bar h({\bf x})-y)] = \mathbb E_{{\bf x},y}\big[\mathbb E_S[h_S({\bf x})-\bar h({\bf x})]\,(\bar h({\bf x})-y)\big] = 0.\)
Similarly, write \( \bar y({\bf x}) = \mathbb E_y[\,y\ |\ {\bf x}\,]\) for the expected label at \( {\bf x}\), and add and subtract \( \bar y({\bf x})\) in the last term. Using the definition of \( \bar y({\bf x})\), and the fact that \( \bar h({\bf x}) - \bar y({\bf x})\) is independent of \( y\), the mixed term when you expand also vanishes. The end result is that the expected generalization error \( \mathbb E_{S,{\bf x},y}[(h_S({\bf x})-y)^2]\) is equal to
\( \underbrace{\mathbb E_{S,{\bf x}}[(h_S({\bf x})-\bar h({\bf x}))^2]}_{\text{Complexity}}\ +\ \underbrace{\mathbb E_{{\bf x}}[(\bar h({\bf x})-\bar y({\bf x}))^2]}_{\text{Bias}}\ +\ \underbrace{\mathbb E_{{\bf x},y}[(\bar y({\bf x})-y)^2]}_{\text{Noise}}.\)
The three terms of the decomposition are the Complexity (Variance), the Bias, and the Noise. Before discussing each of them, notice that none of them are ever negative and they sum to the expected error. This means that, if one wishes to keep expected error constant, an increase in Complexity requires a decrease in Bias and vice-versa. Moreover, since none of them can be less than zero, you want to avoid an overly high Complexity, regardless of Bias (and likewise avoid an overly high Bias).
The Noise measures the average variance of the distributions \( D_{\bf x}\). It is not affected at all by \( \mathcal H\) and is intrinsic to the data being studied. As such, it represents an unavoidable lower bound on generalization error.
The Complexity measures how much variation in \( h_S\) can be caused by varying the training data. The more specialized that \( h_S\) is to a particular \( S\), the larger the Complexity. Having high Complexity means you are overfitting — the functions in \(\mathcal H\) that are being selected, based on sampled training data, have more degrees of freedom than the data distribution warrants. 6
The Bias measures how much the expected \( y\)-values vary from the average prediction from models in \( \mathcal H\). That is, after averaging out the variation resulting from different training sets, it is the variance from the expected labels. This part of the error will be larger when \(\mathcal H\) does not contain enough (or not complicated enough) potential functions to fit the data well.
Example.
Let’s consider using linear regression in a simple example. I will assume that there is no Noise — the observed \( y\)-values are equal to \( \bar y({\bf x})\), and there will only be 12 points total in \( \mathcal X\), all with equal weight. Each expected value in the decomposition is, therefore, a simple average.
There are \( 2^{12} - 13 = 4083\) possible training sets, of all sizes,7 and the expected model function \( \bar h\) is given by the line that is the average of the LSR lines for these sample sets.8
I generated 12 points for the set \(\mathcal X\), which appear in the figure to the right. They were generated with the random module in NumPy:9
the array of \( x\)-coordinates being sampled uniformly in the range \([0,4)\) and the \( y\)-coordinates given by a function that also used the random package. More specifically, the coordinates were generated by calling the following:
x = 4 * np.random.random(size=12)
y = (1.6 + np.exp(-(x-2)**2))*x + 1.1 + 0.6*np.random.normal()
Taking advantage of the itertools package in Python, you can obtain all size m+2 subsets of the 12 indices of the points with combinations. For example, inside the loop over m (and after import itertools), set tuples = itertools.combinations(range(12), m+2).
Calculate and store the slopes, intercepts, and mean squared errors of the LSR lines for all \( \binom{12}{m+2}\) training sets of size m+2 in a for loop, and do so for each m in range(11) (so m+2 will go between 2 and 12).
from sklearn import metrics   # provides mean_squared_error
# slopes, intercepts, mses: empty lists initialized before the loop over m,
# so the results accumulate over all of the training sets
for t in tuples:
    x_list = [x[t[i]] for i in range(m+2)]
    y_list = [y[t[i]] for i in range(m+2)]
    opt = np.polyfit(x_list, y_list, 1)    # LSR line fit to this training set
    yhat = opt[0]*np.array(x) + opt[1]     # its predictions at all 12 x-values
    slopes.append(opt[0])
    intercepts.append(opt[1])
    mses.append(metrics.mean_squared_error(yhat, y))
Since each single \( S\) is equally likely, the expected model function is the line with slope np.mean(slopes) and with intercept np.mean(intercepts). This line is drawn in black below, the LSR line of all 12 points is in green, and a plot of the LSR lines for 1/8 of the subsets is to the right.
The Complexity is found by calculating the average variance (over x) between the models \( h_S\) and the expected model \( \bar h\). In this example, the Complexity is around 1.1. The MSE of the expected model — which is the Bias here as we assumed no Noise — is approximately 0.902. Adding these together gives the expected MSE of the various lines (up to rounding error). That is, by defining yhat to be equal to the \( y\)-values for the expected model function, and s to be the sum of the variances between \( h_S\) and \( \bar h\), the output of the following code, on these points, was \( \approx 8.88\cdot 10^{-16}\):
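A sketch of that check (the exact snippet behind the quoted number may differ; it assumes slopes, intercepts, and mses were accumulated over all 4083 training sets, x and y are the 12 generated points, and s is the summed variance just described):

yhat = np.mean(slopes) * np.array(x) + np.mean(intercepts)   # expected model's y-values
complexity = s / len(mses)                                   # average variance of h_S around the expected model
bias = metrics.mean_squared_error(yhat, y)                   # MSE of the expected model (the Bias, since there is no Noise)
print(np.mean(mses) - (complexity + bias))                   # ~0, up to rounding error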
One can repeat the calculation, but for higher degree polynomials: you must keep track of more coefficients (\( d+1\) coefficients for a degree \( d\) polynomial), and change the degree in the line above where polyfit was used.
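For instance, the quadratic version of the fit inside the loop would look like this (storing all three coefficients rather than a slope and an intercept):

opt = np.polyfit(x_list, y_list, 2)                          # degree 2 instead of degree 1
yhat = opt[0]*np.array(x)**2 + opt[1]*np.array(x) + opt[2]
coeffs.append(opt)                                           # keep the whole coefficient array (coeffs is a hypothetical list)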
With this example set of points, using quadratic polynomials, the Complexity came to approximately 117.4, the Bias to about 0.5. For cubic polynomials, the Complexity was approximately 12893 and the Bias was 2.353. The expected model functions are shown in red above.
From one point of view, the choice of algorithm determines a hypothesis set \(\mathcal H\) implicitly. ↩
A uniform distribution on the \(y\)-values could make perfect sense here. Remember we have said that the hypothesis class \(\mathcal H\) is made of constant functions, so how the position of \(y\) might depend on \(x\) is being ignored anyway. ↩
Think about what \( S\) would need to look like to get the average of observed \(y\)-values to be barely below \( b\). While getting one such sample set has the same likelihood as any other one sample (with the same size), the lack of room for the points makes the number of such sets much smaller than the number of sets whose \( y\)-values are spread over the interval. ↩
This can be done for other loss functions too; the MSE simply provides an example that is convenient and mathematically simple. ↩
This section will be littered with notation of the form \( \mathbb E_*\lbrack\ \underline{\ }\ \rbrack\), the asterisk being a few subscripts. The notation stands for the expected value of what appears (the function) in the brackets. That is, it is the average value of that function, where the meaning of average is determined by the subscripts: what you average over.
Computationally, each is a center of mass calculation using the appropriate probability density function. Recall the pdf \( D_{\bf x}\) introduced in a previous post.
As an example, given \(\bf x\) and a function that predicts a \(y\)-value of \(h_S({\bf x})\), the expected value \(\mathbb E_{y}\lbrack(h_S({\bf x}) – y)^2\rbrack\) is equal to \(\int_{\mathcal Y}(h({\bf x}) – y)^2 D_{\bf x}(y)dy\). ↩
The definition of Complexity may make it appear to be unrelated to the accuracy of the model on the data. However, the decomposition that was derived implies differently. ↩
Since the system is under-determined if there is only one point (or none), the subsets of that size are ignored. ↩
In general practice you, of course, cannot consider all possible training sets (even those near a given size). ↩
As in previous posts, the abbreviation np has been made for the package numpy. ↩
ERM with no assumptions. Last post covered the concept of Empirical Risk Minimization (ERM) for finding a model function \( h:\mathcal X \to \mathcal Y\), with a focus on binary classification. Given a training set \( \mathcal X^{train} = S\), the empirical risk \( L_S(h)\) is the fraction of points from \( S\) that \( h\) incorrectly labels. The ERM rule is to use a function, a classifier, that minimizes the empirical risk (training error). However, the example in that post (classifying points in the upper-left square) provided an important lesson:
( small training error ) \( \;\not\!\!\!\implies\) ( small generalization error ).
At least, this is so when \( h\) is permitted to be any possible function — when no assumption is made on \( h\).
To rephrase this setback: though only the training set is accessible, the goal is to learn (approximately) the true distribution \( \mathcal D\). Thinking of nothing else beyond making the training error zero will be insufficient. This may seem insurmountable, but remember that the training set is sampled from \( \mathcal D\) and so it is to be expected (on average) that \( S\) reflects something about \( \mathcal D\), provided the size of the training set is sufficiently large.
ERM with Bias (or, use Science). I’ll begin with a clarification, hopefully one that you assumed. The word bias discussed here is not meant with the negative connotation that it has in a cultural context. The two uses do, however, have similar semantic origin — some prior assumption(s) is(are) being made. In the case of ERM, the assumptions are about the data and \( \mathcal D\). Essentially, to introduce bias, you make the set of possible model functions smaller.
It is, perhaps at first, a bit counter-intuitive, but restricting the set of available classifying functions gives better generalization results. If I were a statistician, I would say that the reason is noise in the data1, 2. I am not a statistician, though.
I would like to take an alternate discussion route (one that dovetails well with the idea of signal vs. noise, I think). From centuries of scientific thought came a hard-won lesson: deciding on the hypothesis after the fact, after examining the data, makes for terrible inferences. The results won’t generalize (in scientific language, won’t be repeatable). That is the essence of the scientific method; and ML is supposed to be, after all, part of data science.
The scientific method: Make a hypothesis before looking at the observations (data). Careful analysis afterwards either supports or rejects the hypothesis.
In the ML setting, deciding that you will only work with certain types of functions is a hypothesis about the true distribution of the data (with labels), \( \mathcal D\). Call the set of possible functions that you might get from your algorithm a hypothesis class.
For example, in the binary classification setting, if your hypothesis class (the set of functions that are possible through the algorithm you use) consists of half-space models, then you are making a hypothesis that the data is linearly separable. Good sample data and a large error should make you reach the conclusion that the population data is not linearly separable.
I will write \( \mathcal H\) for the hypothesis class of functions. For the purposes of the sample \( S\), and the task under consideration, the empirical risk is minimized within \( \mathcal H\), so the function \(h\) that you use satisfies \(h\in\mathcal H\). The testing of the hypothesis involves taking the \(h\) obtained from ERM within \(\mathcal H\) and computing the loss on a test set of data (which was not used to determine \(h\)).
Already, this represents one aspect of learning. If the ERM rule within \( \mathcal H\) produces bad results, then you learn that \( \mathcal H\) is not a good reflection of the true picture, not a good reflection of \( \mathcal D\). Another aspect of learning will come up in the future, when \( L_S(h)\) (something related to it, rather) will be used to “update” the current \( h\) in an attempt to minimize the loss.
1.(Sorry if I just put words into statisticians’ mouths.) 2.The nature of noise is that it does not have predictive value, but it occurs on top of a phenomenon/relationship that can be used for prediction (a signal). The training data (which are observations) have that noise “baked-in.” The goal is to model the signal, not the noise also.
A class \( \mathcal H\) where ERM gives low generalization error, probably.
The goal of this section is to prove that if there are only finitely many functions in \( \mathcal H\), then the ERM rule will give a model that approximately knows the distribution \( \mathcal D\) (i.e. has low true loss on the distribution), with high probability3. This will be done under some assumptions on \( \mathcal H\) and on \( \mathcal D\) and \( S\). It will also concern binary classification problems. Some of the assumptions could be loosened, but are there for simplicity.
To be more precise, let \( h_S \in \mathcal H\) be a model given by the ERM rule for the training set \( S\): \( h_S\) minimizes empirical risk among functions in \( \mathcal H\). The theorem to prove is:
Theorem. Given \( 0 < \delta < 1\) and \( \varepsilon > 0\), then for any \( \mathcal D\) and \( \mathcal H\) (within the assumptions), there is a sufficiently large size of \( S\) such that \( L_{\mathcal D, h^\star}(h_S) \le \varepsilon\), with probability \( 1 – \delta\).
The assumptions.
⁍ Finitely many hypotheses: The set \( \mathcal H\) is finite. Denote its size by \( |\mathcal H|\).
⁍ Realizability: There is a true labeling function in \( \mathcal H\). IOW, there is an \( h^\star \in \mathcal H\) with true loss equal to zero. (Note: the algorithm, which uses \( S\) to determine \(h_S\), might find an \( h_S \ne h^\star\).)
⁍ Independently, identically distributed (i.i.d.) sampling: Each point in \( S\) is sampled only according to the fixed distribution \( \mathcal D\), regardless (independent) of the sampling of the other points.
🐉🐉 Let \( m\) be the number of points in a training set. The i.i.d. sampling assumption implies that the probability of choosing \( S\) is the same as the probability of choosing a length \( m\) list \( [x_1,x_2,\ldots, x_m]\), with each \( x_i\) sampled according to \( \mathcal D\). For example, if \( \mathcal X\) is finite (with \( N\) points, say) and \( \mathcal D\) is the uniform distribution, then each \( x_i\) has probability \( 1/N\), so the probability of each list is \( 1/N^{m}\). (It is possible that \( x_i\) repeats; this is sampling with replacement.) The notation \( S \sim \mathcal D^m\) is sometimes used. For the entire proof, we assume that \( S\sim\mathcal D^m\) every time we write \( S\).
Proof: The proof hinges on identifying a revealing pair of subsets: a subset of “bad” hypothesis functions in \( \mathcal H\) and a subset of “misleading” training samples. The bad set of hypothesis functions is
\( \mathcal H_B = \{h\in\mathcal H\ |\ L_{\mathcal D, h^\star}(h) > \varepsilon\}.\)
Note that a function in this set will have high loss (higher than we want), no matter what training set gets used. The subset of misleading training sets is
\( M = \{S\ |\ \text{there is }h \in \mathcal H_B \text{ where }L_S(h) = 0\}.\)
If you think about the set definition for \( M\), it contains all the training sets that could produce an \( h_S \in\mathcal H_B\) under the ERM rule4. So, for a training set in \( M\) there is a bad hypothesis that looks like a good one, according to \( S\).
We want \( \delta\) to be an upper bound on the probability that a training set comes from \( M\), and it produces an \( h_S \in \mathcal H_B\). That is, we are looking to bound the probability of the event \( \{S\ |\ L_{\mathcal D, h^\star}(h_S) > \varepsilon\}\); by the observation above, this event is contained in \( M\), so it is enough to bound \( {\bf P}(M)\).
For any particular \( h\in\mathcal H_B\), the probability that it correctly labels a single \( x_i\) is equal to \( 1 - L_{\mathcal D,h^\star}(h)\), which is less than \( 1 - \varepsilon\) by the definition of \( \mathcal H_B\). Notice that \( 1-\varepsilon \le e^{-\varepsilon}\) (the line \( 1-\varepsilon\) is the tangent to the convex function \( e^{-\varepsilon}\) at \( \varepsilon = 0\), so it lies below it; see the figure). This means the probability that one \( h\in\mathcal H_B\) will correctly label every point in a size \( m\) sample is less than \( e^{-\varepsilon m}\). Summing over the (at most \( |\mathcal H|\)) functions in \( \mathcal H_B\), we have shown that
\( {\bf P}(M) \le |\mathcal H_B|\, e^{-\varepsilon m} \le |\mathcal H|\, e^{-\varepsilon m}.\)
If we let \( {\displaystyle m \ge \frac{\ln(|\mathcal H|/\delta)}{\varepsilon}}\), then \( |\mathcal H|\,e^{-\varepsilon m} \le \delta\). This implies that the probability that \( L_{\mathcal D,h^\star}(h_S) \le \varepsilon\), which equals \( 1 - {\bf P}(S\ |\ L_{\mathcal D, h^\star}(h_S) > \varepsilon)\), is larger than \( 1 - \delta\). \( \square\)
3.The theorem in this section, and its proof, is found in Chapter 2 in the Shalev-Shwartz and Ben-David book (see reference books). 4.All, because of the Realizability assumption — since \( h^{\star}\in\mathcal H\) has \( L_S(h^\star)=0\) for all \( S\), the only way to get a bad model is for it to also have zero training error on \( S\).
Hypothesis class sizes, in theory and practice. Consider the setting of linear regression on points in the plane. The hypothesis class is the set of functions
\( \mathcal H = \{x\mapsto wx + b\ |\ b, w \in\mathbb R\}.\)
Since two lines cannot agree if their ordered pairs \( (b, w)\) do not agree, this hypothesis class is in bijective correspondence with \( \mathbb R^2\), an infinite set. For linear regression with instances \( {\bf x_i} \in \mathbb R^d\), the model function takes the form \( {\bf w}\cdot{\bf x} + b\), where \( {\bf w} \in \mathbb R^d\). Thus, in that more general case, the hypothesis class for linear regression corresponds to points in \( \mathbb R^{d+1}\).
The hypothesis class for the half-space model, with instances in \( \mathbb R^d\), is in correspondence with points in a space that is “close to” \( \mathbb R^{d}\). What about logistic regression models?
Point of order: while this discussion of infinite hypothesis classes is true in theory, computers use floating-point numbers, and so a computer system only has access to finitely many numbers for each coordinate. The implication is that, in practice, these hypothesis classes are actually finite.
Importantly, finite does not mean small. A typical computer (with 64-bit processor) has a precision of approximately 16 decimal digits (53 bits) and an exponent up to about 3 decimal digits (11 bits). This means that for each exponent there are a little less than \( 10^{16}\) numbers (there is a bit of overlap as you vary exponents, but it is small). So, roughly, there are \( 10^{16}\) distinct numbers for each exponent of 10 and there are approximately 1000 exponents. This makes the number of available numbers about \( 10^{19}\).
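These figures can be checked directly in Python; on a typical 64-bit machine:

import sys
print(sys.float_info.mant_dig)   # 53 bits in the mantissa (about 16 decimal digits)
print(sys.float_info.dig)        # 15 decimal digits are always faithfully represented
print(sys.float_info.max_exp)    # 1024, i.e. exponents reach roughly 10^308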
Thus, when doing half-space models (in practice) with instances having dimension \( d\), the set \( \mathcal H\) has size on the order of \( 3(10^{19d})\) — the model is determined by a \( (d+1)\)-tuple \( ({\bf w}, b)\), each coordinate has about \( 10^{19}\) possibilities, and some of these tuples determine the same function5. If considering the above theorem, recall the realizability assumption. If the “true” distribution is linearly separable with a separating hyperplane in \( \mathcal H\), then the theorem applies.
For example, under the assumptions and in dimension two, the theorem guarantees the resulting half-space model is accurate 99% of the time (\( \varepsilon = 0.01\)) with probability 99% (\( \delta=0.01\)) if the sample \( S\) has at least \( 100\cdot\ln(300\cdot 10^{38}) \approx 9,300\) points6. (The bound is pretty rough; in fact, you don’t need nearly so many points for the highly probable, high accuracy.)
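The bound in the theorem is a one-line computation (using the rough hypothesis count from above for \( d = 2\)):

import numpy as np
eps, delta = 0.01, 0.01
H_size = 3 * 10.0**38                 # rough count of half-space models in dimension 2
m = np.log(H_size / delta) / eps      # ln(|H| / delta) / epsilon
print(m)                              # approximately 9300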
OTOH, recall the model function \( h^{\times}\), discussed near the end of the last post, for deciding that a point in \( [0,1]\times [0,1]\) is not in the upper-left square: it returns the training label \( y_i\) on a training point \( {\bf x_i}\), and returns -1 on every other point.
Allowing for \( S\) to be any possible training set, this also defines a finite hypothesis class, in practice (since floating point numbers only allow for finitely many points in the space). However, the class is huge. Does a large \( S\) guarantee 99% accuracy with probability over 99%?
Note: there are approximately \( 10^{16}\) numbers for each coordinate, so there are \( 10^{32}\) points in the instance space. Even if \( |S| = 10^{31}\) (an absurdly large training set), the function \( h_S\) will still only be accurate on \( 10^{31} + (9\cdot 10^{31})/4 = (1/10 + 9/40)\cdot 10^{32}\) points, which is a loss of \( 0.675 = 1 - 0.325\) (it is wrong 67.5% of the time).
5.Multiplying \( b\) and all the coordinates of \( {\bf w}\) by the same positive number does not change the half-space model function. That means pairs \( ({\bf w}, b)\) and \( ({\bf w’}, b’)\) determine the same element of \( \mathcal H\) if there is a number \( c>0\) so that \( {\bf w} = c{\bf w’}\) and \( b = cb’\). Thus every pair \( ({\bf w}, b)\) determines the same model as \( ({\bf w’}, 1)\), \( ({\bf w’}, 0)\) or \( ({\bf w’}, -1),\) where \( |b|{\bf w’} = {\bf w}\) if \( |b|\ne 0\) and \( {\bf w’} = {\bf w}\) otherwise. The count then follows, there being \( 10^{19d}\) possible vectors \( {\bf w’}\). 6.Note that size of \( S\) needed grows linearly in the dimension, with a large slope if \( \varepsilon\) is small.
We have under our belt two basic machine learning models and their related ideas – for classification problems, the half-space model; for modeling problems, linear regression. Upon considering how those models are determined, in particular how linear regression was determined, there is a very sensible question to ask: How is this learning? In this post1, we’ll take a step back to reconsider the overarching goal for ML model functions — minimizing the mistakes the model will make on yet unseen data. The probability of a mistake is often called the risk of the model, the generalization error, or the (true) loss.
The perspective that considers a model’s loss will make it possible to generate more powerful predictors, and it will provide the setting to make what we are doing seem more like learning.
Introducing Probability
In the Binary Classification post, I stated that a machine learning task roughly attempts to solve one of two types of problems: a classification problem (given \( {\bf x}\), what type of thing is it?) or a modeling/regression problem (what relationship does \( y\) have to \( {\bf x}\)?). In that discussion, I described the classification problem as an attempt to approximate a “true” labeling function \( h^\star: \mathcal X \to \mathcal Y\) with another function \( h\). For classification, \( \mathcal Y\) is a finite set containing elements that represent each possible type. The modeling problem can be described similarly, with \( \mathcal Y\) being a continuum (for example, \( \mathbb R\) or an interval of \( \mathbb R\)).
Here I describe machine learning tasks from a slightly different, but not incompatible, perspective. In the tasks of ML, uncertainty (arising from incomplete information, from necessary simplifications, or from physical nature itself) must not be assumed away, not completely anyway.
Instead of looking for a function so that \( h({\bf x}) \in\mathcal Y\) for each \( {\bf x}\in\mathcal X\), perhaps we make a random variable for each \(\bf x\). That is, we consider a sample space of observations of getting the point \(\bf x\) (from the data setting), rather than just \({\bf x}\in\mathbb R^n\). The target space is the labels \(\mathcal Y\) (or a way of placing the labels in \(\mathbb R\)), and the probability mass function (pmf)2 defined on \(\mathcal Y\) is often thought of as being the conditional probability \(p(y\ |\ {\bf x})\). (See this post for a refresher on random variables.)
The pmf (or pdf, possibly) determined by \({\bf x}\) is sometimes called the distribution, given \({\bf x}\), so I will use the notation \( D_{\bf x}: \mathcal Y \to \mathbb R\) for this pmf. If it is a pmf, then we understand its properties by summing over labels in \(\mathcal Y\); if it is a pdf, then we use integrals over \(\mathcal Y\) instead. Merely to simplify notation, from now on I will write statements in terms of integrals over \(\mathcal Y\), or a part of \(\mathcal Y\). If, in an example, the random variable is discrete, then I ask you, the reader, to make it the appropriate sum in your head. So, we know that
\( {\displaystyle\int_{y\in\mathcal Y} D_{\bf x}(y)\ dy = 1}.\)
I already discussed one model in this way, in the topic of the logistic regression model. The sigmoid function that was used, \( f({\bf x}) = \sigma(k({\bf w}\cdot{\bf x} + b))\), did not have values in \( \mathcal Y = \{+1,-1\}\). Instead, you think of the model assigning to \( {\bf x}\) the distribution \( D_{\bf x}(1) = f({\bf x})\), \( D_{\bf x}(-1) = 1 – f({\bf x})\). 3 Knowing this distribution, one could get a deterministic function so that \(h({\bf x})=1\) when \(f({\bf x}) > 0.5\) and \(h({\bf x})=-1\) otherwise (an instance of using MLE, I suppose). However, the value of \(f({\bf x})\) gives additional information, about your certainty of the label.
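A tiny sketch of that reading of logistic regression (the parameter values are made up):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w, b, k = np.array([1.5, -2.0]), 0.3, 1.0     # made-up logistic regression parameters
x = np.array([0.8, 0.1])

p_plus = sigmoid(k * (np.dot(w, x) + b))      # D_x(+1)
p_minus = 1 - p_plus                          # D_x(-1)
label = 1 if p_plus > 0.5 else -1             # the deterministic rule mentioned above
print(p_plus, label)                          # ~0.79 and 1: fairly confident in the +1 label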
Returning to the ML task, assume still that there is a true labeling function. So for each \( {\bf x}\in\mathcal X\) there is a label \( h^\star({\bf x}) \in\mathcal Y\). However, this function is unknown and, for interesting tasks, unknowable. There is a joint probability distribution \( \mathcal D\) on \( \mathcal X \times\mathcal Y\) — a random variable for observations and a random variable for labels — so that observation, label pairs are drawn from \(\mathcal D\), so \(({\bf x}, h^\star({\bf x})) \sim\mathcal D\). Notice that \( D_{\bf x}\) is the conditional distribution, for \( \mathcal X={\bf x}\).
One can only access a training set \( \mathcal X^{train}\) and its labels \( \mathcal Y^{train}\); these were sampled from the distribution \( \mathcal D\), which is also unknown (if we knew \( \mathcal D\) and \( h^\star\) already, then there is no need for ML!).
The goal of an ML algorithm is to produce a model that approximates \( h^\star\) well, with the only information being \( \mathcal X^{train}\) and \( \mathcal Y^{train}\). To make “approximates well” more concrete, a model \( h\) will have an error or loss. The loss can be thought of as follows: sample \( ({\bf x}, y)\) from \( \mathcal D\), compute the model’s label \( h({\bf x})\). What is the probability that it was wrong? It is the measure, with respect to \( \mathcal D\), of those \( {\bf x}\) where \( h({\bf x}) \ne h^\star({\bf x})\). This probability of being wrong is the risk, true loss, or generalization error of \( h\) and is equal to
\( L_{\mathcal D, h^\star}(h) = {\bf P}_{({\bf x},y)\sim\mathcal D}\big[h({\bf x}) \ne h^\star({\bf x})\big].\)
It is impossible to compute \( L_{\mathcal D, h^\star}(h)\). Is there something that can be computed, and would represent a best guess to the loss? Since only the training set is accessible, you could compute the number of training labels the model gets wrong, divided by the number of training points:
\( {\displaystyle L_{S}(h) = \frac{\left|\{i\ :\ h({\bf x_i}) \ne y_i\}\right|}{m}}.\)
The \( S\) is for notational simplicity, but is just the training set, \( S = \mathcal X^{train}\), with the letter \( S\) standing for “sampled”; the size of \(S\) is \(m\). Each \(\bf x_i\) in the equation above is one of the points in \(S\). The function \( L_{S}(h)\) is often called the empirical risk or training error of \( h\).
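Computationally, the empirical risk is just a count of disagreements over the sample. A minimal sketch (the names are generic, not tied to any particular model):

import numpy as np

def empirical_risk(h, S, labels):
    # fraction of training points that the model h labels incorrectly
    predictions = np.array([h(x) for x in S])
    return np.mean(predictions != np.array(labels))

# example: a (bad) classifier that always answers +1
S = np.array([[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]])
labels = np.array([1, -1, 1])
print(empirical_risk(lambda x: 1, S, labels))   # 1/3, since one of the three labels is -1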
With a measure for the training error, it seems intuitive to find an \( h\) so that \( L_{S}(h)\) is minimized. This is the paradigm of Empirical Risk Minimization (ERM).
ERM alone is not enough
Consider the binary classification problem, \( \mathcal Y = \{1, -1\}\). Given training data \( S=\mathcal X^{train} = [{\bf x_1},\ldots,{\bf x_m}]^\top\) and \( \mathcal Y^{train}=(y_1,\ldots,y_m)\), you could try using the following function to classify a point \(\bf x\):
\( h^{\times}({\bf x}) = \begin{cases} y_i, & \text{if }{\bf x} = {\bf x_i}\text{ for some }{\bf x_i}\in S \\ -1, & \text{otherwise.}\end{cases}\)
Notice that \( L_{S}(h^{\times})\) is always equal to 0, for any sampled training set coming from a distribution \( \mathcal D\). That seemed too easy; you should be skeptical.
Does \( h^{\times}\) actually give a bad classifying function, even though its training error is zero?
Consider an example situation. It will hopefully make the absurdity of the model \( h^{\times}\) apparent.
Suppose that the instance space, the set of potential data points, is \( \mathcal X = [0,1]\times[0,1]\) (a data point could potentially be any point in the plane which has both of its coordinates between 0 and 1). The task is to decide if a point is outside the red square depicted below: if a point is in the square, its correct label is -1, and a point outside the square has label +1.
Suppose that \( \mathcal X^{train}\) is the set of 50 points that are displayed. There are 37 points outside the upper-left, red square and 13 points inside it. Then the definition of \( h^{\times}\) causes it to assign +1 to the 37 points that are blue and -1 to those which are red.
Spot check. By its definition, what is \(h^{\times}({\bf x})\) for some point \(\bf x\in\mathcal X\) that is not in the set of 50 training points?
The points in \(\mathcal X\) are in the plane and uniformly distributed. Since a set of 50 points has area zero in the plane, the probability of picking one of them in a random draw from \(\mathcal X\) is zero. So, being correct on those points is immaterial to the generalization error. It is necessarily the case that \( L_{\mathcal D, h^\star}(h^\times) = 3/4\): with probability one, points in the square and out of the square get label -1, making \( h^\times\) wrong 3/4 of the time.
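A quick simulation bears the 3/4 out (a sketch; I am taking the red square to be the quarter with first coordinate below 1/2 and second coordinate above 1/2, and the training points can be ignored since they are drawn with probability zero):

import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((100000, 2))                                         # uniform draws from [0,1] x [0,1]
true_label = np.where((pts[:, 0] < 0.5) & (pts[:, 1] > 0.5), -1, 1)   # -1 inside the upper-left square
pred = -np.ones(len(pts))                                             # h-cross answers -1 off the training set
print(np.mean(pred != true_label))                                    # approximately 0.75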
While the example may have seemed somewhat artificial, it does clearly illustrate the issue with the classifier \( h^{\times}\). So, the approach that simply attempts to minimize empirical risk (with no requirement on the function used, at least) produces bad results.
Moving forward
In a way, one that must be overcome, the function \( h^{\times}\) focused too much on the training set, and on getting 0 for the training error. What might be done to improve how the model works on non-training data?
The model function \( h^{\times}\) is an example of overfitting. Another example can occur in the polynomial generalization of linear regression. Say you use a degree \( m-1\) polynomial with a training set of size \( m\). Such a polynomial would fit the training data perfectly (for essentially every training set), so it would have training error 0. But this happens simply because you sampled \(m\) points from \(\mathcal D\), not due to some relation between points in \(\mathcal X\) and those in \(\mathcal Y\) (arising from what the distribution \( \mathcal D\) looks like). Such a high degree polynomial tells you essentially nothing about \( \mathcal D\), and so the regression will have high generalization error.
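A short check of that claim (made-up data; \( m = 6\) points and a degree \( m-1 = 5\) polynomial):

import numpy as np

rng = np.random.default_rng(2)
x, y = rng.random(6), rng.random(6)                  # 6 training points with no real relationship
coeffs = np.polyfit(x, y, 5)                         # degree m-1 = 5 polynomial
print(np.max(np.abs(np.polyval(coeffs, x) - y)))     # essentially 0: a perfect fit on the training set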
What can be done to avoid overfitting since only the training data can be accessed? A valuable approach, somewhat surprisingly, is to restrict the types of model functions allowed. This is called introducing bias. By saying the model must be of a certain type, you only allow for a selected kind of relationship between \( {\bf x}\) and \( \hat y\). The next post will discuss introducing bias.
This post follows the first part of Chapter 2 in the Shalev-Shwartz and Ben-David book (see reference books). ↩
Or the probability density function (pdf), in the event of a continuous random variable. ↩
So, \(\bf x\) determines a Bernoulli distribution. ↩
In the last post I discussed how to find the LSR line for a set of points in the plane, that is, for a set of instances \(\mathcal X^{train} = [x_1,x_2,\ldots,x_m]^{\top}\) that are in \(\mathbb R\), with labels \(\mathcal Y^{train} = (y_1,y_2,\ldots,y_m)\) that are in \(\mathbb R\). Here I will describe a Python implementation of that process “from scratch,” meaning I will create the needed matrix and compute the coefficients for the regression from it. I will also show how to use built-in functions from the NumPy package.
The data I will use for the example comes from the 1985 Ward’s Automotive Yearbook from the UCI Machine Learning Repository. I will also briefly cover the coefficient \(R^2\), an alternative to the MSE for measuring how well the line fits the data.
Recall from the last post that the intercept \(\hat b\) and slope \(\hat w\) of the LSR line are the coordinates of the vector
\( \hat{\bf x} = \begin{bmatrix}\hat b \\ \hat w\end{bmatrix} = (A^{\top}A)^{-1}A^{\top}{\bf y}, \qquad (1)\)
where \(A\) is the matrix having \(m\) rows of the form \([1, x_i]\), for \(i=1,2,\ldots,m\).
In a Python environment, I import NumPy with abbreviation np, and say that I call my list of instances x_list (so this is a list containing \(x_1,x_2,\ldots\)). Then, I define A by
A = np.array([ [1, x] for x in x_list ])
Next, calculate \(A^{\top}A\) and store it in the variable AtA by calling
AtA = np.dot(A.transpose(), A)
The matrix \(A\) has only two columns and, in general, \(A^{\top}A\) is invertible if the columns of \(A\) are independent. This means that, as long as there is more than one unique number in x_list, then AtA will be invertible. Setting y_list to be the list of labels, I can get the solution in (1) with, for example,
x_hat = np.dot(np.linalg.inv(AtA), np.dot(A.transpose(), np.array(y_list)))
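As a standalone sanity check on a tiny made-up data set, the normal-equations solution agrees with np.polyfit (up to the order of the coefficients):

import numpy as np

x_list = [0.0, 1.0, 2.0, 3.0]
y_list = [1.1, 1.9, 3.2, 3.9]
A = np.array([[1, x] for x in x_list])
AtA = np.dot(A.transpose(), A)
x_hat = np.dot(np.linalg.inv(AtA), np.dot(A.transpose(), np.array(y_list)))

print(x_hat)                            # [intercept, slope]
print(np.polyfit(x_list, y_list, 1))    # [slope, intercept] -- the same two numbers, reversed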
I will show an example of how this works on two columns from the Automotive Yearbook dataset: the column containing the engine-size of the vehicle, and the column with the city-mpg, the miles per gallon for side-street travel.
I import the package Pandas with abbreviation pd, and I make a DataFrame called auto_df. The data has been made available as a CSV file,1 with no header line, meaning we must create headers manually. Examining the Attribute Information section at the description page for the data, make a list called columns with a header name for each attribute — for the columns of interest, these names are 'engine-size' and 'city-mpg'.
A cursory glance at the data in auto_df indicates a few outliers. When I create x_list and y_list, I remove these outliers. This is shown below (the outliers are indices 55-58).
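A sketch of that cleanup (hypothetical details: I am assuming the outliers sit at index labels 55 through 58 of auto_df):

auto_clean = auto_df.drop(index=[55, 56, 57, 58])    # drop the outlier rows
x_list = list(auto_clean['engine-size'])
y_list = list(auto_clean['city-mpg'])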
Upon doing so x_hat can be found as described above. The solution has \(\hat b \approx 39.7\) and \(\hat w \approx -0.11\).
To the right is a plot of the points (in black) that have coordinates from x_list and y_list along with a plot of the regression line \(\hat y = \hat w x + \hat b\) (in red) that was found in the previous section.
While there is clearly a negative correlation between the quantities in the plot, and the regression line has the lowest MSE among lines, it doesn’t fit spectacularly well. In an effort to have a measure of error whose size is scale independent, we can consider what is called the “coefficient of determination,” \(R^2\), which is related to the MSE and is between 0 and 1. To compute \(R^2\) from the MSE use the formula
\( {\displaystyle R^2 = 1 - \frac{\text{MSE}}{v_y}},\)
where \(\bar y\) is the average of the \(y_i\) and \(v_y\) is the average squared distance of the \(y_i\) to \(\bar y\).
When \(R^2\) is close to 1, the points are a better fit to the line. For this dataset, the MSE is 20.726 and the coefficient \(R^2\) is 0.508.
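In code, with x_hat from the earlier computation, both numbers fall out in a few lines (a sketch; the exact rounding may differ):

b_hat, w_hat = x_hat                              # intercept and slope of the LSR line
yhat = w_hat * np.array(x_list) + b_hat           # the line's predictions
mse = np.mean((yhat - np.array(y_list))**2)
R2 = 1 - mse / np.var(y_list)                     # np.var is the average squared distance to the mean
print(mse, R2)                                    # should come out near 20.726 and 0.508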
By using a higher degree polynomial, we can make \(R^2\) be closer to 1. To get a quadratic regression polynomial, define A so that it has three columns:
A = np.array([ [1, x, x**2] for x in x_list ])
With this matrix A, the solution x_hat (found in an identical manner) contains the three coefficients of the polynomial. In other words, you want to fit a quadratic polynomial \(\hat y = \hat ax^2+\hat bx + \hat c\) to the data, and x_hat, the solution to (1), is now a 3-component vector, such that x_hat = [c_hat, b_hat, a_hat]. The graph of the polynomial is shown to the right, in red. In this example, x_hat = [56.834, -0.342, 0.001], so the LSR quadratic polynomial is
\(\hat y = 0.001x^2 - 0.342x + 56.834\).
The computation of \(R^2\) comes out as \(R^2 = 0.615\).
Polynomial regression may also be carried out using the function polyfit from the NumPy package. The function takes as input the list of \(x\)-coordinates (called x_list above), the list of \(y\)-coordinates (called y_list above), and lastly the degree of the polynomial. So, for linear regression that number should be 1. To compute the quadratic polynomial that I computed above with a matrix, the command
np.polyfit(x_list, y_list, 2)
will output the coefficients [0.001, -0.342, 56.834]. Note that the list of coefficients from polyfit begins with the coefficient of the highest exponent in the polynomial.