In this post I will introduce a new loss minimization approach that is iterative and lets the algorithm “learn” the parameters of a good model. The starting place is the gradient of a function from calculus.
The gradient. Recall that if \( f:\mathbb R^{d+1} \to \mathbb R\) is a function, then the gradient of \( f\) is the vector of its partial derivatives, 1
\( \nabla f({\bf w}) = \left(\frac{\partial f}{\partial w_1}({\bf w}), \ldots, \frac{\partial f}{\partial w_d}({\bf w}), \frac{\partial f}{\partial w_{d+1}}({\bf w}) \right)\).
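For a concrete example (with \( d = 1\)): if \( f(w_1, w_2) = w_1^2 + 3w_1w_2\), then \( \nabla f({\bf w}) = (2w_1 + 3w_2,\ 3w_1)\); at the point \( {\bf a} = (1, 2)\) this gives \( \nabla f({\bf a}) = (8, 3)\).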
One of the two most important facts about the gradient that you learned in calculus was the following.
🐉🐉 (Some multi-variable calculus appears in this part. You may want to review it!)
Theorem. 2 At a point \( {\bf a}=(a_1,\ldots,a_d, a_{d+1})\) in the domain of \( \nabla f\), the values of \( f\) decrease most rapidly in the direction of \( -\nabla f({\bf a})\).
Proof. Finding the direction in \( \mathbb R^{d+1}\) in which the values of \( f\) decrease most rapidly is the same as finding the length 1 vector \( {\bf u}\) for which the directional derivative \( D_{\bf u}f({\bf a})\) has its lowest value. Recall that
\( D_{\bf u}f({\bf a}) = {\bf u} \cdot \nabla f({\bf a})\),
which means that, for any unit vector \( {\bf u}\), if \( \theta\) is the angle between \( {\bf u}\) and \( \nabla f({\bf a})\) then \( D_{\bf u}f({\bf a}) = |\nabla f({\bf a})|\cos\theta\ge -|\nabla f({\bf a})|\), since \( \cos\theta\ge -1\). If you set \( {\bf v} = -\frac{\nabla f({\bf a})}{|\nabla f({\bf a})|}\) then by calculating \( {\bf v} \cdot \nabla f({\bf a})\) you see that \( D_{\bf v}f({\bf a}) = -|\nabla f({\bf a})|\). So the direction of \( -\nabla f({\bf a})\) gives the minimal value of the directional derivative. \( \square\)
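If you want to check the theorem numerically, here is a small sketch (the function \( f\) and the point \( {\bf a}\) are the example choices from above): it samples many unit vectors \( {\bf u}\) and compares the smallest sampled directional derivative to \( -|\nabla f({\bf a})|\).

import numpy as np

# f(w) = w1^2 + 3*w1*w2 has gradient (2*w1 + 3*w2, 3*w1)
def grad_f(w):
    return np.array([2 * w[0] + 3 * w[1], 3 * w[0]])

a = np.array([1.0, 2.0])
g = grad_f(a)

# sample many unit vectors u and compute D_u f(a) = u . grad f(a)
rng = np.random.default_rng(0)
u = rng.normal(size=(100_000, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)

print((u @ g).min())        # close to ...
print(-np.linalg.norm(g))   # ... the theoretical minimum -|grad f(a)|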
Gradient descent. Say you have a model function for a data set and you would like to change the model to improve it, in the sense of decreasing its error. If you can understand the gradient of the loss function, then the above theorem presents a natural path to take:
The direction of the negative gradient is the direction of greatest decrease, so alter the model function a little bit in the direction of the negative gradient.
The amount that is “a little bit” must be chosen. Notice that I said to alter the model function in the negative gradient direction. Since the partial derivatives are with respect to model parameters, the gradient direction is in the space of parameters, and you are changing the model’s parameters a bit (as directed by the gradient). In gradient descent, you iterate this multiple times, getting closer to a local minimum each time.3
Note that the gradient is the gradient of the loss function, not of the function that is determining the predictions. It is the model function’s parameters that you alter, based on this gradient.
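As a minimal sketch of the iteration (the one-variable quadratic here is my own stand-in for a real loss function, not a loss arising from data):

# toy loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
def grad_L(w):
    return 2 * (w - 3)

eta = 0.1    # "a little bit": the multiple of the gradient used for each step
w = 0.0      # starting parameter value
for t in range(25):
    w = w - eta * grad_L(w)   # step in the direction of the negative gradient
print(w)     # close to 3, the minimizer of L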
The following two examples illustrate why you must be careful to take small steps in the direction of the negative gradient. The steps are taken by subtracting a multiple \( \eta\) of the gradient. In the first example, the toy loss function has only one local minimum. Even in that setting, changing \( \eta\) by only 0.1 causes the sequence of values to diverge from the minimum instead of converging to it. (The numbers 1–8 indicate the iteration, i.e. how many times the gradient was computed and \(\eta\) times that gradient was subtracted from the input.)
In the second example, the function has more than one local minimum. If the steps in the direction of \( -\nabla f\) are too large, as in the figure below on the right, then the descent can “escape” from the vicinity of one local minimum and end up near a different one, which could have much larger nearby values.
In machine learning, particularly in the training of neural networks, the subtracted multiple \( \eta\) of the gradient is called the learning rate. Choosing a good value of \( \eta\) is somewhat elusive. As the above examples demonstrate, it can have a dramatic effect on the algorithm’s convergence to a model that approximately minimizes loss.4
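To make the sensitivity concrete, here is a sketch in the spirit of the first example above (the quadratic and the specific values of \( \eta\) are my own stand-ins, not the function in the figures):

# same toy loss L(w) = (w - 3)^2 as above, with gradient 2*(w - 3)
def grad_L(w):
    return 2 * (w - 3)

for eta in (0.9, 1.0, 1.1):
    w = 0.0
    for t in range(8):
        w = w - eta * grad_L(w)
    print(eta, w)   # eta = 0.9 converges toward 3; 1.0 oscillates; 1.1 diverges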
(Example) Gradient descent for linear regression.5 Suppose that \( \mathcal D\) is the underlying distribution (in the world) that governs two quantities that are being modeled by simple linear regression — for example, say that you are using linear regression to model the price of office space compared to its square footage. The linear models are determined by a slope \( w\) and intercept \( b\), and the squared error of a model \( (w, b)\) has an expected value \( \mathbb E_{\mathcal D}[(wx+b - y)^2]\), where \( (x,y)\) is drawn from \( \mathcal D\), \( x\) being square footage and \( y\) being the observed price. Write \({\bf w} = (w, b)\) and use \( mse_{\mathcal D}({\bf w})\) for this loss function (the generalization error6). Write \( \nabla mse_{\mathcal D}({\bf w})\) for the gradient. In step \( t\) of the gradient descent process, if \( {\bf w}_t\) is (the parameters of) the current model, the next model is
\( {\bf w}_{t+1} = {\bf w}_{t} - \eta \nabla mse_{\mathcal D}({\bf w}_t)\). (\( \dagger\))
However, direct knowledge of \( {\mathcal D}\) is not available, so \( \nabla mse_{\mathcal D}\) cannot be computed and an alternate approach must be used. Stochastic gradient descent is that alternative: it provides a process to (approximately) carry out gradient descent in practice, given a training set.
Stochastic Gradient Descent (SGD). The idea of stochastic gradient descent (SGD), in the setting of data analysis/machine learning, is to replace (in step \( t\)) the gradient of the true loss function with the gradient of an empirical loss function (on some of the available data). In other words, use a random process to select a subset \( S^t \subset \mathcal X^{train}\) of the training set. Then, defining \( {\bf v^t} = \nabla \text{MSE}_{S^t}({\bf w}_t)\) (or the gradient of whichever loss function on \(S^t\) you are using), replace (\( \dagger\)) with
\( {\bf w}_{t+1} = {\bf w}_{t} - \eta {\bf v^t}\). (\( \ddagger\))
Note that \( {\bf v^t}\) is a random variable, depending on \(S^t\). The process used to find \( S^t\) should be such that the expected value of \( {\bf v^t}\) is close to \( \nabla mse_{\mathcal D}({\bf w}_t)\), provided \( \mathcal X^{train}\) is a sufficiently representative sample of data drawn from \( \mathcal D\).
Using random, from the NumPy library (as usual, numpy is abbreviated as np), you can generate the desired subset \( S^t\) in the following way.7
import numpy as np

def find_St(data_points, labels, St_size):
    m = len(data_points)
    # selects St_size indices for points, in range(0, m)
    index_list = np.random.choice(m, size=St_size)
    return data_points[index_list], labels[index_list]
Using this, SGD can easily be implemented for linear regression with the mean squared error on \( S^t\) as the loss function (call that function \( \text{MSE}_{S^t}\)). The goal is to determine an appropriate slope and intercept \( {\bf w} = (w, b)\), so view these as the input to the function. This means that the gradient uses partial derivatives with respect to \( w\) and \( b\). These partial derivatives are
\(\displaystyle \frac{\partial}{\partial w}\text{MSE}_{S^t}(w, b) = \frac{2}{m_t}\sum_{i=1}^{m_t} (wx_i + b - y_i)x_i\)
\(\displaystyle \frac{\partial}{\partial b}\text{MSE}_{S^t}(w, b) = \frac{2}{m_t}\sum_{i=1}^{m_t} (wx_i + b - y_i)\),
where each \( x_i\), with label \( y_i\), is in \( S^t\) and there are \( m_t\) points in \( S^t\). Put these together to make the vector \({\bf v^t} = \nabla \text{MSE}_{S^t}({\bf w}_t)\), where you substitute the values of \(w\) and \(b\) at step \(t\) into the expressions. Starting with any old line, with slope \( w_0\) and intercept \( b_0\), a (small enough) step in the direction of \( -{\bf v^t}\) will tend (that is, on average) to take the (slope, intercept) parameters closer to \( (\hat w, \hat b)\), the slope and intercept of the least squares regression (LSR) line.
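In code, the two partial derivatives translate directly; here is a sketch (assuming, as in find_St, that data_points and labels are NumPy arrays and that numpy is imported as np):

def grad_MSE(w, b, data_points, labels):
    # gradient of MSE_{S^t} at (w, b), from the two partial derivatives above
    m_t = len(data_points)
    residuals = w * data_points + b - labels
    dw = (2 / m_t) * np.sum(residuals * data_points)
    db = (2 / m_t) * np.sum(residuals)
    return np.array([dw, db])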
For an example of implementing this: the 150 points displayed below-left are taken from the line \( y = 4x + 2\), with some noise. I proceeded through only 15 steps of SGD, starting with \( (w_0,b_0) = (0, 0)\) and using the find_St function with the size of \( S^t\) equal to 30 on each step. With the \( S^t\) found, I determined \( {\bf v^t}\) as described above, and used learning rate \( \eta = 0.1\). The lines for all 16 (slope, intercept) pairs (including the initial one) are plotted on the right below, colored from blue to red as \(t\) increases. The sequence appears to be “settling” towards the line \( y = 4x + 2\); the line at the last step is approximately \( y = 4.1x + 1.58\).
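For reference, here is a sketch of the full loop for this experiment (the synthetic data below is my own stand-in for the 150 noisy points, grad_MSE is the helper sketched above, and exact outputs will vary with the random draws):

import numpy as np

# 150 points from y = 4x + 2, with some noise (a stand-in for the data above)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=150)
y = 4 * x + 2 + rng.normal(scale=0.5, size=150)

w, b = 0.0, 0.0        # (w_0, b_0)
eta = 0.1              # learning rate
for t in range(15):    # 15 steps of SGD
    St_points, St_labels = find_St(x, y, St_size=30)
    v = grad_MSE(w, b, St_points, St_labels)   # v^t
    w = w - eta * v[0]
    b = b - eta * v[1]
print(w, b)            # drifts toward (4, 2) on average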
- For the end goal here, I will have functions that depend on a set of parameters (which, in turn, determine an ML model function), and I will need to take partial derivatives with respect to those parameters. In order to highlight this, I will use \({\bf w} = (w_1,\ldots,w_d, w_{d+1})\) for the input of the function. You could think of \(w_{d+1}\) as the “shift” parameter \(b\). ↩
- This result requires that the partial derivatives of \(f\) be continuous at \({\bf w} = {\bf a}\). ↩
- As you get closer to a local minimum, the gradient gets closer to zero, and you move by less and less each step. ↩
- For many loss functions, one can prove that the gradient descent process converges for an appropriate choice of \( \eta\). For example, this is known for differentiable convex functions (and even something a bit more general), but we won’t get into that. Importantly, there are cases in ML where, on its face, the loss function seems to be one for which convergence is guaranteed, but it is not! Great care is needed… ↩
- Gradient descent is useful in many settings; linear regression is only an example. In the next post, we will discuss using a variant of it for logistic regression when the data is not linearly separable. This will lay the foundation of a neural network. ↩
- The generalization MSE of \( wx + b\) is the average value of \( (wx + b - y)^2\) over all \( (x,y)\) (all office space sales), not the average over just the data that is available. ↩
- The inputs data_points and labels should both be NumPy arrays. The generation of index_list uses sampling with replacement, which can be changed easily (see the NumPy documentation). ↩