Backpropagation and training a neural network


Once you have decided what architecture your neural network will have (number of layers, vertices and edges within each layer, and the type of activation function, among numerous other possible choices), and you have a data set for training, how do you actually train the network? That is, how do you determine the weights on the various edges? The short answer is that you use stochastic gradient descent (SGD). However, for a neural network there is much more to SGD than there is for a single logistic regression model.

The backpropagation algorithm is a procedure for carrying out SGD on a neural network. I’ll explain it here. After that I will give an example of implementing a neural network.


The general idea of the backpropagation algorithm is clean and straightforward, but a complication arises in the implementation that muddies the picture somewhat once the details are filled in.

The General Idea. We want to compute partial derivatives of the loss function with respect to weights in the network for the purpose of stochastic gradient descent. First, compute the partials of the loss function with respect to the weights in the last layer. Then, since (mathematically speaking) the layers in neural networks are functions being composed together, the chain rule from calculus should allow us to reuse the partials that were already computed in order to more quickly find the partials with respect to weights in the second-to-last layer. After that, we can use the chain rule again to quickly find partials with respect to weights in the third-to-last layer, and so on in that manner, going backward through the network.

The reason the implementation gets a bit complicated is that the weights in the network are not themselves functions of the input data passed through the network; the composition of functions occurs in the act of passing data through ("forward propagation"). The weights do get used in forward propagation, however, which is what allows the idea to work, at the cost of some extra bookkeeping.

Recall the notation that was set up in the introductory post on Neural Networks. For simplicity, assume that the output layer \( V_L\) has one vertex. We’ll allow the activation to be any function \(a\), not only the sigmoid function. For the time being, consider just one sample point \( ({\bf x}, y)\) from the data set, \( y\) being the label.

Furthermore, say there is a current set of weights on the network (when initializing the network, use some random process to generate initial weights). Write \( {\bf W}\) for the current weights, thought of as a list of matrices \({\bf W}=\lbrack{\bf W}^{0}, {\bf W}^{1}, \ldots, {\bf W}^{L-1}\rbrack\), one for every two consecutive layers. The weight \( w^t_{i,j}\) in row \( i\) and column \( j\) of the matrix \( {\bf W^t}\) is the weight on the edge from \( v^t_j\), vertex \( j\) in layer \( V_t\), to \( v^{t+1}_i\), which is vertex \( i\) in layer \( V_{t+1}\).

Now, write \( {\bf O^t}\) for the vector of values taken by vertices in layer \( V_t\) when \( {\bf x}\) (a single sample datum) is the input to the network. That is, \({\bf O^t}\) is the forward propagated representation of \({\bf x}\).  For example, \( {\bf O^0} = {\bf x}\).  Using \({\bf W}_1^0\) for the first row of the matrix \({\bf W}^0\), we set \(z_1^1 = {\bf W}_1^0\cdot{\bf x}\), and then \(o^1_1 = a(z_1^1)\) is the first component of the vector \( {\bf O^1}\); the entire vector being given by \({\bf O}^1 = (a(z_1^1), a(z_2^1), \ldots) \).

We will simplify notation by allowing for vector input to the activation function \(a\), meaning that \(a\) is applied to each component of the vector.  Then, the values in layer \(V_t\) of the network would be written \( {\bf O^t} = a({\bf z}^t) = a({\bf W^{t-1}}{\bf O^{t-1}})\). 1
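In code, the forward pass might look like the following sketch (NumPy; the function names are mine, and I assume the sigmoid activation \(a(z) = 1/(1+e^{-kz})\) with \(k = 2\) that is used in the examples later). It records each \({\bf z}^t\) and \({\bf O}^t\), since both are needed for backpropagation.

```python
import numpy as np

def sigmoid(z, k=2.0):
    # The activation a(z) = 1 / (1 + e^{-k z}); the examples use k = 2.
    return 1.0 / (1.0 + np.exp(-k * z))

def forward(x, weights):
    """Forward-propagate x through the network, returning the list of
    pre-activations z^t and the list of layer values O^t (with O^0 = x)."""
    O = [np.asarray(x, dtype=float)]
    zs = []
    for W in weights:            # W^t is a k_{t+1} x k_t matrix
        z = W @ O[-1]            # z^{t+1} = W^t O^t
        zs.append(z)
        O.append(sigmoid(z))     # O^{t+1} = a(z^{t+1})
    return zs, O
```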

The backpropagation computation occurs a layer at a time. Recall that the sample datum \({\bf x}\) has been fixed. Assume that we have computed all its forward propagated representations \({\bf O^t}, t=1,\ldots,L\). By restricting to partial derivatives of \(\ell\) with respect to weights before a given layer, we can think of having a loss function for each layer (except the input layer). That is, for each \(t=0,1,\ldots,L-1\) let \(\ell_{t+1}\) be the loss (on \({\bf x}\)) of the network, whose input variables are the weights \( w^t_{i,j}\), with \(i=1,\ldots,k_{t+1}\) and \(j=1,\ldots,k_t\).

Now, we have that \( \ell_t\) can be understood from \( \ell_{t+1}\) by thinking of function composition. More precisely, \( \ell_t({\bf O^t}) = \ell_{t+1}({\bf O^{t+1}}) \), which means that values of \(\ell_t\) are given by \(\ell_{t+1}(a({\bf W^{t}}{\bf O^{t}})) \), and \({\bf O^t} = a({\bf W^{t-1}}{\bf O^{t-1}})\). For a given \(i\) with \(1\le i\le k_t\), write \(o^t_i\) for component \(i\) of \({\bf O^t}\). Note that \(o^t_i = a(z^t_i) \) and that \( z^t_i = {\bf W}^{t-1}_i{\bf O^{t-1}} \).

By the chain rule, \(\dfrac{\partial\ell_t}{\partial w^{t-1}_{i,j}} = \dfrac{\partial\ell_t}{\partial o^t_i}\dfrac{\partial o^t_i}{\partial z^t_i}\dfrac{\partial z^t_i}{\partial w^{t-1}_{i,j}} = \dfrac{\partial\ell_t}{\partial o^t_i} a'(z^t_i) o^{t-1}_j \). This means that we can quickly get this partial derivative if we have also already calculated \(\dfrac{\partial\ell_t}{\partial o^t_i}\).

Now, say that \(t < L\). Then note that for each \(k = 1,2,\ldots,k_{t+1}\), we have that \(z^{t+1}_k = {\bf W}^t_k{\bf O^t} = \sum_{i'} w^t_{k,i'}o^t_{i'}\). Hence, the partial of \(z^{t+1}_k\) with respect to \(o^t_i\) is \(w^t_{k,i}\). Recall that \(\ell_t({\bf O^t}) = \ell_{t+1}(a({\bf W^{t}}{\bf O^{t}})) \), which gives us

\(\dfrac{\partial\ell_t}{\partial o^t_i} = \sum_{k=1}^{k_{t+1}}\dfrac{\partial\ell_{t+1}}{\partial o^{t+1}_k}\dfrac{\partial o^{t+1}_k}{\partial z^{t+1}_k}\dfrac{\partial z^{t+1}_k}{\partial o^t_i} = \sum_{k=1}^{k_{t+1}}\dfrac{\partial\ell_{t+1}}{\partial o^{t+1}_k}a'(z^{t+1}_k)w^t_{k,i}\).

This reduces the calculation of \(\dfrac{\partial\ell_t}{\partial o^t_i}\) to having calculated \(\dfrac{\partial\ell_{t+1}}{\partial o^{t+1}_k}\) for each \(k\), which was done previously as we backpropagate through the network. In the interest of implementing the calculation in a “vectorized” way: let \(\nabla_{\bf o}\ell_{t+1}\) be the row vector of partials of \(\ell_{t+1}\) with respect to components of \({\bf O}^{t+1}\), let \(\Delta(a'(z^{t+1}))\) be the diagonal matrix with values \(a'(z^{t+1}_k)\) on the diagonal, and let \({\bf W}^t_{\ ,i}\) be column \(i\) of the matrix \({\bf W^t}\) (as a column vector). Then the above equation tells us that \(\dfrac{\partial\ell_t}{\partial o^t_i} = \nabla_{\bf o}\ell_{t+1}\Delta(a'(z^{t+1})){\bf W}^t_{\ ,i}\).
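The recursion above translates directly into code. The following self-contained sketch (NumPy; names are mine) keeps a running vector \(\nabla_{\bf o}\ell_{t+1}\), forms \(\delta^t_i = \frac{\partial\ell}{\partial o^t_i}\,a'(z^t_i)\) componentwise rather than building the diagonal matrix \(\Delta(a'(z^{t+1}))\) explicitly, reads off the weight gradients from the chain rule \(\frac{\partial\ell}{\partial w^{t-1}_{i,j}} = \delta^t_i\, o^{t-1}_j\), and pushes the gradient one layer back via \({\bf W}^{\top}\delta\).

```python
import numpy as np

def sigmoid(z, k=2.0):
    return 1.0 / (1.0 + np.exp(-k * z))

def sigmoid_prime(z, k=2.0):
    s = sigmoid(z, k)
    return k * s * (1.0 - s)          # a'(z) for a(z) = 1/(1 + e^{-kz})

def forward(x, weights):
    # Forward pass, recording each z^t and O^t for use in the backward pass.
    O = [np.asarray(x, dtype=float)]
    zs = []
    for W in weights:
        zs.append(W @ O[-1])
        O.append(sigmoid(zs[-1]))
    return zs, O

def backprop(weights, zs, O, grad_out):
    """Backward pass.  grad_out holds the partial(s) d ell_L / d o^L at the
    output layer.  Returns the list of gradient matrices dW^t, t = 0..L-1."""
    grads = [None] * len(weights)
    grad_o = np.atleast_1d(grad_out)              # d ell / d O^{t+1}, from t = L-1 down
    for t in reversed(range(len(weights))):
        delta = grad_o * sigmoid_prime(zs[t])     # delta_i = (d ell/d o^{t+1}_i) a'(z^{t+1}_i)
        grads[t] = np.outer(delta, O[t])          # d ell/d w^t_{i,j} = delta_i o^t_j
        grad_o = weights[t].T @ delta             # d ell/d O^t, by the recursion above
    return grads
```

A quick way to trust such code is to compare each entry of the returned matrices against a finite-difference approximation of the loss.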

The first computed partial derivative \(\dfrac{\partial\ell_L}{\partial o^L_1}\), from the output layer, will have a known form that depends on the loss function being used. For example, if the activation function in the output layer is \(\sigma\) and the negative log-loss function is being used, then in our current notation we would have \(\ell_L(o^L_1) = -(\overset{\circ}{y}\log(o^L_1) + (1 - \overset{\circ}{y})\log(1 - o^L_1))\). As a result, we would have that

\(\dfrac{\partial\ell_L}{\partial o^L_1} = -\left(\dfrac{\overset{\circ}{y}}{o^L_1} + \dfrac{\overset{\circ}{y}-1}{1-o^L_1} \right) = \dfrac{o^L_1 - \overset{\circ}{y}}{o^L_1(1-o^L_1)}\).
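This closed form is easy to check numerically. A small sketch (NumPy; the function names are mine):

```python
import numpy as np

def log_loss(o, y):
    # Negative log-loss of a single prediction o in (0, 1) with label y.
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def dlog_loss(o, y):
    # The closed form derived above: d ell_L / d o = (o - y) / (o (1 - o)).
    return (o - y) / (o * (1 - o))
```

Comparing `dlog_loss` against a central finite difference of `log_loss` at a few points confirms the algebra.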

These computations were all done on a single data point and label \( ({\bf x}, y)\). In practice you work with a subset of the training set (called a mini-batch in neural networks) and average the partial derivatives obtained for each point in the mini-batch. You then subtract \( \eta\) times that average from the current \( {\bf W}\).

In addition, you often repeat this for a set of mini-batches that cover the entire training set. One complete pass through the training set is referred to as an epoch of training.
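A minimal sketch of one such epoch (NumPy; `grad_fn` is a stand-in I introduce for whatever routine computes the per-sample gradients, such as a backpropagation pass, and its signature is an assumption for illustration):

```python
import numpy as np

def sgd_epoch(weights, data, grad_fn, eta, batch_size, rng):
    """One epoch of mini-batch SGD.  The data set is shuffled and split into
    mini-batches; for each mini-batch, the per-sample gradients returned by
    grad_fn(weights, x, y) -- a list of matrices, one per layer -- are
    averaged, and eta times that average is subtracted from the weights."""
    idx = rng.permutation(len(data))
    for start in range(0, len(idx), batch_size):
        batch = [data[i] for i in idx[start:start + batch_size]]
        total = None
        for x, y in batch:
            g = grad_fn(weights, x, y)
            total = g if total is None else [s + gi for s, gi in zip(total, g)]
        weights = [W - (eta / len(batch)) * G for W, G in zip(weights, total)]
    return weights
```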

There is code below that implements backpropagation, followed by some code that trains a neural network.

Footnote: the loss we are using is the log-loss. It is written here just as it was for single logistic regression, except that the model output \(\sigma({\bf w}\cdot{\bf x})\) is replaced by the network's output \(o^L_1\).

Footnote: the subscript of \( \nabla\) means: evaluate at that point.


A simple training set for binary classification, containing 200 points.

To test the code on a basic example, I begin by generating the labeled data set in the plane shown at the right. This is the training set; it contains 200 points. The red points, like the blue points, are grouped quite closely together. Clearly this set is linearly separable, so one should be able to get high accuracy from a single neuron. However, I would like to try something with a little more structure, so I set up a neural network with layers \( V_0, V_1,\) and \( V_2\), where \( V_0\) and \( V_1\) have 3 vertices each, and the output layer \( V_2\) has a single vertex. As described in the initial discussion on neural networks, the vertices in \( V_0\) take as their values the first and second coordinates of the input point, with the third vertex set to 1. The sigmoid activation function uses \( k=2\).

The initial 12 weights (\( 3\times 3\) between layers \( V_0\) and \( V_1\), and 3 between \( V_1\) and \( V_2\)) are each obtained by sampling from a uniform distribution on the interval \( (-1.5, 1.5)\). I ran the backpropagation, using mini-batches of size 10, for 46 epochs. The learning rate of each epoch was either 1 or 2 (alternating roughly every 10 epochs at first, but remaining 2 for the last 20 epochs). After 46 training epochs, the network had 100% accuracy on the training set (it correctly labeled each training point). The figure on the left above shows how this network labels points near the origin.

I should note that the resulting network (and how it labels points in the plane) depends heavily on the initial weights (it can also be affected somewhat by which points end up in each mini-batch). At a certain point during the training, the weights are very likely to be near a local minimum of the loss, and they tend to stay near that local minimum. However, there are almost certainly many local minima of the loss as a function of the weights, and the particular one you settle near can dramatically affect how many epochs are required to obtain high accuracy. For example, on another training session, using the same training data, and where every other choice made was identical to the first, I obtained a network with 100% accuracy after only 26 epochs. Its prediction on points in the plane is shown on the right above.

In a second example, I will train a neural network on the training data shown to the right. This set is not linearly separable. The intention is that positively labeled points (blue) should be contained in a bounded set around the origin (including points other than the sampled training points). First, note that a neural network approximating this task must have more vertices in its hidden layer than in the input layer. In fact, by viewing the bounded positive region as the set of points that are positively labeled in all of four different logistic regression models, one can explicitly construct a neural network (with weights) that achieves what is desired. Make the hidden layer have 4+1 neurons (one for each of the four logistic regressions for which the central points are positive, and one more whose value is important for the final layer); make one output neuron, with its 5 weights chosen so that it approximates the AND function on the first four neurons.
The weights to achieve this are below, and a network with weights near these will classify points as in the figure to the right.

\( {\bf W^0} = \begin{pmatrix}1 & 0 & 2 \\ 0 & 1 & 2 \\ -1 & 0 & 2 \\ 0 & -1 & 2 \\ 0 & 0 & 1\end{pmatrix}, \qquad {\bf W^1} = \begin{pmatrix}1 & 1 & 1 & 1 & -3\end{pmatrix}.\)
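As a quick sanity check, the hand-set network can be evaluated directly. The sketch below (NumPy; assuming the sigmoid activation with \(k=2\) and the convention that the input vector is \((x_1, x_2, 1)\), as described earlier) confirms that the origin comes out positive while points far from the origin come out negative.

```python
import numpy as np

# The hand-set weights from the text; the input vertex values are (x1, x2, 1).
W0 = np.array([[ 1,  0, 2],
               [ 0,  1, 2],
               [-1,  0, 2],
               [ 0, -1, 2],
               [ 0,  0, 1]], dtype=float)
W1 = np.array([[1, 1, 1, 1, -3]], dtype=float)

def sigmoid(z, k=2.0):
    return 1.0 / (1.0 + np.exp(-k * z))

def predict(x1, x2):
    h = sigmoid(W0 @ np.array([x1, x2, 1.0]))   # hidden-layer values
    return sigmoid(W1 @ h)[0]                   # > 0.5 means positive (blue)
```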

But this is certainly cheating. In real applications you won't know how to set the weights yourself; you will need to use backpropagation to find good weights. Because of the tendency of neural networks to fall into local-minimum wells, the initial values of the weights can have significant influence on the final network. So how can you let the network find the right spot through training?

Roughly speaking, don't always carry the last set of weights from gradient descent into the next epoch. Instead, find the "best" \( {\bf W}\) among those encountered during the epoch, and sometimes use that as the starting \( {\bf W}\) of the next epoch. By "best," I mean the \( {\bf W}\) with minimum training loss. Since during a "best" search you want to bounce around to somewhat distant locations, use a large learning rate.
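One epoch of this keep-the-best strategy might be sketched as follows (NumPy; `grad_fn` and `loss_fn` are stand-in names I introduce for the network's per-sample gradient routine and its training loss, so their signatures are assumptions for illustration):

```python
import numpy as np

def epoch_keeping_best(weights, data, grad_fn, loss_fn, eta, batch_size, rng):
    """One epoch of mini-batch SGD that evaluates the training loss after
    every mini-batch update and returns the best W encountered during the
    epoch rather than the last one."""
    best_W, best_loss = weights, loss_fn(weights, data)
    idx = rng.permutation(len(data))
    for start in range(0, len(idx), batch_size):
        batch = [data[i] for i in idx[start:start + batch_size]]
        grads = [grad_fn(weights, x, y) for (x, y) in batch]
        avg = [sum(gs) / len(batch) for gs in zip(*grads)]
        weights = [W - eta * G for W, G in zip(weights, avg)]
        cur = loss_fn(weights, data)
        if cur < best_loss:                 # keep the lowest-loss W seen
            best_W, best_loss = weights, cur
    return best_W
```

With a large \(\eta\) the updates can overshoot and oscillate, and this routine still hands the next epoch the lowest-loss \( {\bf W}\) it passed through.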

In the example, I began with 1 epoch, with \( \eta = 5\), taking the best \( {\bf W}\) afterward. I followed this with 10 epochs, with \( \eta = 2\), where the last \( {\bf W}\) was selected; then 30 epochs with \( \eta = 1\), where the best was selected (the hope being that the network has made it near a good local minimum, a deep "well", so the gradient might be rather short, but that well could still contain other local minima within it, which I am trying to avoid);
then 5 epochs where I average the \( {\bf W}\) matrices; then 10 epochs, with learning rate \( \eta = 0.5\), taking the last \( {\bf W}\). The resulting network labels points as shown. It mislabels only 1 of the 160 training points. Its approximate weights are displayed below.

\( {\bf W^0} = \begin{pmatrix}-12.1 &-4&11.8 \\ 5.1&9.3&7 \\ -4.8&13.4&-9.1 \\ 12.5&-2.7&-7.6 \\ 10.7&13.5&-12\end{pmatrix}, \qquad {\bf W^1} = \begin{pmatrix}-2.1&5.9&-5.5&-3.2&-4.6\end{pmatrix}.\)

  1. Note: the expression \({\bf W^{t-1}}{\bf O^{t-1}}\) is the result of multiplying a \(k_t\times k_{t-1}\) matrix by a vector with \(k_{t-1}\) components, giving the \(k_t\) component vector \({\bf z}^t\).