SGD for logistic regression

As the title implies, the goal of this post is to describe how to use Stochastic Gradient Descent (SGD) for a logistic regression model. Logistic regression models are central to our next topic, (artificial) neural networks; at least, they are central to networks built from “sigmoid neurons.” With SGD for logistic regression, you can find a hyperplane that does a reasonably good job (not perfect, of course) on the binary classification problem, as long as the data is “close enough” to being linearly separable.


Differentiable loss function for binary classification. When using SGD to fit a model function to data, the empirical risk does not work as a loss function (see the Empirical Risk Minimization section in this post). A differentiable function is needed.1  Such a function, the log-loss function, was introduced in the Loss function section of the logistic regression post. Recall that correctly labeled points, especially those far from the hyperplane, contribute close to zero to the log-loss.

A hyperplane for a logistic regression model is determined by parameters \( \tilde{\bf w}=(w_1,\ldots,w_d,w_{d+1})\) (think of \(w_{d+1}\) as the shift parameter \( b\)). Set \( \tilde{\bf x}\) to be the \((d+1)\)-vector whose first \(d\) components are those of a data point \(\bf x\) and whose last component is the number 1. With these definitions, \(\tilde{\bf w}\cdot\tilde{\bf x} = {\bf w}\cdot{\bf x} + b\), where \({\bf w} = (w_1,\ldots,w_d)\) is the normal vector to the hyperplane.

Recall that \( \sigma_k\) is the sigmoid function \( \sigma_k(t) = (1+e^{-kt})^{-1}\), which has values between 0 and 1.  Given a constant \( k>0\) and parameter vector \(\tilde{\bf w}\), the logistic regression model is defined by \(h({\bf x}) = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x})\). Recall that \(h({\bf x})\) was interpreted as the probability that the label for \( {\bf x}\) is +1.
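As a quick aside, here is what \( \sigma_k\) looks like as code (a NumPy sketch; the function name and the default \( k=1\) are my own choices, not from the earlier posts):

```python
import numpy as np

def sigmoid(t, k=1.0):
    # sigma_k(t) = 1 / (1 + e^{-k t}); values strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-k * t))
```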

Given a training data set \(S = [{\bf x}_1,\ldots,{\bf x}_m]^\top\), where each \({\bf x}_i\) has correct label \(y_i\), recall the log-loss function \( \ell_S\), which measures the loss (error) of \(h\) by:

\(\displaystyle \ell_S(h) = \frac{1}{m}\sum_{i=1}^m \ln\left(1 + e^{-y_ik(\tilde{\bf w}\cdot\tilde{\bf x}_i)}\right)\).

Note that the input to \( \ell_S\) is the function \(h\), not a point \( {\bf x}\). As \(h\) is determined by the parameters \(\tilde{\bf w} = (w_1,\ldots,w_d,w_{d+1})\), we may view \(\ell_S\) as a function of variables \( (w_1,w_2,\ldots,w_d, w_{d+1})\).2  These are the variables with respect to which the gradient needs to be calculated.
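To make this concrete, here is a minimal NumPy sketch of \( \ell_S\) as a function of the parameter vector (the array layout, with one point per row of `X` and the constant 1 appended inside the function, is my own choice):

```python
def log_loss(w_tilde, X, y, k=1.0):
    # ell_S: the average of ln(1 + e^{-y_i k (w~ . x~_i)}) over the m points.
    # X is an (m, d) array with one point per row; y holds the +/-1 labels.
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant 1
    margins = y * k * (X_tilde @ w_tilde)
    return np.mean(np.log1p(np.exp(-margins)))
```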

[Figure: the set \(S\) contains the points (1,5), (3,5), (1.5, 4), (2.5, 4), (2, 2), with labels 1, 1, -1, -1, -1.]
Now, let’s discuss one step in SGD using the log-loss function on a small set \(S\), having five labeled points in \(\mathbb R^2\). A plot of the labeled points appears to the right, with the dashed line corresponding to an initial vector \( \tilde{\bf w}_0 = (0, 1, -3)\).3  Use \( k=1\) for this example.
Recall that, on step \(t\), to update the vector \(\tilde{\bf w}_t\), you subtract \(\eta\) times the gradient of the loss function (in this case, \(\eta\nabla\ell_S\)), evaluated at \(\tilde{\bf w}_t\). After the discussion below, you will be able to check that the gradient \( \nabla \ell_S\), evaluated at \( \tilde{\bf w}_0 = (0, 1, -3)\), is approximately \( (0.60, 1.04, 0.30)\). Using a learning rate \( \eta = 0.2\), one update to the \( \tilde{\bf w}_0\) vector gives a vector corresponding to the solid line:

\(\tilde{\bf w}_1 = \tilde{\bf w}_0 - \eta\nabla\ell_S(\tilde{\bf w}_0) \approx (-0.12, 0.792, -3.06)\).
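If you would like to check these numbers yourself, a finite-difference approximation of the gradient, using the `log_loss` sketch above, suffices; no closed-form gradient is needed yet:

```python
X = np.array([[1, 5], [3, 5], [1.5, 4], [2.5, 4], [2, 2]])
y = np.array([1, 1, -1, -1, -1])
w0 = np.array([0.0, 1.0, -3.0])
eta, eps = 0.2, 1e-6

# central-difference estimate of each partial derivative of ell_S at w0
grad = np.zeros(3)
for j in range(3):
    e = np.zeros(3); e[j] = eps
    grad[j] = (log_loss(w0 + e, X, y) - log_loss(w0 - e, X, y)) / (2 * eps)

print(np.round(grad, 2))  # approximately (0.60, 1.04, 0.30)
print(w0 - eta * grad)    # approximately (-0.12, 0.792, -3.06)
```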


[Figure: the set \(S\) now contains the points (1,5), (3,5), (2, 3.5), (1.5, 4), (2.5, 4), (2, 2), with labels 1, 1, 1, -1, -1, -1.]
If you alter the data set \(S\), say by including an additional positive point at (2, 3.5) as shown on the right, then the function \( \ell_S\) itself is different from before, as there is an additional point over which you sum. For this new \(S\) (and the same initial \( \tilde{\bf w}_0\)), the gradient \( \nabla \ell_S\) evaluated at \(\tilde{\bf w}_0 = (0,1,-3)\) is approximately \( (0.37, 0.65, 0.19)\). Using the same learning rate, the updated vector \(\tilde{\bf w}_1\) corresponds to the solid line on the right. Note that the extra positive point shortened the gradient vector. The two negative points at height 4 are still in the same (wrong) position, on the positive side of the initial line; but if the updated line were to move up past them, the new positive point would land on the wrong side as well, causing an increase in the loss that did not exist before the point was added to \(S\). The update balances these competing pulls, settling on something in between.
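The same finite-difference check, after appending the new positive point, reproduces the shorter gradient:

```python
X2 = np.vstack([X, [2, 3.5]])   # add the positive point (2, 3.5)
y2 = np.append(y, 1)

grad2 = np.zeros(3)
for j in range(3):
    e = np.zeros(3); e[j] = eps
    grad2[j] = (log_loss(w0 + e, X2, y2) - log_loss(w0 - e, X2, y2)) / (2 * eps)
print(np.round(grad2, 2))  # approximately (0.37, 0.65, 0.19)
```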


The gradient of \( \ell_S\).  🐉🐉 (Derivatives, and the chain rule in this section.) Calculating the gradient requires finding each partial derivative \( \frac{\partial \ell_S}{\partial w_j}\), for \( j=1,2,\ldots,d+1\). I’ll begin by computing the partial derivative of just \( \ln\left(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\right)\), where \( \tilde{\bf x}_i\) is one of the points in \(S\) (augmented with 1 in the \(d+1\) spot). Here is that computation:

First, suppose that \(\tilde{\bf x} = (x_1,\ldots,x_d,1)\). Then

\( -k\tilde{\bf w}\cdot\tilde{\bf x} = -k(w_1x_1+\cdots+w_dx_d + w_{d+1})\),

so we have \( \frac{\partial}{\partial w_j}(-k\tilde{\bf w}\cdot\tilde{\bf x}) = -kx_j\), the \( j^{th}\) coordinate of \( \tilde{\bf x}\) times \( -k\).
Now, for \(\tilde{\bf x}_i\in S\), write \( x_{i,j}\) for the \( j^{th}\) coordinate of \(\tilde{\bf x}_i\). By using the chain rule multiple times we have

\( \frac{\partial}{\partial w_j}\left(\ln(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i})\right) = \frac{\frac{\partial}{\partial w_j}(-k\tilde{\bf w}\cdot\tilde{\bf x}_i)e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}}{1 + e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}} = \frac{-kx_{i,j}e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}}{1 + e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}}.\)    (\( \ast\))

Using a bit of algebra gives \( 1 - \sigma_k(A) = \frac{e^{-kA}}{1+e^{-kA}}\), where \( A\) could be replaced by any quantity. So the above computation implies that

\( \frac{\partial}{\partial w_j}\left(\ln(1+e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i})\right) = kx_{i,j}(\sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)-1).\quad (\dagger)\)

Looking at the definition of \( \ell_S\), the label \( y_i\) of \( {\bf x}_i\) is also in play: the exponent is \( -ky_i\tilde{\bf w}\cdot\tilde{\bf x}_i\), not just \( -k\tilde{\bf w}\cdot\tilde{\bf x}_i\). Of course, there is no difference when \( y_i = 1\).

I’ll use some notation that appeared in the post on logistic regression to write \((\dagger)\) concisely, regardless of \( y_i\). Set \( \overset{\circ}{y_i} = (y_i+1)/2\) and note that \( y_i = 1\) makes \( \overset{\circ}{y_i} = 1.\) With that in mind, if \( y_i = 1\) equation \( (\dagger)\) can be restated as

\( \frac{\partial}{\partial w_j}\left(\ln(1+e^{-ky_i\tilde{\bf w}\cdot\tilde{\bf x}_i})\right) = kx_{i,j}(\sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)-\overset{\circ}{y_i}).\)    (\( \ddagger\))

The same equation holds when \( y_i = -1\). To verify this you need a bit more algebra. In particular, notice that by multiplying the numerator and denominator by \( e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}\) you get that

\( \frac{e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i}}{1 + e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i}} = \frac{1}{e^{-k\tilde{\bf w}\cdot\tilde{\bf x}_i}+1} = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i)\).

Since having \( y_i = -1\) essentially “removes negative signs in front of \( k\)” in the computations of (\( \ast\)), this last bit of algebra tells us that

\( \frac{\partial}{\partial w_j}\left(\ln(1 + e^{k\tilde{\bf w}\cdot\tilde{\bf x}_i})\right) = kx_{i,j}\sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i),\)

and that equation says the same thing as (\( \ddagger\)) when \( y_i = -1\) since, in that case, \( \overset{\circ}{y_i} =0\).
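As a sanity check on the algebra, \( (\ddagger)\) can be compared to a numerical derivative at a single point with \( y_i = -1\), reusing the `sigmoid` sketch from above (the specific point, weights, and coordinate \( j\) are arbitrary choices of mine):

```python
x_t = np.array([2.0, 2.0, 1.0])   # the point (2, 2), augmented with a 1
w = np.array([0.0, 1.0, -3.0])
yi, k, j = -1, 1.0, 0

y_circ = (yi + 1) / 2              # = 0 here, since yi = -1
closed_form = k * x_t[j] * (sigmoid(w @ x_t, k) - y_circ)

single_loss = lambda w: np.log1p(np.exp(-k * yi * (w @ x_t)))
e = np.zeros(3); e[j] = 1e-6
numeric = (single_loss(w + e) - single_loss(w - e)) / 2e-6

print(closed_form, numeric)        # the two agree to about 6 decimal places
```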

Using the properties of derivatives (that is, how they work with sums and multiples of functions), we have now computed the \( j^{th}\) component of the gradient:
The \( j^{th}\) component of \( \nabla \ell_S\).

\(\displaystyle \frac{\partial}{\partial w_j}(\ell_S) = \frac{1}{m}\sum_{i=1}^m kx_{i,j}(\sigma_k(\tilde{\bf w}\cdot\tilde{\bf x}_i) - \overset{\circ}{y_i})\)

for any \( j=1,\ldots,d+1\) (when \( j=d+1\), \( x_{i,j} = 1\)). For \( j=1\), this is the first entry in the vector \( \nabla\ell_S\); for \( j=2\) it is the second entry, and so on.
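In code, this boxed formula becomes a single vectorized expression (same array conventions as the earlier sketches):

```python
def grad_log_loss(w_tilde, X, y, k=1.0):
    # j-th entry: (1/m) * sum_i k * x_{i,j} * (sigma_k(w~ . x~_i) - y_circ_i)
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    y_circ = (y + 1) / 2               # map labels from {-1, 1} to {0, 1}
    p = sigmoid(X_tilde @ w_tilde, k)  # predicted probabilities
    return k * X_tilde.T @ (p - y_circ) / len(y)
```

For the five-point set \(S\) used earlier, `grad_log_loss(w0, X, y)` returns approximately (0.60, 1.04, 0.30), matching the finite-difference check.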


Using the gradient in SGD. Having determined how to calculate the gradient vector of \( \ell_S\) for an arbitrary set of points \( S\), how does one carry out SGD for logistic regression on a model \(h({\bf x}) = \sigma_k(\tilde{\bf w}\cdot\tilde{\bf x})\) with some set of training data? Initialize \( \tilde{\bf w} = \tilde{\bf w}_{(0)}\) as some vector (it really can be any vector, though if you have an approximate guess for a good one it is useful to start there). Now, I will use \(S\) to denote the set of all training data, as before. However, note that during SGD we will not be calculating the gradient of \(\ell_S\), but of \(\ell_{S_t}\), where \(S_t\) is a subset of \(S\).4  The formula above still holds, except that you take the sum over points in \(S_t\) and replace \(m\) with the size of \(S_t\).  In the setting of training neural networks, the subset \(S_t\) is called a minibatch.

Decide to go through some number of rounds, say \( T\) rounds. On the \( t^{th}\) round, \( 0\le t < T\), a process selects a subset \( S_t \subset S\) from the available data and \( \tilde{\bf w}_{(t)}\) is updated by5

\( \tilde{\bf w}_{(t+1)} \overset{def}{=} \tilde{\bf w}_{(t)} - \eta\ \nabla\left(\ell_{S_t}\right).\)

If choices that work well have been made along the way, including how many rounds to go through, then the final vector \( \tilde{\bf w}_{(T)}\) contains the coefficients of a hyperplane that give a logistic regression model with close to minimal loss. However, this is not always assured, especially if you extend to some combination of logistic regression models (as will be done in neural networks). Sometimes, because of that uncertainty, a different vector than \( \tilde{\bf w}_{(T)}\) is used: for example, the “best performing” (lowest loss) vector among \( \tilde{\bf w}_{(0)}, \tilde{\bf w}_{(1)}, \ldots, \tilde{\bf w}_{(T)}\) or, alternatively, the average of the \( \tilde{\bf w}\)-vectors.
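Putting the pieces together, here is one possible sketch of the whole procedure. The uniform minibatch sampling, the fixed learning rate, the zero initialization, and returning \( \tilde{\bf w}_{(T)}\) (rather than the best or average vector) are all just one set of choices among those discussed above:

```python
def sgd_logistic(X, y, eta=0.2, T=1000, batch_size=2, k=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1] + 1)   # w~_(0); any starting vector would do
    for t in range(T):
        # select the minibatch S_t, here uniformly at random without replacement
        idx = rng.choice(len(y), size=batch_size, replace=False)
        # the update rule: w~_(t+1) = w~_(t) - eta * grad(ell_{S_t}) at w~_(t)
        w = w - eta * grad_log_loss(w, X[idx], y[idx], k)
    return w                       # w~_(T)
```

For instance, `sgd_logistic(X2, y2, T=5000)` runs SGD on the six-point set from the second example above.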


  1. Also, it is often desirable (though not necessarily possible) to have a convex loss function, as it helps with convergence (helps SGD give “good” answers).
  2. I will keep the value of \( k\) fixed, so it will not change during gradient descent.
  3. Meaning, it corresponds to the line \( w_1x_1+w_2x_2+b=0\), which in this case is \( x_2 – 3 = 0\).
  4. Recall (from the earlier post) that we select a subset of the training set at each step in the SGD procedure.
  5. When computing \( \nabla(\ell_{S_t})\) in this equation, \( \tilde{\bf w}_{(t)}\) should be used for the \( \tilde{\bf w}\) vector appearing in the boxed formula of the previous section.