Most of the data that you want to place into two classes is not linearly separable, and when it isn’t, unfortunately, the Perceptron algorithm for determining a half-space model from the data will not terminate. However, that doesn’t mean we need to throw out the half-space model. We can “enhance” the model in a way that makes it a valuable building block of essential machine learning methods that are common practice.
The enhancement is the logistic regression model, and a bit of pondering about the half-space model leads somewhat naturally to the logistic regression model.
Logistic regression models
Recall how the half-space model function \( h:\mathcal X \to \mathcal Y\) is defined. There is a vector \( {\bf W} = ({\bf w}, b)\) that is found by training on a set of points \( \mathcal X^{train}\). A new data point \( {\bf x}\) is then labeled by setting
\( h({\bf x}) = \text{sign}\left({\bf w}\cdot{\bf x} + b\right)\)
where \( \text{sign}\) of a number is -1 if the number is negative and +1 otherwise. 1 It may feel arbitrary for \(h\) to give the +1 label to points on the hyperplane — why not \(h({\bf x})=-1\)? In all honesty, for classification purposes it makes little or no difference, as the hyperplane has no volume in \(\mathbb R^n\) (under any reasonable probability measure, the probability of drawing an \(\bf x\) which is exactly on the hyperplane equals 0).
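For concreteness, here is a minimal sketch of the half-space model in Python (using NumPy; the parameters \({\bf w}\), \(b\), and the sample point are made up for illustration):

```python
import numpy as np

def half_space(x, w, b):
    """Half-space model: label +1 if w.x + b >= 0, and -1 otherwise."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Made-up parameters and a sample point in R^2
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

print(half_space(x, w, b))  # w.x + b = 2 - 3 + 0.5 = -0.5, so the label is -1
```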
More important than the values of the function \( h\) at points on the hyperplane is the behavior of \( h\) near the hyperplane. Think about \( {\bf x}\) starting on the side where \( h({\bf x})=-1\), moving along an unbroken path across the hyperplane, and ending on the side where \( h({\bf x})=1\). The label changes suddenly as it crosses the hyperplane. In machine learning, a subset in the data space \(\mathcal X\) that separates a region where points are given one label from a region where points are given a different label, like the hyperplane here, is called a decision boundary. To express that values of \(h\) change suddenly at the decision boundary, it is said that there is a “hard” decision boundary. In mathematical language, this is the statement that \(h\) is not continuous.
But it is useful to recognize that data is messy. Let me say that again: Data is messy — there can be poorly made recordings/measurements, glitches in instrumentation, factors that were not accounted for, and anomalies that are not meaningful.2 And so, even if \( h({\bf x}) = 1\), it might be that \( {\bf x}\) is close enough to the decision boundary that, were it not for the messiness, \( {\bf x}\) would have been on the negative side (and so we would have had \( h({\bf x}) = -1\)).
Continuing this train of thought, the farther away a point is from the decision boundary (on the positive side, say) the more confident we could be that it should have a +1 label. Likewise, the farther away it is on the negative side, the more confident we could be that it should have a -1 label. Near the hyperplane, our certainty goes down.
Making our decisions probabilistic
🐉🐉 To make our uncertainty in the decision more precise, we can compose \( z = {\bf w}\cdot{\bf x} + b\) with a continuous function instead of with the sign function. The logistic function (in ML, the sigmoid function),
\( \sigma(z) = \frac{1}{1 + e^{-z}},\)
makes a good choice for this continuous function.3 4 Below is its graph.
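If you want to reproduce the graph yourself, here is a short sketch (using NumPy and Matplotlib; the plotting range and styling are just one reasonable choice):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The logistic (sigmoid) function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 400)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", linewidth=0.5)  # the value at z = 0
plt.xlabel("z")
plt.ylabel("sigma(z)")
plt.title("The logistic (sigmoid) function")
plt.show()
```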
In future posts, there will be some discussion of using \( \sigma(z)\) when making neural networks with logistic regression models as their basic pieces. At a basic level, by composing \(\sigma\) with \({\bf w}\cdot{\bf x}+b\), the output can be thought of as the probability that +1 is the right label. Note that \( \displaystyle\lim_{z\to\infty}\sigma(z) = 1\) and \( \displaystyle\lim_{z\to -\infty}\sigma(z) = 0\). Since the goal is still binary classification, the probability of label -1 is one minus the probability of label +1; in particular, if the label is +1 with probability 0 then the label must be -1.
We often want a bit more flexibility than just using \( \sigma(z)\), so we introduce a parameter \( k>0\) and set \( \sigma_k(z) = \frac{1}{1 + e^{-kz}}\). If \( k\) is chosen to be between 0 and 1, then the function values transition from 0 to 1 more slowly. If \( k\) is larger than 1, then the function values transition from 0 to 1 more quickly.5
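As a quick numerical illustration of the effect of \( k\) (the values of \( k\) below are only sample choices), compare \( \sigma_k\) at a fixed input:

```python
import numpy as np

def sigmoid_k(z, k):
    """Sigmoid with steepness parameter k: 1 / (1 + e^{-k z})."""
    return 1.0 / (1.0 + np.exp(-k * z))

z = 1.0  # a point a fixed distance to the positive side of z = 0
for k in (0.25, 1.0, 4.0):
    print(k, sigmoid_k(z, k))
# k = 0.25 -> about 0.562 (slow transition: still close to 0.5)
# k = 1.0  -> about 0.731
# k = 4.0  -> about 0.982 (fast transition: nearly certain)
```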
For a hyperplane determined by \( ({\bf w}, b)\), call it \( H\), a logistic regression model using that hyperplane is determined by \(k\):
\( h({\bf x}) = \sigma(k({\bf w}\cdot{\bf x} + b)) = \frac{1}{1+e^{-k{\bf w}\cdot{\bf x} - kb}}\).
For each point \( {\bf x}\) in \( \mathbb R^n\), the model function \(h\) gives a probability that \( {\bf x}\) should be labeled +1.
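Putting the pieces together, a sketch of the model function \( h\) for points in \( \mathbb R^2\) might look like the following (the parameters \( {\bf w}\), \( b\), and \( k\) are made up for the example):

```python
import numpy as np

def logistic_model(x, w, b, k=1.0):
    """Probability of label +1 for the point x: sigma(k (w.x + b))."""
    z = k * (np.dot(w, x) + b)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up hyperplane in R^2 and steepness k
w = np.array([1.0, 1.0])
b = -1.0
k = 2.0

for x in [np.array([0.0, 0.0]),   # negative side of the hyperplane
          np.array([0.5, 0.5]),   # exactly on the hyperplane
          np.array([2.0, 2.0])]:  # far on the positive side
    print(x, logistic_model(x, w, b, k))
# Roughly 0.12, 0.5, and 0.998: low, uncertain, and high probability of +1.
```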
A visualization of these probabilities as heights on the graph of \( h\) is pictured to the right, in the case that \( {\bf x}\) is in \( \mathbb R^2\). The points with height 0.5 are those inputs that are on the hyperplane of the model.
Returning to the comment that data is messy, the parameter \( k\) can be useful. Roughly speaking, messier data could be given a “softer” decision boundary (more uncertainty near the middle), and less messy data a harder decision boundary. Because of how the parameter \( k\) affects the shape of the sigmoid function, this suggests that you would want \( k\) closer to 0 for messy data and relatively large for less messy data. 6 ⏚
Loss functions
So far, we have discussed a couple of loss (or error) functions for machine learning models. When looking at linear regression, we used the mean squared error. This is generally a good idea for linear regression problems.
With half-space models, particularly when discussing decision stumps, the empirical risk was used — the proportion of training points that are not labeled correctly. While empirical risk makes sense for classification tasks, it has the disadvantage that it does not fit the more continuous approach that logistic regression takes to binary classification. In addition, it is not differentiable, and I am going to want to use optimization techniques from calculus to determine a logistic regression model from training data. We need a different loss function.
The right one to use is called the (negative) log-loss function. If the set of training data (the sample set) is \(S=[{\bf x}_1,\ldots,{\bf x}_m]^\top\), and \(h({\bf x}) = \sigma_k({\bf w}\cdot{\bf x}+b)\) is a logistic regression model, then define
\(\displaystyle\ell_S(h) = \frac1{m}\sum_{i=1}^m\log\left(1 + e^{-y_ik({\bf w}\cdot{\bf x}_i+b)}\right)\),
where \(y_i\) is the training label for \({\bf x}_i\). 7
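Here is a direct translation of this definition into code (a sketch only; the training points, labels, and parameters below are made up):

```python
import numpy as np

def log_loss(X, y, w, b, k=1.0):
    """Negative log-loss: mean of log(1 + exp(-y_i * k * (w.x_i + b)))."""
    margins = y * k * (X @ w + b)          # the quantities y_i * k * (w.x_i + b)
    return np.mean(np.log(1.0 + np.exp(-margins)))

# Made-up sample S with labels in {-1, +1}
X = np.array([[0.0, 0.0],
              [2.0, 2.0],
              [1.0, -1.0]])
y = np.array([-1.0, 1.0, 1.0])
w = np.array([1.0, 1.0])
b = -1.0

print(log_loss(X, y, w, b))
```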
Exercise. Given some point \({\bf x}_i\) and the parameters of a logistic regression model, define \(a=k({\bf w}\cdot{\bf x}_i+b)\). Verify that if \(y_i = 1\) then \(\log\left(1 + e^{-y_ia}\right) = -\log(h({\bf x}_i))\). (So, this is the term in the sum of \(\ell_S(h)\) that is contributed by \({\bf x}_i\).)
Also, verify that if \(y_i = -1\) then \(\log\left(1 + e^{-y_ia}\right) = -\log(1 - h({\bf x}_i))\).
One consequence of the exercise is a different way to think of the negative log-loss function. Define \(\overset{\circ}{y_i} = \frac{1}{2}(y_i + 1)\). So, if \(y_i=1\) then \(\overset{\circ}{y_i} = 1\) and if \(y_i=-1\) then \(\overset{\circ}{y_i} = 0\). Then, from the exercise, we have
\(\displaystyle\ell_S(h) = -\frac1{m}\sum_{i=1}^m\left( \overset{\circ}{y_i}\log(h({\bf x}_i)) + (1-\overset{\circ}{y_i})\log(1 - h({\bf x}_i)) \right)\).
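As a sanity check, the two expressions for \(\ell_S(h)\) can be compared numerically (the sample data below is again made up, and \(k\) is taken to be 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up sample, labels in {-1, +1}, and model parameters
X = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, -1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, b, k = np.array([1.0, 1.0]), -1.0, 1.0

# First form: mean of log(1 + exp(-y_i * k * (w.x_i + b)))
margins = y * k * (X @ w + b)
loss_1 = np.mean(np.log(1.0 + np.exp(-margins)))

# Second form: using y_circ in {0, 1} and the model probabilities h(x_i)
y_circ = (y + 1) / 2
h = sigmoid(k * (X @ w + b))
loss_2 = -np.mean(y_circ * np.log(h) + (1 - y_circ) * np.log(1 - h))

print(loss_1, loss_2)  # the two values agree (up to floating point error)
```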
Another consequence of the exercise is that if the value of \(h({\bf x}_i)\) is very close to \(\overset{\circ}{y_i}\) (so the model is both very certain and correct), then \({\bf x}_i\) contributes very little to the loss function \(\ell_S(h)\): it contributes close to \(-\log(1) = 0\). In complementary fashion, when the model is very certain but wrong (i.e. \(h({\bf x}_i)\) is close to 1 when \(\overset{\circ}{y_i}=0\), or close to 0 when \(\overset{\circ}{y_i}=1\)), then it contributes a large amount to the loss.
With this observation, it is appropriate to think of the log-loss function as a differentiable analogue of the empirical risk. Its derivatives, with respect to the parameters \({\bf w} = (w_1,\ldots,w_n)\) and \(b\), will be used to find values of the parameters that (approximately) minimize the log-loss function. The logistic regression model determined by the parameters found this way is the model function that we “train” using the set \(S\).
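To make that last step concrete, here is one possible sketch of training by plain gradient descent on the log-loss (a minimal illustration rather than a recommended implementation; the data, learning rate, and number of steps are made up, and \( k\) is fixed at 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=1000):
    """Gradient descent on the log-loss, with k fixed at 1."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        z = X @ w + b
        # d/dw and d/db of (1/m) sum_i log(1 + exp(-y_i z_i))
        coeff = -y * sigmoid(-y * z)              # one coefficient per sample
        grad_w = (coeff[:, None] * X).mean(axis=0)
        grad_b = coeff.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, roughly separable training data with labels in {-1, +1}
X = np.array([[0.0, 0.2], [0.3, 0.1], [1.8, 2.1], [2.2, 1.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w, b = train_logistic(X, y)
print(w, b)
print(sigmoid(X @ w + b))  # probabilities of +1: low for the first two points, high for the last two
```

The gradient used in the loop is just the derivative of \(\log\left(1 + e^{-y_i({\bf w}\cdot{\bf x}_i+b)}\right)\) with respect to \({\bf w}\) and \(b\), averaged over the sample.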
- I made the convention of defining \(h({\bf x}) = 1\) if \(\bf x\) is on the hyperplane (where \({\bf w}\cdot{\bf x}+b = 0\)), so that \(h\) would be defined on all possible vectors. ↩
- Also, there is often some amount of inescapable noise. The world may not be made of deterministic functions (as nice as it is to work with such); it may be that real-world outputs are governed by some underlying probability distribution. ↩
- This function has long been used for logistic regression in the field of statistics. Some “reasonableness” for it comes from an inverse direction: wanting to use MLE, and making some reasonable assumptions on the conditional probability \(p\) of the label of some \({\bf x}_i\), given the model parameters. This leads to \(\log\left(\frac{p}{1-p}\right)\) being linear. ↩
- The logistic function has a long history in applied mathematics. It has appeared prominently in population growth models and, also, in the area of information theory. ↩
- Consider the derivative and how it looks near 0. ↩
- Interpreting the use of \( k\) in this way was an idea that I first saw made by a friend of mine, Eli Grigsby. ↩
- In other sources, you may see the log-loss function defined in a seemingly different way. However, considering the Exercise appearing after this definition, it agrees with these other sources. ↩