Linear regression

In previous posts, we discussed binary classification by using a basic model — the half-space model (or its probabilistic counterpart, logistic regression). In this post we will discuss the most basic type of regression — simple linear regression.


Linear Regression. 1

The goal of simple (one-variable) linear regression is the following. Given a set of points \((x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\) in the \(xy\)-plane that are “near” a line (one thinks), you want a procedure to produce a line that best fits the points, in the sense that the mean squared error (MSE) is minimized. I’ll explain mean squared error below. This best fit line is called the Least Squares Regression (LSR) line.

Points “near” a line

As a running example, consider the points plotted in the image to the right. There are 50 of them. Think of these as the points \((x_i,y_i)\) where \(i\) is between 1 and \(m\) (so \(m=50\)). How do we find the LSR line that fits best?

First, I will describe a procedure that works. Then I’ll say in which sense a line is called the best fit, after which I will explain why the procedure does result in the line of best fit. This will lead to being able to generalize the ideas (🐉🐉), which I discuss at the end. Before that, I put this in the context of ML and getting a model \(h:\mathcal X \to\mathcal Y\) to predict labels for instances.


Finding the LSR line.

(🐉 A bit of linear algebra knowledge is needed to understand the procedure.) Using the 50 point example as a reference, if the points were honestly on a line with some slope \(w\) and some intercept \(b\), then the equation \(y_i = wx_i + b\) would hold for all 50 points \((x_i,y_i)\). For example, suppose that \((x_1,y_1) = (0.01, 0.6)\). Then, in this case (the points all on \(y=wx+b\)), it would have to be true that \(0.6 = 0.01w + b\). You would have such an equation for each point, and these 50 equations being true would say that every point is on the line with slope \(w\) and intercept \(b\). In the running example, it is visually clear that there is no such line.2

You can rephrase the statement about 50 equations being true as one matrix equation. Define a \(50\times 2\) matrix \(A\) and a vector \({\bf y}\) as follows:

\(A = \begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{50} \end{bmatrix}\);         \({\bf y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_{50}\end{bmatrix}\)

(In general, however many points you have, that is the number of rows of \(A\) and of \({\bf y}\)). Then the 50 equations can be written as one matrix equation (where you would solve for \(b\) and \(w\)):

\(A\begin{bmatrix}b \\ w\end{bmatrix} = {\bf y}\).

As I said, a solution for this equation does not exist (not for our running example), since no line contains all the points. So, instead of trying to solve the equation, you replace \(\bf y\) with a vector \({\bf \hat{y}}\) that does allow for a solution, and you pick \({\bf \hat y}\) as close as possible to \({\bf y}\). In linear algebra terms, the vector \({\bf y}\in\mathbb R^{50}\) is not in the column space of \(A\). So we replace \(\bf y\) with the vector in the column space that is closest to it — that vector is \({\bf \hat y}\).

To do so, calculate the \(2\times 2\) matrix \(A^{\top}A\) and solve the equation3

\(A^{\top}A\begin{bmatrix}b \\ w\end{bmatrix} = A^{\top}{\bf y}\).         (1)

LSR line with data points; slope = 0.686; intercept = 1.124.

This equation always has a solution. Call it \(\begin{bmatrix}\hat b \\ \hat w\end{bmatrix}\). The 50-component vector you get by multiplying this solution by \(A\) will be the vector \({\bf \hat y}\),

\(A\begin{bmatrix}\hat b \\ \hat w\end{bmatrix} = {\bf \hat y}\).

The LSR line has slope and intercept given by that solution: \(\hat y = \hat w x + \hat b\). In the running example, the solution comes out as \(\hat b \approx 1.124\) and \(\hat w \approx 0.686\). This LSR line is plotted in red.
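
To make the procedure concrete, here is a minimal NumPy sketch of the computation. I don’t have the 50 points of the running example here, so the arrays x and y below are synthetic stand-ins; the printed slope and intercept will not match the values above.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)                 # stand-in x-coordinates
y = 0.7 * x + 1.1 + rng.normal(0, 0.5, 50)     # points "near" a line

A = np.column_stack([np.ones_like(x), x])      # 50 x 2 matrix with rows (1, x_i)
# Solve the normal equation A^T A [b, w]^T = A^T y for the intercept and slope
b_hat, w_hat = np.linalg.solve(A.T @ A, A.T @ y)
y_hat = A @ np.array([b_hat, w_hat])           # the vector of fitted values
print(b_hat, w_hat)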


Mean Squared Error (MSE).

Given the set of data points and any line (written as \(\bar y = \bar w x + \bar b\)), the MSE for that line is the average of the squared errors between the line’s predictions \(\bar y_i=\bar wx_i+\bar b\) and the observed values \(y_i\). For example, if the number of data points is 50, then the MSE equals

\(\frac{1}{50}\sum_{i=1}^{50} (\bar y_i - y_i)^2\).

A line is said to be the best fit to a set of data points if, among all lines, it has the minimum MSE. The line given by the procedure above, with slope \(\hat w\) and intercept \(\hat b\), is the one with minimum MSE. This perspective, that an MSE can be calculated for any arbitrary line4 in the plane and that the LSR line minimizes it, will come up multiple times. As a “first example,” it follows the pattern of using the data to select a particular predictor (the particular LSR line) within a class of possible predictors (all non-vertical lines).
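
In code, the MSE of an arbitrary (non-vertical) line is a one-liner. Here is a small sketch; the function name and arguments are my own.

import numpy as np

def mse(w_bar, b_bar, x, y):
  # Average of the squared errors of the line y = w_bar*x + b_bar on the data
  return np.mean((w_bar * x + b_bar - y) ** 2)

# The LSR line minimizes this quantity over all choices of w_bar and b_bar.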

The reason why the LSR line minimizes the MSE comes up in the next section, where I will also explain why the solution \((\hat b, \hat w)\) always exists.


Explanation of the algorithm.

(🐉🐉 Heavy on the linear algebra.)

The matrix \(A\) that was defined above had 50 rows and 2 columns. To speak a little more generally, let’s say the number of rows (data points) is \(m\) so that \(A\) is an \(m\times 2\) matrix. The set of vectors \({\bf b}\) for which

\(A{\bf x} = {\bf b}\)         (2)

has a solution5 is called the column space \(\mathsf C(A)\) of the matrix (\({\bf b}\) is a vector in \(\mathbb R^m\)). In the setup of the problem, \({\bf y}\) is not in the column space, but it is in \(\mathbb R^m\) (see the schematic picture below).

The projection of a vector to the column space of a matrix.

There is another set of vectors in \(\mathbb R^m\) corresponding to the matrix — the null space of \(A^{\top},\ \mathsf N(A^{\top})\). These are the vectors \({\bf x}\) such that \(A^{\top}{\bf x}\) is the zero vector. It is an important fact that the vectors in \(\mathsf N(A^{\top})\) are exactly those vectors that are orthogonal to the ones in \(\mathsf C(A)\).6 A consequence of this fact is that every vector \({\bf y}\) in \(\mathbb R^m\) can be written uniquely as a vector in \(\mathsf N(A^{\top})\) plus a vector in \(\mathsf C(A)\). So there is a \({\bf p}\) in \(\mathsf C(A)\) and a \({\bf q}\) in \(\mathsf N(A^{\top})\) so that

\({\bf y} = {\bf p} + {\bf q}\)    or    \({\bf p} = {\bf y} - {\bf q}\).

Since \({\bf p}\) is in \(\mathsf C(A)\), there is a vector \({\bf \hat x}\) so that \(A{\bf \hat x} = {\bf p} = {\bf y} - {\bf q}\). Furthermore, since \({\bf q}\) is in \(\mathsf N(A^{\top})\), I can multiply both sides by \(A^{\top}\) (which sends \({\bf q}\) to the zero vector) to get

\(A^{\top}A{\bf \hat x} = A^{\top}{\bf y}\).

IOW, there is a solution \({\bf \hat x}\) to the equation (1). Recall that this was the vector with coordinates that gave the intercept and slope of the LSR line, \({\bf \hat x} = (\hat b, \hat w)\).

Thinking about how \({\bf p}\) relates to \({\bf y}\) and \(\mathsf C(A)\) tells us why the LSR line minimizes the MSE. First of all, \({\bf q}\) is orthogonal to every vector in \(\mathsf C(A)\), since it is in \(\mathsf N(A^{\top})\). That means \({\bf p} = \text{proj}_{\mathsf C(A)}({\bf y})\); it is the orthogonal projection of \(\bf y\) to \(\mathsf C(A)\). And so, it is the closest point in \(\mathsf C(A)\) to \({\bf y}\). Finally, the MSE of any line \(\bar y = \bar wx+\bar b\) equals \(1/m\) times the squared distance from \(\bf \bar y\) to \(\bf y\), where \({\bf \bar y}\) is the vector of that line’s predictions \(\bar y_i = \bar w x_i + \bar b\):

\(\frac{1}{m}\sum_{i=1}^m (\bar y_i - y_i)^2 = \frac{1}{m}|{\bf \bar y} - {\bf y}|^2\).

In particular, this is true for the projection \({\bf\hat y} = A{\bf \hat x} = {\bf p}\), and its distance to \(\bf y\) is smallest among vectors in \(\mathsf C(A)\). Therefore, the line \(\hat y = \hat w x+\hat b\) has the smallest MSE among possible lines.
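
As a numerical sanity check of the argument (a sketch on synthetic stand-in data of my own, not part of the derivation), the residual \({\bf y} - A{\bf \hat x}\) should be orthogonal to the columns of \(A\), i.e. lie in \(\mathsf N(A^{\top})\), and the normal-equation solution should agree with a least-squares solver.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 0.7 * x + 1.1 + rng.normal(0, 0.5, size=50)
A = np.column_stack([np.ones_like(x), x])

x_hat = np.linalg.solve(A.T @ A, A.T @ y)   # solves equation (1)
residual = y - A @ x_hat                    # this is the vector q

print(A.T @ residual)                       # approximately [0, 0]
print(x_hat, np.linalg.lstsq(A, y, rcond=None)[0])   # the two should agree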


Relation to Machine Learning.

Consider the \(m\) points we started with as a set of training instances (training data), with labels. The data in this case are \(\mathcal X^{train} = [x_1, x_2, \ldots, x_m]^{\top}\) and the labels are \(\mathcal Y^{train} = (y_1,y_2,\ldots,y_m)\). Both instances and labels in this case are in \(\mathbb R\), so the data matrix has a single column. As an example, the data could be the price of oil per barrel, recorded at some times, and you have the average airfare (of a particular flight path) at those times — your \(y_i\)’s in \(\mathcal Y^{train}\). You want to model a (rough) linear relationship for how airfare changes according to oil prices.

In notation we have used, \(\mathcal X\) would be the set of all possible oil prices, having \(\mathcal X^{train}\) as the ordered subset to which you have access. Likewise, \(\mathcal Y\) would be the corresponding airfares — each one being tied to an oil price in \(\mathcal X\) — and \(\mathcal Y^{train}\) is the list corresponding to \(\mathcal X^{train}\). The idea (hypothesis, maybe) is that there is roughly a linear relationship between \(\mathcal X\) and \(\mathcal Y\). The LSR line produces a function \(h: \mathcal X \to\mathcal Y\) that attempts to predict the label in \(\mathcal Y\) for each point in \(\mathcal X\), the prediction, for this linear model, being the height on the line, \(\hat y = h(x) = \hat wx+\hat b\). The procedure finds the parameters, \(\hat w\) and \(\hat b\), which make the prediction be the best possible for a linear model (in terms of MSE), according to the available data \(\mathcal X^{train}, \mathcal Y^{train}\).

The approach to linear regression given here benefits greatly from linear algebra, in a way that is difficult to duplicate in other (not linear regression) learning settings. In other cases, it is often necessary to apply some kind of iterative procedure in order to improve, step by step, the choice of parameters for the model.
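
For contrast, here is a minimal sketch of the kind of iterative procedure mentioned above: gradient descent on the MSE for simple linear regression. The learning rate and number of steps are arbitrary choices of mine; this is an alternative to, not part of, the closed-form procedure in this post.

import numpy as np

def gradient_descent_line(x, y, lr=0.01, steps=5000):
  # Iteratively adjust (b, w) to reduce the MSE of the line y = w*x + b
  b, w = 0.0, 0.0
  m = len(x)
  for _ in range(steps):
    error = (w * x + b) - y                # predictions minus observations
    grad_b = (2 / m) * np.sum(error)       # partial derivative of the MSE w.r.t. b
    grad_w = (2 / m) * np.sum(error * x)   # partial derivative of the MSE w.r.t. w
    b -= lr * grad_b
    w -= lr * grad_w
  return b, w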


Generalizing simple linear regression.

During the Explanation of the algorithm section, I talked about the matrix \(A\) as having 2 columns (as it does when the instances \(x_i\) are in \(\mathbb R\)). However, the discussion did not require that \(A\) have just 2 columns. It works just as well if \(A\) has more columns, and there is only one solution vector \({\bf \hat x}\) as long as the columns of \(A\) are linearly independent (when the rank of \(A\) equals the number of columns).

What about the model (what kind of regression do we have)? The vector \({\bf \hat x}\) now contains more coefficients of the model function, not just an intercept and a slope.

One interpretation is to have multiple features in the instances. For example, perhaps each \({\bf x_i}=(x_{i,1}, x_{i,2}, x_{i,3})\) is an ordered triple — maybe the first coordinate is the price of oil, the second is the price of nickel, the third is the price of aluminum. You could model the label (the price of airfare, for example) as a linear function \(\hat y_i = {\bf \hat w}\cdot{\bf x_i} + \hat b\) by making a matrix with 4 columns:

\(A = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{m,1} & x_{m,2} & x_{m,3}\end{bmatrix}.\)

The same procedure as before will produce a vector \((\hat b, \hat w_1, \hat w_2, \hat w_3)\) that determines an affine function, which will minimize the MSE among affine functions of the same type. That is, the linear regression procedure gives the coefficients for the function with lowest mean squared error among functions of the type \(y = b + w_1x_1+w_2x_2+w_3x_3\).
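
Here is a sketch of that computation in NumPy. The feature values and airfares below are made-up numbers for illustration only.

import numpy as np

# Each row of X is an instance: (oil price, nickel price, aluminum price); made-up values
X = np.array([[60.0, 7.5, 1.8],
              [62.5, 7.0, 2.1],
              [58.0, 7.8, 1.6],
              [65.0, 6.9, 2.0],
              [61.0, 7.3, 1.7]])
y = np.array([310.0, 322.0, 301.0, 340.0, 315.0])   # made-up airfares

A = np.column_stack([np.ones(len(X)), X])      # m x 4 matrix, first column all 1s
coeffs = np.linalg.solve(A.T @ A, A.T @ y)     # (b_hat, w1_hat, w2_hat, w3_hat)
print(coeffs)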

An alternative way that more columns could arise is polynomial regression. Returning to an instance-label set where both \(\mathcal X\) and \(\mathcal Y\) are in \(\mathbb R\), you can model the relationship with a polynomial. Say that you are trying to fit \(y_i = b_0 + b_1x_i + b_2x_i^2 + \cdots + b_nx_i^n\). Then you can fill out the columns of \(A\) with the powers of \(x_i\) (so row \(i\) of \(A\) looks like \(1, x_i, x_i^2, \ldots, x_i^n\)). The same procedure as above produces coefficients of a degree \(n\) polynomial. This polynomial function can then be the model function \(h\). The MSE is defined identically, and among all polynomials of degree at most \(n\), the polynomial you get minimizes the MSE.
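
A sketch of this in NumPy, using np.vander to fill out the powers of \(x_i\) (the data and the degree below are placeholders of mine):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=40)
y = 1.0 - 0.5 * x + 0.3 * x**3 + rng.normal(0, 0.2, size=40)   # roughly cubic data

n = 3                                       # degree of the polynomial to fit
A = np.vander(x, n + 1, increasing=True)    # row i of A is 1, x_i, x_i^2, x_i^3
coeffs = np.linalg.solve(A.T @ A, A.T @ y)  # (b_0, b_1, b_2, b_3)
print(coeffs)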

It is important to avoid overfitting the data. Think of overfitting, for now, as meaning that you are trying to fit variation in the (training) data too closely. Doing so gives a model that might be very accurate on the training data, but at a price. It can produce quite high error when using the model on future (non-training) data.

In polynomial regression, an extreme case is when \(m\), the number of data points, is only one more than the degree of the polynomial (and none of the \(x\)-coordinates are equal). In this case the polynomial fits the data exactly — each point is on the graph of the polynomial — so the MSE is zero. While that seems a good thing, future data will not be on that graph, with near certainty, and will likely not even be close to it. We’ll return to this when talking about empirical risk and the Bias-Complexity trade-off.
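
To see the extreme case numerically, here is a small sketch on synthetic data of my own: a degree \(m-1\) polynomial through \(m\) points has (essentially) zero MSE on the training points, but typically does much worse on other points drawn from the same underlying relationship.

import numpy as np

rng = np.random.default_rng(2)
m = 8
x_train = np.sort(rng.uniform(-1, 1, size=m))
y_train = x_train + rng.normal(0, 0.1, size=m)    # the underlying relationship is roughly y = x

coeffs = np.polyfit(x_train, y_train, deg=m - 1)  # degree m-1: interpolates the training points
x_new = np.linspace(-1, 1, 100)
y_new = x_new                                     # "future" data from the same relationship
print(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))  # essentially zero
print(np.mean((np.polyval(coeffs, x_new) - y_new) ** 2))      # typically much larger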


 

  1. I would guess that most people reading this post have seen linear regression in some context. Just in case, I will discuss it as if you haven’t heard of it before. Perhaps I will present a fresh perspective that is productive for you.
  2. To make this rigorous it is enough to find two separate pairs of points so that the two slopes you get, one for each line segment connecting points in a pair, are different.
  3. The matrix \(A^{\top}\) is the transpose of matrix \(A\): what you get by flipping rows to columns and vice-versa. The matrix \(A^{\top}A\) will be invertible if the columns of \(A\) are independent.
  4. Except for a vertical line, that is.
  5. Beware! The vector \({\bf x}\) in this equation is completely unrelated to the instances of the modeling problem \(x_1,\ldots,x_m\). I simply used \({\bf x}\) because you want to solve for this vector.
  6. By flipping rows and columns (taking the transpose on a product), saying that \(A^{\top}\begin{bmatrix}x_1 \\ \vdots \\ x_m\end{bmatrix}\) is the zero vector is the same as saying \(\lbrack x_1,\ldots,x_m\rbrack A\) is the zero vector, and this latter product is the row of dot products of \(\bf x\) with the columns of \(A\).

Logistic regression

Most of the data that you want to place into two classes is not linearly separable, and when it isn’t, the Perceptron algorithm for determining a half-space model from the data will, unfortunately, not terminate. However, that doesn’t mean we need to throw out the half-space model. We can “enhance” the model in a way that makes it a valuable building block of machine learning methods that are in common practice.

The enhancement is the logistic regression model, and a bit of pondering about the half-space model leads somewhat naturally to the logistic regression model.

Logistic regression models

Recall how the half-space model function \( h:\mathcal X \to \mathcal Y\) is defined. There is a vector \( {\bf W} = ({\bf w}, b)\) that is found by training on a set of points \( \mathcal X^{train}\). A new data point \( {\bf x}\) is then labeled by setting

\( h({\bf x}) = \text{sign}\left({\bf w}\cdot{\bf x} + b\right)\)

where \( \text{sign}\) of a number is -1 if the number is negative and +1 otherwise. 1  It may feel arbitrary for \(h\) to give +1 label to points on the hyperplane — why not \(h({\bf x})=-1\)? In all honesty, for classification purposes it makes little or no difference as the hyperplane has no volume in \(\mathbb R^n\) (under any reasonable probability measure, the probability of drawing an \(\bf x\) which is exactly on the hyperplane equals 0).

More important than the values of the function \( h\) at points on the hyperplane is the behavior of \( h\) near the hyperplane.  Think about \( {\bf x}\) starting on the side where \( h({\bf x})=-1\), moving along an unbroken path across the hyperplane, and ending on the side where \( h({\bf x})=1\). The label changes suddenly as it crosses the hyperplane. In machine learning, a subset in the data space \(\mathcal X\) that separates a region where points are given one label from a region where points are given a different label, like the hyperplane here, is called a decision boundary. To express that values of \(h\) change suddenly at the decision boundary, it is said that there is a “hard” decision boundary. In mathematical language, this is the statement that \(h\) is not continuous.

But it is useful to recognize that data is messy. Let me say that again: Data is messy — there can be poorly made recordings/measurements, glitches in instrumentation, factors that were not accounted for, and anomalies that are not meaningful.2  And so, even if \( h({\bf x}) = 1\), it might be that \( {\bf x}\) is close enough to the decision boundary that, were it not for the messiness, \( {\bf x}\) would have been on the negative side (and so we would have had \( h({\bf x}) = -1\)).

Continuing this train of thought, the farther away a point is from the decision boundary (on the positive side, say) the more confident we could be that it should have a +1 label. Likewise, the farther away it is to the negative side the more confident that it should have a -1 label. Near the hyperplane, our certainty goes down.

Making our decisions probabilistic

🐉🐉 To make our uncertainty in the decision more precise, we can compose \( z = {\bf w}\cdot{\bf x} + b\) with a continuous function instead of composing it with the sign function. The logistic function (in ML, the sigmoid function),

\( \sigma(z) = \frac{1}{1 + e^{-z}},\)

makes a good choice for this continuous function.3 4   Below is its graph.

Graph of the ‘vanilla’ logistic function

In future posts, there will be some discussion of using \( \sigma(z)\) when building neural networks with logistic regression as their basic pieces. At a basic level, by composing \(\sigma\) with \({\bf w}\cdot{\bf x}+b\), the output can be thought of as the probability that +1 is the right label. Note that \( \displaystyle\lim_{z\to\infty}\sigma(z) = 1\) and \( \displaystyle\lim_{z\to -\infty}\sigma(z) = 0\). Since the goal is still binary classification, if the label is +1 with probability 0 then the label must be -1.

We often want a bit more flexibility than just using \( \sigma(z)\), so we introduce a parameter \( k>0\) and set \( \sigma_k(z) = \frac{1}{1 + e^{-kz}}\). If \( k\) is chosen to be between 0 and 1, then the function values transition from 0 to 1 more slowly. If \( k\) is larger than 1, then the function values transition from 0 to 1 more quickly.5

The logistic function with k = 5.

For a hyperplane determined by \( ({\bf w}, b)\), call it \( H\), a logistic regression model using that hyperplane is determined by \(k\):

\( h({\bf x}) = \sigma(k({\bf w}\cdot{\bf x} + b)) = \frac{1}{1+e^{-k{\bf w}\cdot{\bf x} - kb}}\).

For each point \( {\bf x}\) in \( \mathbb R^n\), the model function \(h\) gives a probability that \( {\bf x}\) should be labeled +1.

Graph of h

A visualization of these probabilities as heights on the graph of \( h\) is pictured to the right, in the case that \( {\bf x}\) is in \( \mathbb R^2\). The points with height 0.5 are those inputs that are on the hyperplane of the model.
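
Here is a minimal sketch of such a model function in code. The particular \({\bf w}\), \(b\), and \(k\) below are arbitrary placeholders of mine, not trained values.

import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

def logistic_model(x, w, b, k=1.0):
  # Probability that x should be labeled +1, for the hyperplane determined by (w, b)
  return sigmoid(k * (np.dot(w, x) + b))

w = np.array([1.0, -2.0])    # placeholder parameters
b = 0.5
print(logistic_model(np.array([0.3, 0.4]), w, b, k=5.0))   # 0.5: this point is on the hyperplane
print(logistic_model(np.array([3.0, -1.0]), w, b, k=5.0))  # close to 1: far on the positive side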

Returning to the comment that data is messy, the parameter \( k\) can be useful. Roughly speaking, messier data could be given a “softer” decision boundary (more uncertainty near the middle), and less messy data a harder one. Because of how the parameter \( k\) affects the shape of the sigmoid function, this suggests that you would want \( k\) closer to 0 for messy data and relatively large for less messy data. 6


Loss functions

So far, we have discussed a couple of loss (or error) functions for machine learning models. When looking at linear regression, we used the mean squared error. This is generally a good idea for linear regression problems.

With half-space models, particularly when discussing decision stumps, the empirical risk was used — the proportion of training points that are not labeled correctly. While empirical risk makes sense for classification tasks, it has the disadvantage that it does not fit the more continuous approach that logistic regression takes to binary classification. In addition, it is not differentiable, and I am going to want to use optimization techniques from calculus to determine a logistic regression model from training data. We need a different loss function.

The right one to use is called the (negative) log-loss function. If the set of training data (the sample set) is \(S=[{\bf x}_1,\ldots,{\bf x}_m]^\top\), and \(h({\bf x}) = \sigma_k({\bf w}\cdot{\bf x}+b)\) is a logistic regression model, then define

\(\displaystyle\ell_S(h) = \frac1{m}\sum_{i=1}^m\log\left(1 + e^{-y_ik({\bf w}\cdot{\bf x}_i+b)}\right)\),

where \(y_i\) is the training label for \({\bf x}_i\).  7

Exercise. Given some point \({\bf x}_i\) and the parameters of a logistic regression model, define \(a=k({\bf w}\cdot{\bf x}_i+b)\). Verify that if \(y_i = 1\) then \(\log\left(1 + e^{-y_ia}\right) = -\log(h({\bf x}_i))\). (So, this is the term in the sum of \(\ell_S(h)\) that is contributed by \({\bf x}_i\).)
Also, verify that if \(y_i = -1\) then \(\log\left(1 + e^{-y_ia}\right) = -\log(1 - h({\bf x}_i))\).

One consequence of the exercise is a different way to think of the negative log-loss function. Define \(\overset{\circ}{y_i} = \frac{1}{2}(y_i + 1)\). So, if \(y_i=1\) then \(\overset{\circ}{y_i} = 1\) and if \(y_i=-1\) then \(\overset{\circ}{y_i} = 0\). Then, from the exercise, we have

\(\displaystyle\ell_S(h) = -\frac1{m}\sum_{i=1}^m\left( \overset{\circ}{y_i}\log(h({\bf x}_i)) + (1-\overset{\circ}{y_i})\log(1 - h({\bf x}_i)) \right)\).

Another consequence of the exercise is that if the value of \(h({\bf x}_i)\) is very close to \(\overset{\circ}{y_i}\) (so the model is both very certain and correct), then \({\bf x}_i\) contributes very little to the loss function \(\ell_S(h)\): close to \(-\log(1) = 0\). In complementary fashion, when the model is very certain but wrong (i.e. \(h({\bf x_i})\) is close to 1 when \(\overset{\circ}{y_i}=0\), or close to 0 when \(\overset{\circ}{y_i}=1\)), then it contributes a large amount to the loss.
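
Here is a small sketch of the log-loss in the \(\pm 1\) label convention used above; the data and parameters are placeholders of mine. The printed per-point terms illustrate the observation: confident and correct contributes almost nothing, confident and wrong contributes a lot.

import numpy as np

def log_loss_terms(W, b, k, X, y):
  # One term per training point: log(1 + exp(-y_i * k * (w . x_i + b)))
  margins = y * (k * (X @ W + b))
  return np.log(1.0 + np.exp(-margins))

X = np.array([[2.0, 1.0],      # far on the positive side of the hyperplane
              [-2.0, -1.0]])   # far on the negative side
y = np.array([1, -1])          # correct labels for these two points
W = np.array([1.0, 1.0]); b = 0.0; k = 2.0

print(log_loss_terms(W, b, k, X, y))    # both terms small (about 0.0025)
print(log_loss_terms(W, b, k, X, -y))   # labels flipped: both terms large (about 6)
# The loss l_S(h) itself is the mean of these terms.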

With this observation, it is appropriate to think of the log-loss function as a differentiable analogue to the empirical risk. Its derivatives, with respect to the parameters \({\bf w} = (w_1,\ldots,w_n)\) and \(b\), will be used to find values of the parameters that (approximately) minimize the log-loss function. The logistic regression model determined by these found parameters is the model function that we “train” by using the set \(S\).


  1. I made the convention of defining \(h({\bf x}) = 1\) if \(\bf x\) is on the hyperplane (where \({\bf w}\cdot{\bf x}+b = 0\)), so that \(h\) would be defined on all possible vectors.
  2. Also, there is often some amount of inescapable noise. The world may not be made of deterministic functions (as nice as it is to work with such); it may be that real world outputs must have an underlying probability that govern them.
  3. This function has long been used for logistic regression in the field of statistics. Some “reasonableness” for it comes from an inverse direction: wanting to use MLE, and making some reasonable assumptions on the conditional probability \(p\) of observing some \({\bf x}_i\), given model parameters. This leads to \(\log\left(\frac{p}{1-p}\right)\) being linear.
  4. The logistic function has a long history in applied mathematics. It has appeared prominently in population growth models and, also, in the area of information theory.
  5. Consider the derivative and how it looks near 0.
  6. Interpreting the use of \( k\) in this way was an idea that I first saw made by a friend of mine, Eli Grigsby.
  7. In other sources, you may see the log-loss function defined in a seemingly different way. However, considering the Exercise appearing after this definition, it agrees with these other sources.

The Iris Dataset

This post is focused on a well-known dataset — the Iris flower dataset — which came to be commonly used (as a typical test case of new statistical classification techniques) after R.A. Fisher1 discussed it in a 1936 paper2. The set contains 150 data points, each one corresponding to an individual Iris flower from one of three Iris flower species: Iris setosa, Iris virginica, and Iris versicolor. Each species makes up an equal portion of the dataset (there are 50 points for each species) and each point in the dataset has four coordinates, one per measurement: sepal length, sepal width, petal length, and petal width, all in centimeters.

Iris setosa, virginica, versicolor. Images by G. Robertson, E. Hunt, Radomil. © CC BY-SA 3.0

Let’s use the Perceptron algorithm to find a hyperplane that separates the Iris setosa species from the other two species — the setosa species is, in this data set, linearly separable from the others. Those points that are from an Iris setosa plant will be given label \(y_i = 1\), and the others will be given label \(y_i = -1\).

In this process, I will also demonstrate using a training set \(\mathcal X^{train}\) to determine the model \(h:\mathcal X \to\mathcal Y\). The goal is for the model to have low generalization error, meaning that with high likelihood it will correctly label data points that it did not see during training. For this reason, I don’t use all the available data. Instead, I sample a subset, use that to train, and hold the rest of the points in reserve. After obtaining the model, I can then check the model’s performance on the reserve points.

After choosing \(\mathcal X^{train}\), I implement the Perceptron algorithm in Python on \(\mathcal X^{train}\). The primary choice for implementing the algorithm concerns how to find a “mistake” — a vector \({\bf X_i}\) where \(y_i {\bf W^{(t)}}\cdot{\bf X_i} \le 0\). (Once there are no “mistakes,” the algorithm ends.) The data in \(\mathcal X^{train}\) will come in a particular order. My choice for finding mistakes: cycle through \(\mathcal X^{train}\) in the given order; if I am at the point \({\bf X_i}\) in the list and it is a mistake, update the \({\bf W}\) vector: \({\bf W}^{(t+1)} = {\bf W}^{(t)} + y_i {\bf X_i}\). Then, continue cycling through \(\mathcal X^{train}\) from the current position. Once the procedure makes a full cycle through \(\mathcal X^{train}\), finding no mistakes, it stops.


Code (with comments) to implement the algorithm. The list of data is called x; the list of corresponding labels is called y. To account for the possibility that the input is not linearly separable, it contains a cut-off maxr, which is the maximum number of runs through the whole set that is allowed. The default value is 15000, but the user can choose a different value.

The package NumPy is used here to deal with vectors. I created a shorthand, np, to call something from NumPy (a rather common shorthand).

import numpy as np

def perceptron(x, y, maxr = 15000):
  # Initialize W as an array of zeros, with length = one plus length of first vector in x
  W = np.array([0]*(len(x[0])+1))
  # Make bigx vectors: 1 appended at end
  bigx = np.array([np.append(x[0],1)])
  for i in range(1,len(x)):
    bigx = np.vstack((bigx, np.append(x[i],1)))
  # counter: total number of times so far through instance list;
  # done: remains 1 to end of while loop only if all the products are > 0
  # T: number of updates to W
  counter, done, T = 0, 0, 0
  # while loop ends if counter reaches maxr
  while (all([done == 0, counter <= maxr])):
    counter += 1
    done = 1
    for i in range(len(bigx)):
      if not(y[i] * np.dot(W, bigx[i]) > 0):
        done = 0
        W = W + y[i] * bigx[i]
        T += 1
  print('Finished computation: T = ', T)
  return W

I can then load the Iris dataset within Python by importing load_iris from sklearn.datasets3. This gives a Pandas DataFrame4, which I named df_iris, the first 50 rows of which are points for the species Iris setosa. As a consequence, a list of valid labels would be formed by
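
For concreteness, the loading step could look like the following sketch; it assumes a scikit-learn version where load_iris accepts the as_frame=True option.

import numpy as np
from sklearn.datasets import load_iris

# load_iris returns a Bunch; with as_frame=True its .data attribute is a pandas
# DataFrame of the four measurements, and rows 0-49 are the Iris setosa points.
df_iris = load_iris(as_frame=True).data
print(df_iris.shape)   # (150, 4)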

 label_set = np.array([1]*50 + [-1]*100)

Then, I create a training subset by randomly sampling row indices between 0 and 149, inclusive, and creating a new DataFrame with those rows:

 training_rows = np.random.choice(150, size=60, replace=False)
 Xtrain = df_iris.loc[training_rows,df_iris.columns].reset_index(drop=True)

The above code does random sampling without replacement. (Notice the replace=False option.) Finally, the following will produce the vector \({\bf W} = ({\bf w},b)\) that determines the hyperplane.

W_result = perceptron(Xtrain.values, label_set[training_rows])

The vector W_result will depend on which rows were placed in Xtrain. Regardless, it will likely produce a perfect linear separation of Iris setosa from the other two species. The implementation seems to consistently need only a single digit number of iterations (the value T, which equals the number of times \({\bf W}\) is changed).

To test W_result on the points held in reserve we must check, for each point, whether it is on the positive or negative side of the W_result hyperplane. Then compare that to its label in label_set. Below I provide functions that will carry out this test. The function halfspace_function takes a list, containing W_result only, and a point; it returns +1 or -1 depending on the side of the hyperplane the point resides. The function loss_01 takes as input the name of a model function, the list of parameters that determine the function (in this case, [W_result]), the list of data, and the corresponding labels. It returns the total number of points that were mislabeled by the model function.

So, to test our answer, we call the following. It likely returns 0 (in some cases it may return a small number of mistakes, depending on the training set).

loss_01(halfspace_function, [W_result], df_iris.values, label_set)

def halfspace_function(param_list, x_pt):
  W = param_list[0]
  bigx_pt = np.append(x_pt, 1)
  assert len(W) == len(bigx_pt)
  function_value = np.dot(W, bigx_pt)
  # Choose to have prediction 1 when x_pt is on the hyperplane
  if function_value == 0:
    return 1
  else:
    return int(function_value/abs(function_value))

def loss_01(h, parameters, instance_data, labels):
  # Count the points in instance_data whose label from h disagrees with the given labels
  total_loss = 0
  for i in range(len(instance_data)):
    if (h(parameters, instance_data[i]) != labels[i]):
      total_loss += 1
  return total_loss
  1. Fisher was a statistician and biologist. His contributions are many, including the statistical techniques ANOVA & Maximum Likelihood Estimation.
  2. The use of multiple measurements in taxonomic problems, by R.A. Fisher. The measurements for the data were taken by the botanist Edgar S. Anderson.
  3. See load_iris documentation here.
  4. See Pandas documentation.

Binary Classification

What kind of problems do we try to solve in machine learning? My attempt to answer this splits it into two types of problems: Classification problems and Regression problems. Roughly speaking, classification means the task of assigning a given data point to one of a (small) number of classes of things. For example, make an algorithm that decides, from an image of a hand-written letter (the data point), which letter of the alphabet is in the image. By regression, I mean the task of finding a meaningful relationship (a correlation, perhaps) between features of the data. Does lifespan increase as lifetime average salary does, for example?

In this post we’ll spend time on classification problems (AKA decision problems), and on the simplest type — binary classification. Often, binary classification problems are phrased in a way where the goal is a yes or no prediction (yes, this data point does fit in the class, or no it does not).

A common example is classifying emails as spam or not spam. To build a spam detector, you must decide how the emails will be represented to the algorithm. It’s likely that you want to use some coordinates that describe the contents of the email. For example, you might count how often words are misspelled, how often a sentence ends with an exclamation point, and how often words are fully capitalized (“ACT NOW!”). These frequencies/coordinates would then represent the email as a vector.

After this, the “learning” part can occur. You are given a training dataset — emails that have already been labeled yes (spam) or no (not spam). You also pick a model to use. The model is determined by some parameters, which you choose initially in some way. The model can then guess the output on the training data, check how wrong it is overall (the loss or error function), and update the parameters in some way that reduces the loss — attempting to minimize the loss function.

🐉 More generally, there is an instance space \( \mathcal X\) of input vectors that you want to make decisions about, 1 and a label space \( \mathcal Y\), the possible labels for an instance from \(\mathcal X\). Use a function to describe this relationship: if \( y\) is the label for instance \(\bf x\), then write \( h({\bf x}) = y\) to mean \( y\) is the label of \(\bf x\). This is a function \( h: \mathcal X \to \mathcal Y\). Our goal is to find such a function, so that \(h({\bf x})\) is correct as much as possible. We need to do so using only information from the training data set (called \( \mathcal X^{train}\); this is a subset of \(\mathcal X\)), and the corresponding labels \( \mathcal Y^{train}\), a list of values from \(\mathcal Y\), each value corresponding to a point in \(\mathcal X^{train}\). In binary classification, we often let \( \mathcal Y = \{-1, 1\}\).

To get a better feel for how this can work, let’s discuss the most basic example, a half-space model. Remember that the data has been represented with coordinates, that is, a list of numbers. Call the number of coordinates \( n\) (each \( {\bf x} \in \mathcal X\) is in \( \mathbb R^n\)). You have to find a way to decide whether the answer is yes or no (whether \( h({\bf x}) = 1\) or \( h({\bf x})=-1\)) based on those coordinates.

Hyperplane with coefficient vector when n=2.

Lines are good at cutting a plane in two parts, right? Their generalization to more than two coordinates, a hyperplane in \( \mathbb R^n\), is equally nice. A hyperplane, call it \( H\), splits \( \mathbb R^n\) into two half-spaces that meet at \( H\). As a bonus, it is easy to tell in which half-space you are.
🐉 To determine \( H\) you need \( n\) coefficients, call them \( w_1,w_2,\ldots,w_n\), and one more number \( b\). The points of \( H\) are exactly those points \( (x_1,x_2,\ldots,x_n)\) where

\( w_{1}x_{1} + w_{2}x_{2} + \ldots + w_{n}x_{n} + b = 0\)

is a true equation. Notice that if you group the coefficients into a vector \( {\bf w} = (w_1,w_2,\ldots,w_n)\) and call \( {\bf x} = (x_1,x_2,\ldots,x_n)\), then the hyperplane is where \( {\bf w}\cdot{\bf x} + b = 0\). Think of \( {\bf w}\) as determining the direction that the hyperplane faces, and \( b\) is a shift away from the point \( (0,0,\ldots,0)\).

🐉 In fact, for an arbitrary point \( {\bf x}\) in \( \mathbb R^n\), whether \( {\bf w}\cdot{\bf x} + b\) is positive or negative determines on which side of the hyperplane \( {\bf x}\) lies. So the half-space model sets \( h({\bf x}) = 1\) if \( {\bf w}\cdot{\bf x} + b > 0\) and \( h({\bf x})=-1\) if \( {\bf w}\cdot{\bf x} + b < 0\).2

Thus the half-space model is determined by a vector \( {\bf w}\) and number \( b\). With luck, you can find a \( {\bf w}\) and \( b\) where the model output matches the known labels on the training data reasonably well. If you are really lucky, then there is some \( {\bf w}\) and \( b\) so that every possible data point gets labeled correctly by the output of the model. When this happens, the data is called linearly separable. 3 If you are even more lucky, then you have an algorithm that will find that \( {\bf w}\) and \( b\) for you.


🐉🐉 Perceptron algorithm. If you are handed a training dataset, what algorithm can you use to find \( {\bf w}\) and \( b\)? The Perceptron algorithm4 is one possibility. Some initial discussion is needed before giving the algorithm.

The expression \( {\bf w}\cdot{\bf x} + b\) can be rewritten as a dot product of two vectors. Let \(\bf W\) be the \((n+1)\)-component vector that agrees with \(\bf w\) in the first \(n\) components, and has \(b\) as its last component. Also, define \(\bf X\) as the \((n+1)\)-vector that agrees with \(\bf x\) in the first \(n\) components and has 1 as last component. Then \( {\bf w\cdot x} + b = {\bf W\cdot X}\).

Also, if \( {\bf X_1}, {\bf X_2},\ldots, {\bf X_m} \in \mathcal X^{train}\) are the training data (\((n+1)\)-vectors, as was just explained), then for each \(\bf X_i\) set its label \( y_i\) equal to \( +1\) or \( -1\) according to the correct label that is given. The goal, thinking the data is linearly separable, is to find a \( {\bf W}\) so that \( {\bf W}\cdot{\bf X_i} > 0\) if \( y_i > 0\) and \( {\bf W}\cdot{\bf X_i} < 0\) if \( y_i < 0\). In other words, to make \( y_i\ {\bf W}\cdot{\bf X_i} > 0\) for all \(i\).

With this in mind, the Perceptron algorithm finds \( {\bf W}\) iteratively. It says to do the following with the Input: \( \mathcal X^{train} = [{\bf X_1}, {\bf X_2},\ldots, {\bf X_m}]^{\top}, \quad\mathcal Y^{train} = [y_1,y_2,\ldots,y_m]\).

⁍ Initialize \( {\bf W}^{(1)}\) to be \( (0,0,\ldots,0)\).

⁍ On step \( t\), if there is an \( {\bf X_i}\) where \( y_i\ {\bf W}^{(t)}\cdot{\bf X_i}\) is not positive (strictly bigger than zero), then set \( {\bf W}^{(t+1)} = {\bf W}^{(t)} + y_i\ {\bf X_i}\), for that \( i\). Otherwise, output \( {\bf W = W^{(t)}}\).

If the data is linearly separable, then the Perceptron algorithm will stop after, at most, a number of iterations that depends on the training set \( {\bf X_1},{\bf X_2}, \ldots, {\bf X_m}\) with its labels \( y_1,y_2, \ldots, y_m\):

Theorem. Define \( R = \max_i |{\bf X_i}|\) and \( B =\min_{\bf V} \{|{\bf V}| : \forall \ i,\ y_i{\bf V}\cdot{\bf X_i} \ge 1\}\). Then the Perceptron algorithm stops after at most \( (RB)^2\) iterations, and when it stops with output \( {\bf W}\), then \( y_i {\bf W}\cdot{\bf X_i} > 0\) for all \( i\).

Idea of the proof: Write \( {\bf W^*}\) for a vector that realizes the minimum \( B\). IOW, \( |{\bf W^*}| = B\) and \( y_i {\bf W^*}\cdot{\bf X_i} \ge 1\) for all \( i\).

Using how \( {\bf W}^{(t+1)}\) is obtained from \( {\bf W}^{(t)}\), you can show that after \( T+1\) iterations, \( {\bf W^*}\cdot{\bf W}^{(T+1)} \ge T\). Also, you can show that \( |{\bf W}^{(T+1)}| \le \sqrt{T} R\) (for this, you use the condition on \({\bf W}^{(T)}\) that necessitates an update). This, with the Cauchy-Schwarz inequality, means that \( \frac{\sqrt{T}}{BR} \le 1\), which can be rearranged to \( T \le (BR)^2.\quad\quad \square\)


Example of a linearly separable dataset. Let’s look at a toy example in \( \mathbb R^2\). Let \( \mathcal X = [(-1,3), (-1,-1), (3,-1), (0,1.5)]^{\top}\) and \( \mathcal Y = [-1,-1,1,1]\).

Before beginning, note that you must make a choice of the order in which you check the dot products on the instance set. Different choices affect the number of iterations the algorithm goes through. It can also affect the particular output \( {\bf W}^{(T)}\). In this example, I will start with \(\bf X_1\) and proceed in order through the points in \(\mathcal X\). After I make an update to \({\bf W}^{(t)}\), I will not restart at the beginning of \(\mathcal X\). Instead, I will continue from the next point in the set, but I will make sure that all points in \(\mathcal X\) have been checked on the final \({\bf W}\).

On step 1 of the Perceptron algorithm (since \( {\bf W}^{(1)} = \vec{\bf 0}\)) you get \( {\bf W}^{(2)} = \vec{\bf 0} + y_1{\bf X_1} = (1,-3,-1)\) if you use \( {\bf X_1}\) first.

On the next step, checking the vector \( {\bf X_2}\), you get \( y_2{\bf W}^{(2)}\cdot {\bf X_2} = -1\). Since this is negative, we set \( {\bf W}^{(3)} = {\bf W}^{(2)} - (-1,-1,1) = (2,-2,-2)\). If you continue this way, checking the dot products in the order mentioned, then on the 10th step you find \( {\bf W}^{(10)} = (4, -0.5, 1)\), which separates the points.

 

The line in \( \mathbb R^2\) determined by \( {\bf W} = {\bf W}^{(10)}\) is \( 4x_1 - 0.5x_2 + 1 = 0\). This line is shown on the left.
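
As a quick check (a small sketch of my own; the arrays just restate the toy data above), every value of \(y_i\,{\bf W}^{(10)}\cdot{\bf X_i}\) is positive:

import numpy as np

X = np.array([[-1, 3], [-1, -1], [3, -1], [0, 1.5]])
y = np.array([-1, -1, 1, 1])
W = np.array([4.0, -0.5, 1.0])                  # the output W^(10)

bigX = np.hstack([X, np.ones((len(X), 1))])     # append 1 to each point
print(y * (bigX @ W))                           # [4.5, 2.5, 13.5, 0.25], all positive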

In the next post, we will look at a famous dataset that has a part that is linearly separable from the rest — the Iris dataset.

 


  1. In the email example, think of \(\mathcal X\) as the space of vectors that are made from an email. For example, if 23 coordinates were made (from frequencies and related things), then the instance space is \(\mathbb R^{23}\).
  2. A choice is made about whether to assign an \(\bf x\) that satisfies \({\bf w}\cdot{\bf x}+b = 0\) a positive or negative label. For the purpose of how well the model will perform, it should not affect performance.
  3. Whether it happens or not can depend on the way you chose to represent the data in coordinates. Side observation: if data is linearly separable (by a hyperplane \( H\)) then the line segment between two points on the positive side of \( H\) and the line segment between two points on the negative side of \( H\) cannot possibly intersect. This provides a criterion to decide that your data is not linearly separable.
  4. Rosenblatt, F. (1958), “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review 65, 386—407. (Reprinted in Neurocomputing, MIT Press, 1988).