Linear regression

In previous posts, we discussed binary classification by using a basic model — the half-space model (or its probabilistic counterpart, logistic regression). In this post we will discuss the most basic type of regression — simple linear regression.


Linear Regression.1

The goal of simple (one-variable) linear regression is the following. Given a set of points \((x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\) in the \(xy\)-plane that are “near” a line (one thinks), you want a procedure to produce a line that best fits the points, in the sense that the mean squared error (MSE) is minimized. I’ll explain mean squared error below. This best fit line is called the Least Squares Regression (LSR) line.

Points “near” a line

As a running example, consider the points plotted in the image to the right. There are 50 of them. Think of these as the points \((x_i,y_i)\) where \(i\) is between 1 and \(m\) (so \(m=50\)). How do we find the LSR line for these points?

First, I will describe a procedure that works. Then I’ll say in which sense a line would be called the best fit, after which I will explain why the procedure does result in the line of best fit. This will lead to being able to generalize the ideas (🐉🐉), which I discuss at the end. Before that, I will put this in the context of ML and of getting a model \(h:\mathcal X \to\mathcal Y\) to predict labels for instances.


Finding the LSR line.

(🐉 A bit of linear algebra knowledge is needed to understand the procedure.) Using the 50 point example as a reference, if the points were honestly on a line with some slope \(w\) and some intercept \(b\), then the equation \(y_i = wx_i + b\) would hold for all 50 points \((x_i,y_i)\). For example, suppose that \((x_1,y_1) = (0.01, 0.6)\). Then, in this case (the points all on \(y=wx+b\)), it would have to be true that \(0.6 = 0.01w + b\). You would have such an equation for each point, and these 50 equations being true would say that every point is on the line with slope \(w\) and intercept \(b\). In the running example, it is visually clear that there is no such line.2

You can rephrase the statement about 50 equations being true as one matrix equation. Define a \(50\times 2\) matrix \(A\) and a vector \({\bf y}\) as follows:

\(A = \begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{50} \end{bmatrix}\);         \({\bf y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_{50}\end{bmatrix}\)

(In general, however many points you have, that is the number of rows of \(A\) and of \({\bf y}\)). Then the 50 equations can be written as one matrix equation (where you would solve for \(b\) and \(w\)):

\(A\begin{bmatrix}b \\ w\end{bmatrix} = {\bf y}\).

As I said, a solution for this equation does not exist (not for our running example), since no line contains all the points. So, instead of trying to solve the equation, you replace \(\bf y\) with a vector \({\bf \hat{y}}\) that does allow for a solution, and you pick \({\bf \hat y}\) as close as possible to \({\bf y}\). In linear algebra terms, the vector \({\bf y}\in\mathbb R^{50}\) is not in the column space of \(A\). So we replace \(\bf y\) with the vector in the column space that is closest to it — that vector is \({\bf \hat y}\).

To do so, calculate the \(2\times 2\) matrix \(A^{\top}A\) and solve the equation3

\(A^{\top}A\begin{bmatrix}b \\ w\end{bmatrix} = A^{\top}{\bf y}\).         (1)

LSR line with data points; slope = 0.686; intercept = 1.124.

This equation always has a solution. Call it \(\begin{bmatrix}\hat b \\ \hat w\end{bmatrix}\). The 50-component vector you get by multiplying this solution by \(A\) will be the vector \({\bf \hat y}\),

\(A\begin{bmatrix}\hat b \\ \hat w\end{bmatrix} = {\bf \hat y}\).

The LSR line has slope and intercept given by that solution: \(\hat y = \hat w x + \hat b\). In the running example, the solution comes out as \(\hat b \approx 1.124\) and \(\hat w \approx 0.686\). This LSR line is plotted in red.
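
If you want to carry out this procedure yourself, here is a minimal sketch in Python with NumPy. The arrays x and y are hypothetical stand-ins for the 50 points (I am not using the actual data behind the plots), so the printed intercept and slope will not match 1.124 and 0.686 exactly.

```python
import numpy as np

# Hypothetical data standing in for the 50 points (x_i, y_i).
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 0.7 * x + 1.1 + rng.normal(scale=0.5, size=50)

# The 50 x 2 matrix A: a column of ones and a column of the x-values.
A = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (1):  A^T A [b, w]^T = A^T y.
b_hat, w_hat = np.linalg.solve(A.T @ A, A.T @ y)
print(b_hat, w_hat)   # intercept and slope of the LSR line
```

In practice one would often call np.linalg.lstsq(A, y, rcond=None) instead, which solves the same least-squares problem in a more numerically stable way, but solving equation (1) directly matches the procedure described above.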


Mean Squared Error (MSE).

Given the set of data points and any line (write it as \(\bar y = \bar w x + \bar b\)), the MSE for that line is the average of the squared errors between the line’s predictions \(\bar y_i=\bar wx_i+\bar b\) and the observed values \(y_i\). For example, if the number of data points is 50, then the MSE equals

\(\frac{1}{50}\sum_{i=1}^{50} (\bar y_i - y_i)^2\).
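
For concreteness, here is a small sketch (in Python with NumPy, with hypothetical variable names) of computing the MSE of an arbitrary line on a set of data points:

```python
import numpy as np

def mse(w_bar, b_bar, x, y):
    """Mean squared error of the line y = w_bar * x + b_bar on the data points (x, y)."""
    y_bar = w_bar * x + b_bar           # the line's predictions
    return np.mean((y_bar - y) ** 2)    # average of the squared errors
```

Evaluating this at the \((\hat w, \hat b)\) found above should give a value no larger than at any other choice of slope and intercept.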

A line is said to be the best fit to a set of data points if, among all lines, it has the minimum MSE. The line given by the procedure above, with slope \(\hat w\) and intercept \(\hat b\), is the one with minimum MSE. This perspective, that an MSE can be calculated for any line4 in the plane and that the LSR line minimizes it, will be brought up multiple times. As a “first example,” we will see how it follows the general pattern of using the data to select a particular predictor (the particular LSR line) from a class of possible predictors (all non-vertical lines).

The reason why the LSR line minimizes the MSE comes up in the next section, where I will also explain why the solution \((\hat b, \hat w)\) always exists.


Explanation of the algorithm.

(🐉🐉 Heavy on the linear algebra.)

The matrix \(A\) that was defined above had 50 rows and 2 columns. To speak a little more generally, let’s say the number of rows (data points) is \(m\) so that \(A\) is an \(m\times 2\) matrix. The set of vectors \({\bf b}\) for which

\(A{\bf x} = {\bf b}\)         (2)

has a solution5 is called the column space \(\mathsf C(A)\) of the matrix (\({\bf b}\) is a vector in \(\mathbb R^m\)). In the setup of the problem, \({\bf y}\) is not in the column space, but it is in \(\mathbb R^m\) (see the schematic picture below).

The projection of a vector to the column space of a matrix.

There is another set of vectors in \(\mathbb R^m\) corresponding to the matrix — the null space of \(A^{\top},\ \mathsf N(A^{\top})\). These are the vectors \({\bf x}\) such that \(A^{\top}{\bf x}\) is the zero vector. It is an important fact that the vectors in \(\mathsf N(A^{\top})\) are exactly those vectors that are orthogonal to the ones in \(\mathsf C(A)\).6 A consequence of this fact is that every vector \({\bf y}\) in \(\mathbb R^m\) can be written uniquely as a vector in \(\mathsf N(A^{\top})\) plus a vector in \(\mathsf C(A)\). So there is a \({\bf p}\) in \(\mathsf C(A)\) and a \({\bf q}\) in \(\mathsf N(A^{\top})\) so that

\({\bf y} = {\bf p} + {\bf q}\)    or    \({\bf p} = {\bf y} - {\bf q}\).

Since \({\bf p}\) is in \(\mathsf C(A)\), there is a vector \({\bf \hat x}\) so that \(A{\bf \hat x} = {\bf p} = {\bf y} - {\bf q}\). Furthermore, since \({\bf q}\) is in \(\mathsf N(A^{\top})\), so that \(A^{\top}{\bf q} = {\bf 0}\), I can multiply by \(A^{\top}\) to get

\(A^{\top}A{\bf \hat x} = A^{\top}{\bf y}\).

In other words, there is a solution \({\bf \hat x}\) to equation (1). Recall that this was the vector with coordinates that gave the intercept and slope of the LSR line, \({\bf \hat x} = (\hat b, \hat w)\).

Thinking about how \({\bf p}\) relates to \({\bf y}\) and \(\mathsf C(A)\) tells us why the LSR line minimizes the MSE. First of all, \({\bf q}\) is orthogonal to every vector in \(\mathsf C(A)\), since it is in \(\mathsf N(A^{\top})\). That means \({\bf p} = \text{proj}_{\mathsf C(A)}({\bf y})\): it is the orthogonal projection of \(\bf y\) to \(\mathsf C(A)\), and so it is the closest point in \(\mathsf C(A)\) to \({\bf y}\). Finally, the MSE of any line \(\bar y = \bar wx+\bar b\) equals \(1/m\) times the squared distance from the vector of predictions \({\bf \bar y} = (\bar y_1, \ldots, \bar y_m)\) to \({\bf y}\):

\(\frac{1}{m}\sum_{i=1}^m (\bar y_i - y_i)^2 = \frac{1}{m}|{\bf \bar y} - {\bf y}|^2\).

In particular, this is true for the projection \({\bf\hat y} = A{\bf \hat x} = {\bf p}\), and its distance to \(\bf y\) is smallest among vectors in \(\mathsf C(A)\). Therefore, the line \(\hat y = \hat w x+\hat b\) has the smallest MSE among possible lines.
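
If you would like to see this orthogonality numerically, here is a sketch reusing the same hypothetical data as in the earlier snippet; the point is that the residual \({\bf y} - {\bf \hat y}\) lies in \(\mathsf N(A^{\top})\), so applying \(A^{\top}\) to it gives (numerically) zero.

```python
import numpy as np

# Same hypothetical data as in the earlier sketch.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 0.7 * x + 1.1 + rng.normal(scale=0.5, size=50)

A = np.column_stack([np.ones_like(x), x])
x_hat = np.linalg.solve(A.T @ A, A.T @ y)   # (b_hat, w_hat)

y_hat = A @ x_hat      # the projection p of y onto C(A)
q = y - y_hat          # the component of y in N(A^T)

print(A.T @ q)         # both entries should be essentially zero
```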


Relation to Machine Learning.

Consider the \(m\) points we started with as a set of training instances (training data), with labels. The data in this case are \(\mathcal X^{train} = [x_1, x_2, \ldots, x_m]^{\top}\) and the labels are \(\mathcal Y^{train} = (y_1,y_2,\ldots,y_m)\). Both instances and labels in this case are in \(\mathbb R\), so the data matrix has a single column. As an example, the data could be the price of oil per barrel, recorded at some times, and you have the average airfare (of a particular flight path) at those times — your \(y_i\)’s in \(\mathcal Y^{train}\). You want to model a (rough) linear relationship for how airfare changes according to oil prices.

In notation we have used, \(\mathcal X\) would be the set of all possible oil prices, having \(\mathcal X^{train}\) as the ordered subset to which you have access. Likewise, \(\mathcal Y\) would be the corresponding airfares — each one being tied to an oil price in \(\mathcal X\) — and \(\mathcal Y^{train}\) is the list corresponding to \(\mathcal X^{train}\). The idea (hypothesis, maybe) is that there is roughly a linear relationship between \(\mathcal X\) and \(\mathcal Y\). The LSR line produces a function \(h: \mathcal X \to\mathcal Y\) that attempts to predict the label in \(\mathcal Y\) for each point in \(\mathcal X\), the prediction, for this linear model, being the height on the line, \(\hat y = h(x) = \hat wx+\hat b\). The procedure finds the parameters, \(\hat w\) and \(\hat b\), which make the prediction be the best possible for a linear model (in terms of MSE), according to the available data \(\mathcal X^{train}, \mathcal Y^{train}\).

The approach to linear regression given here benefits greatly from linear algebra, in a way that is difficult to duplicate in other (not linear regression) learning settings. In other cases, it is often necessary to apply some kind of iterative procedure in order to improve, step by step, the choice of parameters for the model.
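
As a sketch of what such an iterative procedure can look like, here is plain gradient descent on the MSE for the same one-variable problem; the learning rate and the number of steps are arbitrary choices, and the data is again hypothetical.

```python
import numpy as np

# Hypothetical data, as before.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 0.7 * x + 1.1 + rng.normal(scale=0.5, size=50)

w, b = 0.0, 0.0          # initial guesses for slope and intercept
lr = 0.01                # learning rate
for _ in range(10_000):
    err = (w * x + b) - y             # current prediction errors
    w -= lr * 2 * np.mean(err * x)    # partial derivative of the MSE with respect to w
    b -= lr * 2 * np.mean(err)        # partial derivative of the MSE with respect to b

print(b, w)   # should approach the closed-form solution (b_hat, w_hat)
```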


Generalizing simple linear regression.

In the Explanation of the algorithm section, I talked about the matrix \(A\) as having 2 columns (as it does when the instances \(x_i\) are in \(\mathbb R\)). However, the discussion did not require that \(A\) have just 2 columns. It works just as well if \(A\) has more columns, and there is only one solution vector \({\bf \hat x}\) as long as the columns of \(A\) are linearly independent (that is, when the rank of \(A\) equals the number of columns).

What about the model? What kind of regression do we have? The vector \({\bf \hat x}\) now holds more coefficients of the model function, not just an intercept and a slope.

One interpretation is to have multiple features in the instances. For example, perhaps each \({\bf x_i}=(x_{i,1}, x_{i,2}, x_{i,3})\) is an ordered triple — maybe the first coordinate is the price of oil, the second is the price of nickel, the third is the price of aluminum. You could model the label (the price of airfare, for example) as a linear function \(\hat y_i = {\bf \hat w}\cdot{\bf x_i} + \hat b\) by making a matrix with 4 columns:

\(A = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{m,1} & x_{m,2} & x_{m,3}\end{bmatrix}.\)

The same procedure as before will produce a vector \((\hat b, \hat w_1, \hat w_2, \hat w_3)\) that determines an affine function, which will minimize the MSE among affine functions of the same type. That is, the linear regression procedure gives the coefficients for the function with lowest mean squared error among functions of the type \(y = b + w_1x_1+w_2x_2+w_3x_3\).
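
A sketch of this multi-feature case, with a hypothetical \(50\times 3\) array of features standing in for the three prices:

```python
import numpy as np

# Hypothetical data: each row of X holds three feature values (oil, nickel, aluminum prices).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(50, 3))
y = X @ np.array([0.5, -0.2, 0.3]) + 2.0 + rng.normal(scale=0.1, size=50)

# The m x 4 matrix described above: a column of ones followed by the three feature columns.
A = np.column_stack([np.ones(len(X)), X])
b_hat, w1, w2, w3 = np.linalg.solve(A.T @ A, A.T @ y)
```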

An alternative way that more columns could arise is polynomial regression. Returning to an instance-label set where both \(\mathcal X\) and \(\mathcal Y\) are in \(\mathbb R\), you can model the relationship with a polynomial. Say that you are trying to fit \(y_i = b_0 + b_1x_i + b_2x_i^2 + \cdots + b_nx_i^n\). Then you can fill out the columns of \(A\) with the powers of \(x_i\) (so row \(i\) of \(A\) looks like \(1, x_i, x_i^2, \ldots, x_i^n\)). The same procedure as above produces coefficients of a degree \(n\) polynomial. This polynomial function can then be the model function \(h\). The MSE is defined identically, and among all polynomials of degree \(n\), the polynomial you get minimizes the MSE.
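
A sketch of the polynomial case, again on hypothetical one-variable data and with an arbitrary choice of degree:

```python
import numpy as np

# Hypothetical one-variable data, as in the running example.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 0.7 * x + 1.1 + rng.normal(scale=0.5, size=50)

n = 3                                           # degree of the polynomial (arbitrary choice)
A = np.vander(x, N=n + 1, increasing=True)      # row i is (1, x_i, x_i^2, ..., x_i^n)
coeffs = np.linalg.solve(A.T @ A, A.T @ y)      # (b_0, b_1, ..., b_n)
h = np.polynomial.Polynomial(coeffs)            # the model function h; h(x0) evaluates it at x0
```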

It is important to avoid overfitting the data. Think of overfitting, for now, as meaning that you are trying to fit variation in the (training) data too closely. Doing so gives a model that might be very accurate on the training data, but at a price. It can produce quite high error when using the model on future (non-training) data.

In polynomial regression, an extreme case is when \(m\), the number of data points, is only one more than the degree of the polynomial (and none of the \(x\)-coordinates are equal). In this case the polynomial fits the data exactly — each point is on the graph of the polynomial — so the MSE is zero. While that seems a good thing, future data will not be on that graph, with near certainty, and will likely not even be close to it. We’ll return to this when talking about empirical risk and the Bias-Complexity trade-off.
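
Here is a small sketch of that extreme case, with five hypothetical points and a degree 4 polynomial:

```python
import numpy as np

# Five points with distinct x-values and a degree-4 polynomial: the fit is exact.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 0.7, 2.5, 1.9, 3.1])

A = np.vander(x, N=5, increasing=True)    # 5 x 5 and invertible, since the x-values are distinct
coeffs = np.linalg.solve(A, y)            # the polynomial passes through every point
print(np.allclose(A @ coeffs, y))         # True, so the training MSE is zero
```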


 

  1. I would guess that most people reading this post have seen linear regression in some context. Just in case, I will discuss it as if you haven’t heard of it before. Perhaps I will present a fresh perspective that is productive for you.
  2. To make this rigorous it is enough to find two separate pairs of points so that the two slopes you get, one for each line segment connecting points in a pair, are different.
  3. The matrix \(A^{\top}\) is the transpose of matrix \(A\): what you get by flipping rows to columns and vice-versa. The matrix \(A^{\top}A\) will be invertible if the columns of \(A\) are independent.
  4. Except for a vertical line, that is.
  5. Beware! The vector \({\bf x}\) in this equation is completely unrelated to the instances of the modeling problem \(x_1,\ldots,x_m\). I simply used \({\bf x}\) because you want to solve for this vector.
  6. By flipping rows and columns (taking the transpose on a product), saying that \(A^{\top}\begin{bmatrix}x_1 \\ \vdots \\ x_m\end{bmatrix}\) is the zero vector is the same as saying \(\lbrack x_1,\ldots,x_m\rbrack A\) is the zero vector, and this latter product is the row vector of dot products of \(\bf x\) with the columns of \(A\).