Implementing Linear Regression

In the last post I discussed how to find the LSR line for a set of points in the plane, that is, for a set of instances \(\mathcal X^{train} = [x_1,x_2,\ldots,x_m]^{\top}\) that are in \(\mathbb R\), with labels \(\mathcal Y^{train} = (y_1,y_2,\ldots,y_m)\) that are in \(\mathbb R\). Here I will describe a Python implementation of that process “from scratch,” meaning I will create the needed matrix and compute the coefficients for the regression from it. I will also show how to use built-in functions from the NumPy package.

The data I will use for the example comes from the 1985 Ward’s Automotive Yearbook from the UCI Machine Learning Repository. I will also briefly cover the coefficient \(R^2\), an alternative to the MSE for measuring how well the line fits the data.


Recall from the last post that the intercept \(\hat b\) and slope \(\hat w\) of the LSR line are the coordinates of the vector

\(\begin{bmatrix}\hat b \\ \hat w\end{bmatrix} = {\bf \hat x} = (A^{\top}A)^{-1}A^{\top}{\bf y}, \qquad\qquad (1)\)

where \(A\) is the matrix having \(m\) rows of the form \([1, x_i]\), for \(i=1,2,\ldots,m\).

In a Python environment, I import NumPy under the abbreviation np and suppose my list of instances is called x_list (a list containing \(x_1,x_2,\ldots,x_m\)). Then I define A by

A = np.array([ [1, x] for x in x_list ])

Next, calculate \(A^{\top}A\) and store it in the variable AtA by calling

AtA = np.dot(A.transpose(), A)

The matrix \(A\) has only two columns and, in general, \(A^{\top}A\) is invertible exactly when the columns of \(A\) are linearly independent. This means that, as long as x_list contains more than one distinct value, AtA will be invertible. Setting y_list to be the list of labels, I can get the solution in (1) with

x_hat = np.dot( np.dot( np.linalg.inv(AtA), A.transpose() ), y_list )
b_hat, w_hat = x_hat[0], x_hat[1]
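
As an aside, the same x_hat can be obtained without forming the inverse explicitly; the sketch below uses NumPy's solver for the normal equations, or its built-in least-squares routine. Either should give the same result here, though this is not the approach used in the rest of the post.

# Sketch: two ways to compute x_hat without an explicit matrix inverse.
x_hat_solve = np.linalg.solve(AtA, np.dot(A.transpose(), y_list))
x_hat_lstsq, residuals, rank, sv = np.linalg.lstsq(A, y_list, rcond=None)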

I will show an example of how this works on two columns from the Automotive Yearbook dataset: the column containing the engine-size of each vehicle, and the column with the city-mpg, its miles per gallon in city driving.

I import the package Pandas under the abbreviation pd and make a DataFrame called auto_df. The data has been made available as a CSV file,1 with no header line, so the headers must be created manually. Examining the Attribute Information section on the description page for the data, I make a list called columns with a header name for each attribute; for the columns of interest, these names are 'engine-size' and 'city-mpg'.
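
Concretely, the loading step might look like the sketch below. The names in columns are my transcription of the UCI Attribute Information list, so verify them against the description page before relying on them.

import pandas as pd

# Sketch: read the CSV (no header line) and attach column names taken, in order,
# from the UCI Attribute Information section.
columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
           'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
           'wheel-base', 'length', 'width', 'height', 'curb-weight',
           'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system',
           'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm',
           'city-mpg', 'highway-mpg', 'price']
auto_df = pd.read_csv('https://storage.googleapis.com/ml_universities/cars_dataset/cars_data.csv',
                      header=None, names=columns, encoding='latin1')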

A cursory glance at the data in auto_df indicates a few outliers. When I create x_list and y_list, I remove these outliers. This is shown below (the outliers are indices 55-58).

x_list = np.append(auto_df['engine-size'][:55], auto_df['engine-size'][59:])
y_list = np.append(auto_df['city-mpg'][:55], auto_df['city-mpg'][59:])

Upon doing so, x_hat can be found as described above. The solution has \(\hat b \approx 39.7\) and \(\hat w \approx -0.11\).


Scatter plot of engine-size (horizontal) to city-mpg (vertical), with LSR line.

To the right is a plot of the points (in black) with coordinates from x_list and y_list, along with a plot of the regression line \(\hat y = \hat w x + \hat b\) (in red) that was found in the previous section.
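
The post does not show the plotting code, but a figure like this can be produced with Matplotlib along the following lines (a sketch):

import matplotlib.pyplot as plt

# Sketch: scatter the data in black and draw the fitted line in red.
xs = np.linspace(x_list.min(), x_list.max(), 100)
plt.scatter(x_list, y_list, color='black', s=10)
plt.plot(xs, w_hat * xs + b_hat, color='red')
plt.xlabel('engine-size')
plt.ylabel('city-mpg')
plt.show()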

While there is clearly a negative correlation between the quantities in the plot, and the regression line has the lowest MSE among all lines, it doesn't fit spectacularly well. To have a measure of fit whose size is scale-independent, we can consider what is called the "coefficient of determination," \(R^2\), which is related to the MSE and lies between 0 and 1. To compute \(R^2\) from the MSE, use the formula

\(R^2 = 1 - \dfrac{\sum_{i=1}^m(y_i - \hat y_i)^2}{\sum_{i=1}^m(y_i - \bar y)^2} = 1 - \dfrac{MSE}{v_y}\),

where \(\hat y_i = \hat w x_i + \hat b\) is the predicted label for \(x_i\), \(\bar y\) is the average of the \(y_i\), and \(v_y\) is the average squared distance of the \(y_i\) from \(\bar y\) (their variance).

The closer \(R^2\) is to 1, the better the line fits the points. For this dataset, the MSE is 20.726 and the coefficient \(R^2\) is 0.508.
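
In code, reusing A and x_hat from the linear fit, the computation might look like this sketch:

# Sketch: compute MSE and R^2 for the fitted line.
y_arr = np.array(y_list)
y_pred = np.dot(A, x_hat)                    # predicted labels ŷ_i
mse = np.mean((y_arr - y_pred) ** 2)         # about 20.7 on this data
v_y = np.mean((y_arr - np.mean(y_arr)) ** 2)
r_squared = 1 - mse / v_y                    # about 0.508 on this data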


By using a higher-degree polynomial, we can bring \(R^2\) closer to 1. To get a quadratic regression polynomial, define A so that it has three columns:

A = np.array([ [1, x, x**2] for x in x_list ])
Scatter plot of engine-size (horizontal) to city-mpg (vertical), with LSR quadratic polynomial.

With this matrix A, the solution x_hat (found in an identical manner) contains the three coefficients of the polynomial. In other words, we are now fitting a quadratic polynomial \(\hat y = \hat a x^2 + \hat b x + \hat c\) to the data, and x_hat, the solution to (1), is a 3-component vector with x_hat = [c_hat, b_hat, a_hat]. The graph of the polynomial is shown to the right, in red. In this example, x_hat = [56.834, -0.342, 0.001], so the LSR quadratic polynomial is

\(\hat y = 0.001x^2 - 0.342x + 56.834\).

The computation of \(R^2\) comes out as \(R^2 = 0.615\).
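
For completeness, here is a sketch of that computation in the quadratic case, reusing the three-column A and the 3-component x_hat:

# Sketch: unpack the quadratic coefficients and recompute R^2.
c_hat, b_hat, a_hat = x_hat
y_pred = np.dot(A, x_hat)       # equals a_hat*x**2 + b_hat*x + c_hat at each x_i
r_squared = 1 - np.mean((np.array(y_list) - y_pred) ** 2) / np.var(y_list)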


Polynomial regression may also be carried out using the function polyfit from the NumPy package. The function takes as input the list of \(x\)-coordinates (called x_list above), the list of \(y\)-coordinates (called y_list above), and, lastly, the degree of the polynomial; for linear regression that degree is 1. To compute the quadratic polynomial that I computed above with a matrix, the command

np.polyfit(x_list, y_list, 2)

will output the array [0.001, -0.342, 56.834]. Note that the coefficients from polyfit begin with the coefficient of the highest power of the polynomial.
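
The returned coefficients can be fed straight to np.polyval to evaluate the fitted polynomial; for example (a sketch):

coeffs = np.polyfit(x_list, y_list, 2)     # highest-power coefficient first
y_pred = np.polyval(coeffs, x_list)        # evaluate the quadratic at each x_i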


  1. The file can be downloaded, or read directly using Pandas' read_csv function, from the address 'https://storage.googleapis.com/ml_universities/cars_dataset/cars_data.csv'. In the read_csv call, the encoding option 'latin1' should work.