Linear algebra refresher I

This post is the first in a series of posts on basic mathematical background that is important in Machine Learning (ML). While the posts are original (created by me), much of my decision-making regarding the order of topics will come from the book Mathematics for Machine Learning (by Deisenroth, Faisal, and Ong), which is available online in its entirety (in early 2020).

I am writing this, and the other posts, as refreshers on needed mathematics. They are not meant, by any means, to be a first introduction to the content. In this post, I will assume that the reader is familiar with matrix operations (matrix multiplication and the transpose, for example) and with Gaussian elimination.


Vector spaces and notation.

A (real) vector space is a set \(\mathsf V\) of objects (“vectors”) that have an addition operation and a scalar multiplication operation. An addition operation means that when you have two vectors, called \({\bf u}\) and \(\bf v\) say, then there is another vector that you call \({\bf u}+{\bf v}\). Importantly, \({\bf u}+{\bf v}\) must be in the set \(\mathsf V\). A scalar multiplication operation means that when you have a vector \({\bf v}\) and a real number \(\alpha\), then there is another vector called \(\alpha{\bf v}\) that is determined by these (it is also in \(\mathsf V\)). The addition and scalar multiplication operations have to satisfy some reasonable conditions (see the Axioms in the table here, think of the F in the table as \(\mathbb R\), the real numbers).

Example. The example that everyone encounters in their first linear algebra class: the vectors are ordered tuples, i.e. lists, of real numbers (each number in the list is called a component of the vector), the addition is defined by adding component-wise, and the scalar multiplication is defined by multiplying every component in the vector by that real number.

Here is an example of the addition and scalar multiplication in action, using vectors with 3 components: let \({\bf u} = (-2,0,1.5)\) and \({\bf v} = (1,3,1)\), and say the scalar \(\alpha = 1.2\). Then

\({\bf u}+{\bf v} = \begin{bmatrix}-2+1 \\ 0+3 \\ 1.5+1\end{bmatrix} = \begin{bmatrix}-1 \\ 3 \\ 2.5\end{bmatrix}\);
\(\alpha{\bf u} = 1.2\begin{bmatrix}-2\\ 0\\ 1.5\end{bmatrix} = \begin{bmatrix}-2.4 \\ 0 \\ 1.8\end{bmatrix}\)

I did not specify it above, but the intended vector space was \(\mathbb R^3\), containing all ordered triples of real numbers. For any integer \(n\geq 1\) there is a vector space called \(\mathbb R^n\) which is defined in this manner, but with vectors that have \(n\) components.
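
If you want to see these operations in code, here is a minimal sketch using NumPy (the library choice is mine); addition and scalar multiplication on arrays are component-wise, just like the definitions above:

```python
import numpy as np

# The vectors u and v from the example above, as 1-D arrays.
u = np.array([-2.0, 0.0, 1.5])
v = np.array([1.0, 3.0, 1.0])
alpha = 1.2

# Addition and scalar multiplication happen component-wise.
assert np.allclose(u + v, [-1.0, 3.0, 2.5])
assert np.allclose(alpha * u, [-2.4, 0.0, 1.8])
```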

There are many other examples of vector spaces that appear both within mathematics and in applications (game theory in economics, for example). For the purposes of ML, however, we will mostly, though not always, stick to subspaces (defined below) of \(\mathbb R^n\). Even so, it is quite useful to be able to write and reason about these things with notation that does not require writing out the components.

Notation Ground Rules. Some conventions are in order. Lowercase, bold-face characters will denote vectors (like I did with \({\bf u}\) and \({\bf v}\) above). Scalars will also be lowercase, but not in bold; I will use Greek letters for the scalars, much of the time. Also, \(\mathsf V\) will denote the vector space containing all the vectors (for this post, at least). Other named sets of vectors (e.g. a basis that has been given a symbol) will also be in a sans serif font (e.g. \(\mathsf U\) or \(\mathsf B\)). Matrices will be uppercase and italicized (e.g. \(A\)).

By default, a vector will be a column vector (like those with brackets above). In Other Words (IOW), if it has \(n\) components, then it is essentially the same as an \(n\times 1\) matrix. This works nicely with matrix multiplication.

If I need to write out the components of a vector horizontally, within a line, I’ll use parentheses: \({\bf u} = (-2,0,1.5)\). However, you should still think of the vector as a column vector. To refer to a row vector, I’ll use either transpose notation, \((-2,0,1.5)^\top\), or square brackets: \([-2,0,1.5]\).

If I write \([{\bf u_1}, {\bf u_2}, \ldots, {\bf u_k}]\), then this is a horizontal array of \(k\) column vectors. IOW, it is a matrix with \(k\) columns, the columns coming from each \({\bf u_i}\).  1 (Notice the square brackets, for matrix.)

Linear combinations. Say that \(\mathsf{U}=\{{\bf u_1}, {\bf u_2},\ldots,{\bf u_k}\}\) is a set of vectors in \(\mathsf V\). A linear combination of vectors in \(\mathsf U\) is a vector which equals

\(\sum_{i=1}^k\alpha_i{\bf u_i} = \alpha_1{\bf u_1}+\alpha_2{\bf u_2}+\ldots+\alpha_k{\bf u_k}\),

where each \(\alpha_i\) is some scalar (maybe even 0). The span of \(\mathsf U\) is written \(\text{Span}(\mathsf U)\), or \(\text{Span}({\bf u_1},{\bf u_2},\ldots,{\bf u_k})\), and is the set of all linear combinations of those vectors — that is, the vectors that can be found to be equal to a linear combination of the ones in \(\mathsf U\). The vector \(\vec{\bf 0}\), which has 0 for every component, is always in the span — choose every \(\alpha_i\) to be 0.

Example. Like in the first example, let \({\bf u} = (-2,0,1.5)\) and \({\bf v} = (1,3,1)\) be vectors in \(\mathbb R^3\). Then \((-5, 9, 9)\) is a linear combination of \({\bf u}\) and \({\bf v}\) since

\(\begin{bmatrix}-5 \\ 9\\ 9\end{bmatrix} = 4\begin{bmatrix}-2 \\ 0\\ 1.5\end{bmatrix} + 3\begin{bmatrix}1 \\ 3\\ 1\end{bmatrix}.\)

It is often useful to think of a linear combination as the result of multiplying a matrix and a vector. That is, we have equality

\(\nu_1{\bf a_1}+\nu_2{\bf a_2}+\ldots+\nu_k{\bf a_k} = A{\bf v}\),

where \(A\) is the matrix \([{\bf a_1},{\bf a_2},\ldots,{\bf a_k}]\) and \({\bf v}=(\nu_1,\nu_2,\ldots,\nu_k)\). In terms of the example that was just done,

\(4\begin{bmatrix}-2 \\ 0\\ 1.5\end{bmatrix} + 3\begin{bmatrix}1 \\ 3\\ 1\end{bmatrix} = \begin{bmatrix}-2 & 1 \\ 0 & 3\\ 1.5 & 1\end{bmatrix}\begin{bmatrix}4 \\ 3\end{bmatrix}\).
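
To see the matrix-vector view of a linear combination numerically, here is a short sketch (assuming NumPy) that checks the equality above:

```python
import numpy as np

a1 = np.array([-2.0, 0.0, 1.5])
a2 = np.array([1.0, 3.0, 1.0])

# A = [a1, a2]: the vectors become the columns of the matrix.
A = np.column_stack([a1, a2])
v = np.array([4.0, 3.0])

# The linear combination 4*a1 + 3*a2 is the same vector as A @ v.
assert np.allclose(4 * a1 + 3 * a2, A @ v)
assert np.allclose(A @ v, [-5.0, 9.0, 9.0])
```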


Linear independence, basis.

A set of vectors \(\mathsf U\) is called linearly independent if the only way to write \(\vec{\bf 0}\) as a linear combination of the vectors in \(\mathsf U\) is by setting all the scalars to 0:

\(\sum_{i=1}^k\alpha_i{\bf u_i}=\vec{\bf 0}\qquad\Rightarrow\qquad\alpha_i=0\text{ for all }i.\)

Otherwise (if there is a different way to get \(\vec{\bf 0}\)), the set is called linearly dependent. There are a number of equivalent ways to think about linearly (in)dependent sets of vectors. One of them is that the set is linearly dependent if and only if one of the vectors in the set is contained in the span of the others.

Exercise. Explain why \(\left\{\begin{bmatrix}2 \\ 1\\ 1\end{bmatrix}, \begin{bmatrix}0 \\ 2\\ 1\end{bmatrix}, \begin{bmatrix}0 \\ 0\\ 1\end{bmatrix}\right\}\) is linearly independent.

Exercise. Explain why saying that “the only solution to \(A{\bf v} = \vec{\bf 0}\) is \({\bf v}=\vec{\bf 0}\)” is the same as saying “the columns of \(A\) make a linearly independent set.”

When \(\mathsf V=\mathbb R^n\), then a linearly independent set can have at most \(n\) vectors in it. For example,

\(\left\{\begin{bmatrix}2 \\ 1\end{bmatrix}, \begin{bmatrix}-1 \\ 3\end{bmatrix}, \begin{bmatrix}100 \\ 0\end{bmatrix}\right\}\)

is linearly dependent. It cannot be independent since the vectors are in \(\mathbb R^2\), but there are 3 of them.
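
A quick way to check claims like this numerically (not a substitute for explaining them, as the exercises ask) is to compare the rank of the matrix whose columns are the given vectors with the number of vectors. A sketch, assuming NumPy:

```python
import numpy as np

def looks_independent(vectors):
    # The columns are linearly independent exactly when the rank of [v1, ..., vk]
    # equals k. This is a floating-point check, not a proof.
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == len(vectors)

# Three vectors in R^2: the rank can be at most 2, so they are dependent.
print(looks_independent([[2.0, 1.0], [-1.0, 3.0], [100.0, 0.0]]))              # False

# The set from the first exercise above: independent.
print(looks_independent([[2.0, 1.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]))  # True
```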

Basis of a space. Using \(\mathsf U\) again to denote a set of vectors in a vector space \(\mathsf V\), \(\text{Span}(\mathsf U)\) is a vector space on its own (using the operations that you had already from \(\mathsf V\)): sums and scalar multiples stay within the span, and \(\vec{\bf 0}\) is in the span. This is called a subspace of \(\mathsf V\). Usually, even when working with vectors in \(\mathbb R^n\) for some \(n\), you really care about working in a subspace.

If \(A\) is an \(m\times n\) matrix, then the set of all vectors in \(\mathbb R^m\) which are equal to \(A{\bf v}\) for some \({\bf v}\) is a subspace of \(\mathbb R^m\). It is the span of the columns of \(A\) (remember the comment above about linear combinations and multiplying by a matrix), and it’s often called the column space of \(A\).

Recall the null space of \(A\), which is the set of all \({\bf v}\) so that \(A{\bf v} = \vec{\bf 0}\). This is a subspace of \(\mathbb R^n\) (when \(A\) is \(m\times n\)), though at this point in the discussion it is not readily apparent how to find a set of vectors whose span equals it.

If \(\text{Span}(\mathsf U)\) is the entire vector space \(\mathsf V\), then call \(\mathsf U\) a generating set (or spanning set) of vectors for \(\mathsf V\). The set \(\mathsf U\) is a basis if it is both linearly independent and a generating set. Informally, a basis is a generating set that has no “redundancy,” meaning that if you took out any vector from that basis, it would no longer be a generating set for \(\mathsf V\).

Example. Let \({\bf b_1}=(1,-1)\) and \({\bf b_2}=(1,2)\). Then \(\{{\bf b_1}, {\bf b_2}\}\) is a basis for \(\mathbb R^2\).

If you remember, an \(n\times n\) matrix \(A\) having \(n\) pivot columns means that the equation \(A{\bf x} = {\bf y}\) has a solution, regardless of which \(n\)-vector \({\bf y}\) is on the right. This also means the null space of \(A\) just contains the zero vector (the only solution to \(A{\bf x} = \vec{\bf 0}\) is \({\bf x} = \vec{\bf 0}\)). The statement about \(A{\bf x}={\bf y}\) having a solution means the columns generate \(\mathbb R^n\). The statement about the null space means that the columns are linearly independent, and so if an \(n\times n\) matrix has \(n\) pivots, then its columns are a basis for \(\mathbb R^n\).
Doing elimination on the matrix \([{\bf b_1}, {\bf b_2}]\) gives

\(\begin{bmatrix}1 & 1\\ -1 & 2\end{bmatrix} \leadsto \begin{bmatrix}1 & 1\\ 0 & 3\end{bmatrix}\),

so this \(2\times 2\) matrix has two pivots.

In \(\mathbb R^n\), the vector \({\bf e_i}\) is the one whose components are all zero except in position \(i\), and the component in position \(i\) is 1. The set \(\{{\bf e_1}, {\bf e_2},\ldots,{\bf e_n}\}\) is called the standard basis of \(\mathbb R^n\). 2 This is the basis with respect to which we typically work in \(\mathbb R^n\). However, for each \(n\), there are many different choices of basis (take the columns of any invertible \(n\times n\) matrix). Using a different, not standard, basis can often be quite helpful.

Every vector space has a basis (and so does every one of its subspaces — remember, they are a vector space also). Moreover, if for some number \(d\) there is a basis of \(\mathsf V\) that has \(d\) vectors in it, then every basis of \(\mathsf V\) has exactly \(d\) vectors in it. 3 This number is called the dimension of \(\mathsf V\).

Remember that the null space of a matrix \(A\) is the set of all \(\bf v\) so that \(A{\bf v} = \vec{\bf 0}\). Remind yourself how to find a basis of the null space of a matrix. (Some help. 4)
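
If you want to check a hand computation, SymPy’s `nullspace` method follows essentially the recipe in the footnote (row reduce, then build one vector per non-pivot column). A sketch with a small matrix of my own choosing, so the exercises below stay unspoiled:

```python
from sympy import Matrix

# A small example matrix (my own choice, not the one in the exercise).
A = Matrix([[1, 2, 3],
            [2, 4, 6]])

# One basis vector for each non-pivot column of the reduced row echelon form;
# here that gives the vectors (-2, 1, 0) and (-3, 0, 1).
basis = A.nullspace()
for b in basis:
    assert A * b == Matrix([0, 0])   # every basis vector is sent to the zero vector
```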

Exercises. Find a basis for the null space of \(\begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 4 \\ 1 & 3 & 7 \\\end{bmatrix}\).

Find a basis for the plane \(3x - y + 6z = 0\) in \(\mathbb R^3\) (think of how the equation defines a null space).

Components in a given basis. Let \(\mathsf B=\{{\bf b_1}, {\bf b_2},\ldots,{\bf b_k}\}\) be a basis for \(\mathsf V\). Remember that \(\mathsf B\) is a generating set, and so every \({\bf v}\) in \(\mathsf V\) is in \(\text{Span}(\mathsf B)\). That is, if you fix a vector \({\bf v}\), then there is a choice of \(\alpha_i\), for \(1\le i\le k\), so that

\({\bf v} = \sum_{i=1}^k\alpha_i{\bf b_i}\).

This would be true of any generating set, but since \(\mathsf B\) is linearly independent, there is only one choice for those \(\alpha_i\) scalars. Once \({\bf v}\) is chosen (as well as the basis), there is no ambiguity as to what each \(\alpha_i\) will be, and we call these the components (or coordinates) of \({\bf v}\) in the basis \(\mathsf B\).
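
In \(\mathbb R^n\), finding the components of a vector in a basis amounts to solving a linear system whose coefficient matrix has the basis vectors as its columns. Here is a sketch (assuming NumPy), using the basis \(\{{\bf b_1},{\bf b_2}\}\) from the earlier example and a vector of my own choosing:

```python
import numpy as np

# The basis from the earlier example: b1 = (1, -1), b2 = (1, 2), as columns.
B = np.column_stack([[1.0, -1.0], [1.0, 2.0]])

# A vector of my own choosing.
v = np.array([5.0, 1.0])

# Its components in the basis: the unique solution of B @ alpha = v.
alpha = np.linalg.solve(B, v)
assert np.allclose(alpha, [3.0, 2.0])      # v = 3*b1 + 2*b2
assert np.allclose(B @ alpha, v)           # reconstruct v from its components
```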

Exercise. Let \({\bf e_1}=(1,0)\) and \({\bf e_2}=(0,1)\) be the standard basis vectors in \(\mathbb R^2\). Let \(\mathsf B=\{(7,-1), (3,1)\}\), which is a basis of \(\mathbb R^2\). Find the components of \({\bf e_1}\) in the basis \(\mathsf B\), and the components of \({\bf e_2}\) in the basis \(\mathsf B\).


Linear mappings and change of basis.

Remember that linear transformations (linear mappings) are functions from one vector space to another (or possibly the same vector space) which “respect” linear combinations. For example, say \(f\) is a linear mapping from \(\mathbb R^{3}\) to \(\mathbb R^{10}\). 5 Then for any \(\bf u\), \(\bf v\) \(\in\mathbb R^3\), and any scalars \(\alpha, \beta\), it is true that

\(f(\alpha{\bf u} + \beta{\bf v}) = \alpha f({\bf u}) + \beta f({\bf v})\).

For example, \({\bf x}=(3,2,-3)\) is a linear combination of \({\bf u}=(1,0,-1)\) and \({\bf v}=(0,1,0)\). So, if you know what \(f({\bf u})\) and \(f({\bf v})\) are, then

\(f({\bf x}) = f(3{\bf u}+2{\bf v}) = 3f({\bf u}) + 2f({\bf v})\),

and so you know where \(f\) must send \(\bf x\) based on what you already know.
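
Here is a small numerical check of that reasoning (assuming NumPy, and using a matrix of my own choosing to play the role of the linear mapping \(f\)):

```python
import numpy as np

# Any matrix gives a linear mapping; this 2x3 example is my own choice.
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, -1.0]])

def f(x):
    return F @ x

u = np.array([1.0, 0.0, -1.0])
v = np.array([0.0, 1.0, 0.0])
x = 3 * u + 2 * v                          # x = (3, 2, -3), as in the text

# f respects linear combinations, so f(x) is determined by f(u) and f(v).
assert np.allclose(f(x), 3 * f(u) + 2 * f(v))
```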

Linear mappings form a very important class of functions on vector spaces; they get used a lot. The fact that they respect linear combinations means that the entire linear mapping is determined by where it sends vectors in a basis. Let \(f:\mathsf{V}\to\mathsf{W}\) be a linear mapping from \(\mathsf V\) to \(\mathsf W\). Choose a basis \(\mathsf B\) of \(\mathsf V\), and a basis \(\mathsf C\) of \(\mathsf W\). For each vector \(\bf b_j\) in \(\mathsf B\), \(f({\bf b_j})\) has components in the basis \(\mathsf C\). Record these: say that \(\mathsf C = \{{\bf c_1}, \ldots, {\bf c_m}\}\) and that \(\mathsf B=\{{\bf b_1},{\bf b_2},\ldots,{\bf b_n}\}\). For each \(j=1,\ldots,n\), suppose that

\(f({\bf b_j}) = a_{1,j}{\bf c_1}+a_{2,j}{\bf c_2}+\ldots+a_{m,j}{\bf c_m}\).

IOW, use \(a_{i,j}\) to denote the component of \(f({\bf b_j})\) corresponding to \(\bf c_i\). Let \(A^{CB}_f\) be the \(m\times n\) matrix with entries \(a_{i,j}\). The mapping \(f\) is completely described by taking, for each \({\bf v}\) in \(\mathsf V\), the tuple of components for \(\bf v\) in the basis \(\mathsf B\), and then multiplying this tuple, as a coordinate vector, by \(A^{CB}_f\). The coordinate vector you get as a result is the tuple of components of \(f({\bf v})\) in the basis \(\mathsf C\). This description is unambiguous because, as was pointed out, the components, once a basis is fixed, are unambiguous.
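
Here is that construction in a sketch (NumPy, with a mapping and bases of my own choosing): column \(j\) of \(A^{CB}_f\) holds the \(\mathsf C\)-components of \(f({\bf b_j})\), which in \(\mathbb R^m\) you can get by solving against the matrix whose columns are the \(\bf c_i\).

```python
import numpy as np

# A linear map f: R^2 -> R^2, written as a matrix in the standard bases (my own example).
F = np.array([[2.0, 1.0],
              [0.0, 1.0]])

# A basis B of R^2 and a basis C of R^2, stored as the columns of matrices (my own choices).
B = np.column_stack([[1.0, -1.0], [1.0, 2.0]])
C = np.column_stack([[1.0, 0.0], [1.0, 1.0]])

# Column j of A_f^{CB}: the C-components of f(b_j), found by solving C @ a_j = F @ b_j.
A_CB = np.column_stack([np.linalg.solve(C, F @ B[:, j]) for j in range(B.shape[1])])

# Check: the B-components of v, multiplied by A_CB, give the C-components of f(v).
v = np.array([5.0, 1.0])
v_in_B = np.linalg.solve(B, v)
fv_in_C = A_CB @ v_in_B
assert np.allclose(C @ fv_in_C, F @ v)
```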

When the linear mapping is from \(\mathbb R^n\) to \(\mathbb R^m\), for some \(m, n\), and the bases are the standard bases on each side, then we will simply write \(A_f\) for the matrix.

Exercise. Using \(\bf e_1, e_2, e_3\) for the standard basis vectors in \(\mathbb R^3\), write down the matrix \(A_p\) associated to the linear mapping that satisfies:

\(p({\bf e_1})=0.5{\bf e_1}+0.5{\bf e_3}\),
\(p({\bf e_2}) = \bf e_2\),
\(p({\bf e_3}) = 0.5{\bf e_1}+0.5{\bf e_3}\).

The mapping here is \(p:\mathbb R^3\to\mathbb R^3\), and you should use the standard basis for the “\(\mathsf B\)” and “\(\mathsf C\)” from above. 6

To get you started, note that the first column of the matrix should be the vector \((0.5,0,0.5)\).

Change of basis. You often want to be able to change the basis being used to represent the mapping. Say that \(\mathsf B_1\) is a basis of \(\mathsf V\) and \(\mathsf C_1\) a basis of \(\mathsf W\), and \(f:\mathsf V\to\mathsf W\) a linear mapping. If \(\mathsf B_2\) and \(\mathsf C_2\) are other bases (for \(\mathsf V\), \(\mathsf W\) respectively), then you can get the matrix \(A^{C_2B_2}_f\) from the matrix \(A^{C_1B_1}_f\) as follows.
The identity map \(id\) (taking every vector in \(\mathsf V\) to itself) has a matrix that switches basis representations, from \(\mathsf B_2\) to \(\mathsf B_1\). Note that this does not mean it takes a vector \(\bf b_j\) in \(\mathsf  B_2\) to some vector in \(\mathsf B_1\). The vector \(\bf b_j\) is taken to itself, like all other vectors, but you figure out the components of \(\bf b_j\) in the basis \(\mathsf B_1\), and that is the column of the matrix corresponding to \(\bf b_j\). The notation would be \(A^{B_1B_2}_{id}\), but abbreviate this notation to \(I_B\). Likewise, for shorthand, write \(I_C\) for the matrix corresponding to the identity function on \(\mathsf W\) which switches from basis \(\mathsf C_2\) to basis \(\mathsf C_1\). Then

\(A^{C_2B_2}_f = I_C^{-1}A^{C_1B_1}_fI_B\).
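
A quick numerical sanity check of this formula (a sketch, with all matrices chosen at random by me; the standard bases play the role of \(\mathsf B_1\) and \(\mathsf C_1\), so \(I_B\) and \(I_C\) simply have the new basis vectors as their columns):

```python
import numpy as np

rng = np.random.default_rng(0)

# The matrix of f: R^3 -> R^2 in the standard bases B_1 and C_1.
A_f = rng.standard_normal((2, 3))

# New bases B_2 (of R^3) and C_2 (of R^2), as the columns of matrices.
# Random Gaussian matrices are invertible with probability 1, so these are bases.
I_B = rng.standard_normal((3, 3))
I_C = rng.standard_normal((2, 2))

# The change of basis formula: A_f^{C_2 B_2} = I_C^{-1} A_f^{C_1 B_1} I_B.
A_new = np.linalg.solve(I_C, A_f @ I_B)

# Check: feeding in the B_2-components of a vector v yields the C_2-components of f(v).
v_in_B2 = rng.standard_normal(3)
v = I_B @ v_in_B2                          # the same vector, in standard coordinates
assert np.allclose(I_C @ (A_new @ v_in_B2), A_f @ v)
```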

Exercise. Let \(p\) be the mapping discussed in the previous exercise. Use the formula just found to get the matrix \(A^{B_2B_2}_p\), where

\(\mathsf{B_2} = \left\{\begin{bmatrix}1\\ 0\\ 1\end{bmatrix}, \begin{bmatrix}0\\ 1\\ 0\end{bmatrix}, \begin{bmatrix}-1\\ 0\\ 1\end{bmatrix}\right\}\).

Hint: Writing \(\mathsf B_1\) to mean the standard basis, the matrix \(I_B\) in this case would be equal to \(\begin{bmatrix}1&0&-1\\ 0&1&0\\ 1&0&1\end{bmatrix}\).

  1. I’ll only do this when there are understood components to the vectors \({\bf u_i}\).
  2. It is easy to check that this is a basis.
  3. This makes the number \(d\) a property of the vector space, rather than the particular basis.
  4. Let \(A'\) be the reduced row echelon form of the matrix. For each non-pivot column of \(A'\), make a vector with 1 in that column’s position, a variable in the position of each pivot column to the left of it, and 0 in all other components. Start with the closest pivot to the left: use the row of \(A'\) containing that pivot’s 1, and solve for the variable so that the product of that row with your vector is zero. Then move up a row to the next pivot and do the same, until every variable has a value. The vectors obtained this way, one for each non-pivot column, form a basis for the null space.
  5. In function notation, \(f:\mathbb R^3\to\mathbb R^{10}\) is linear.
  6. This mapping is a projection to a plane in \(\mathbb R^3\).