One of my research projects right now, in collaboration with a student, involves a group-theoretic construction called a wreath product. Given two groups (in the algebraic sense), the wreath product of these groups is a new group constructed from them. In this post, I wish to discuss the order of elements in the wreath product and consider how this might obstruct groups from being isomorphic to a (non-trivial) wreath product.
Group actions and wreath products
We will have a group \(H\) that acts on a set \(\Omega\). That is, we have a function \(H\times\Omega \to \Omega\) where, if for each \(h\in H\) and \(\omega\in\Omega\) we write \(h\cdot\omega\) for the image of \((h, \omega)\) under the function, then the function satisfies:
\(1_H\cdot\omega = \omega,\quad\text{for all }\omega\in\Omega;\)
\((hh')\cdot\omega = h\cdot(h'\cdot \omega),\quad\text{for all }h,h'\in H\ \text{and}\ \omega\in\Omega.\)
Equivalently, we may define \(\varphi: H \to \textrm{Sym}(\Omega)\) by setting \(\varphi(h)\), for each \(h\in H\), to be the bijection determined by \(\varphi(h)(\omega) = h\cdot\omega\) for \(\omega\in\Omega\). Being a group action means that \(\varphi\) is a well-defined group homomorphism.
Let \(G\) and \(H\) be groups, where \(H\) acts on a set \(\Omega\). For each \(h\in H\), this action associates to \(h\) an automorphism \(\alpha_h\) of the Cartesian product \(\prod_{\omega\in\Omega} G\) through the action on the indices. In other words, given \((g_{\scriptsize\omega})_{\omega\in\Omega}\), with each \(g_{\scriptsize\omega} \in G\), the automorphism associated to \(h\in H\) is given by
\(\alpha_h\left((g_{\scriptsize\omega})_{\omega\in\Omega}\right) = (g_{\scriptsize h^{-1}\cdot\omega})_{\omega\in\Omega}.\)
With the automorphism on the Cartesian product \(\prod_{\omega\in\Omega} G\) associated to each element of \(H\) understood, the wreath product \(G \wr H\) of groups \(G\) and \(H\) is defined as the corresponding semi-direct product of \(\prod_{\omega\in\Omega} G\) and \(H\). Stated explicitly, elements of \(G \wr H\) have the form \((\bar g, h)\), where \(\bar g = (g_{\scriptsize\omega})_{\omega\in\Omega}\) is an element of the Cartesian product of \(G\) over \(\Omega\), and \(h\in H\); the operation on \(G \wr H\) is then
\((\bar g, h)\,(\bar g', h') = \left(\bar g\,\alpha_h(\bar g'),\ hh'\right),\)
where \(\bar g\,\alpha_h(\bar g')\) denotes the componentwise product in \(\prod_{\omega\in\Omega} G\).
Note that the inverse of \((\bar g, h)\) in the wreath product is equal to \(\left(\alpha_{h^{-1}}((\bar g)^{-1}), h^{-1}\right)\).
Letting \(\{1\}\) denote the trivial group, note that \(\{1\} \wr H \cong H\) for every group \(H\) (and a group action on any set \(\Omega\)). Let’s call this a trivial wreath product since you simply get the outer group that you began with. If \(H = \{1\}\), then, since the automorphism \(\alpha_1\) will be the identity, we get that \(G \wr H \cong \prod_{\Omega}G\) by the map \((\bar g, 1) \mapsto \bar g\).
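To make the multiplication rule concrete, here is a minimal Python sketch. Everything in it is for illustration only: it takes \(G = \mathbb Z_2\) and \(H = \mathbb Z_3\) acting on \(\Omega = \{0,1,2\}\) by cyclic shifts, and the names alpha and multiply are mine.

# Illustration: G = Z_m and H = Z_k acting on Omega = {0,...,k-1} by
# cyclic shifts (omega -> omega + h mod k). An element of G wr H is a
# pair (gbar, h), with gbar a k-tuple of integers mod m.
m, k = 2, 3

def alpha(h, gbar):
    # h permutes the coordinates through the action on indices:
    # the omega-th coordinate of alpha_h(gbar) is the entry at h^{-1}.omega
    return tuple(gbar[(w - h) % k] for w in range(k))

def multiply(x, y):
    # (gbar, h)(gbar', h') = (gbar * alpha_h(gbar'), h h')
    (g1, h1), (g2, h2) = x, y
    prod = tuple((a + b) % m for a, b in zip(g1, alpha(h1, g2)))
    return (prod, (h1 + h2) % k)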
Inhomogeneous wreath products
I want to consider a generalized version of the wreath product. Suppose that \(\Omega\) is partitioned into \(m\) subsets, \(\Omega = \Omega_1\cup\ldots\cup\Omega_m\), and there is a group action of \(H\) on \(\Omega_i\) for each \(i=1,2,\ldots,m\). Let \(G_1, G_2,\ldots, G_m\) be \(m\) groups. The corresponding inhomogeneous wreath product \((G_1,G_2,\ldots,G_m)\wr\wr H\) is defined to be the semi-direct product of \(P\) and \(H\) where
\(P = \left(\prod_{\omega\in\Omega_1} G_1\right)\times\left(\prod_{\omega\in\Omega_2} G_2\right)\times\cdots\times\left(\prod_{\omega\in\Omega_m} G_m\right).\)
Here, the automorphism of \(P\) associated to \(h\in H\) is that obtained by simultaneously using the \(m\) automorphisms on the product of each \(G_i\) over \(\Omega_i\): \(\alpha_h( (\bar g_1, \bar g_2, \ldots, \bar g_m) ) = (\alpha_h(\bar g_1), \alpha_h(\bar g_2), \ldots, \alpha_h(\bar g_m)). \)
If we have one group \(G\) such that \(G_i \cong G\) for all \(i=1,\ldots,m\), then we have that \((G_1,G_2,\ldots,G_m)\wr\wr H\) is isomorphic to \(G\wr H\), using the action of the outer group \(H\) on \(\Omega\).
For \(i=1,\ldots,m\), suppose that \(s_i = |\Omega_i|\) is the cardinality of \(\Omega_i\). Note that, when all of the groups and \(\Omega\) are finite,
\(\left|(G_1,G_2,\ldots,G_m)\wr\wr H\right| = |G_1|^{s_1}|G_2|^{s_2}\cdots|G_m|^{s_m}\,|H|.\)
Orders of Elements and Order of the Wreath Product
Given a finite group \(G\), consider the set \(\{n\in\mathbb N\ |\ \textrm{ord}(g)|n,\ \text{for all }g\in G\}\) and define \(n_G\) to be the minimum element of this set; equivalently, \(n_G\) is the least common multiple of the orders of elements of \(G\).1
Lemma. Let \(G_1, G_2,\ldots, G_m\) be finite groups and, for convenience, write \(n\) for the least common multiple of the \(n_{G_i}\). Let \(H\) be a finite group and \(\Omega\) a set so that there is a partition \(\Omega = \Omega_1\cup\Omega_2\cup\ldots\cup\Omega_m\) such that \(H\) acts on \(\Omega_i\) for each \(i=1,2,\ldots,m\). If \(x \in (G_1,G_2,\ldots,G_m)\wr\wr H\) then \(x^{nn_H} = 1\).
Proof. Consider an element \(x = ((\bar g_1,\bar g_2,\ldots,\bar g_m), h)\) in \((G_1,G_2,\ldots,G_m)\wr\wr H\). Note that \(h^{n_H} = 1_H\).
Now, for each \(i=1,2,\ldots,m\) and \(\omega\in\Omega_i\), since all components of \(\bar g_i\) are in \(G_i\), there exist elements \(g_{i,\omega}^{(h)} \in G_i\) so that, writing \(\bar g_i^{(h)}\) for \((g_{i,\omega}^{(h)})_{\omega\in\Omega_i}\), we have \(x^{n_H} = ((\bar g_1^{(h)}, \bar g_2^{(h)}, \ldots, \bar g_m^{(h)}), 1_{H})\). Now, since \(1_{H}\) acts trivially on each \(\Omega_i\), we have that
\(\left(x^{n_H}\right)^{k} = \left(\left((\bar g_1^{(h)})^{k}, (\bar g_2^{(h)})^{k}, \ldots, (\bar g_m^{(h)})^{k}\right),\ 1_{H}\right),\quad\text{for each }k\in\mathbb N.\)
If \(k\) is a multiple of \(n_{G_i}\) then \((g_{i,\omega}^{(h)})^k\) is the identity in \(G_i\), for any \(\omega\in\Omega_i\). Thus, in the Cartesian product, \((\bar g_{i}^{(h)})^n\) is the identity, for every \(i=1,2,\ldots,m\). This implies the result. \(\blacksquare\)
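Continuing the Python sketch from above (same alpha, multiply, m, and k), here is a brute-force check of the lemma for \(\mathbb Z_2\wr\mathbb Z_3\), where \(n = 2\) and \(n_H = 3\), so every element should satisfy \(x^6 = 1\).

from itertools import product

# All 2^3 * 3 = 24 elements of Z_2 wr Z_3, and the identity element
elements = [(g, h) for g in product(range(m), repeat=k) for h in range(k)]
identity = ((0,) * k, 0)

def power(x, e):
    y = identity
    for _ in range(e):
        y = multiply(y, x)
    return y

# The lemma predicts x^(n * n_H) = x^6 = identity for every x
print(all(power(x, 6) == identity for x in elements))   # True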
The size of \(nn_H\) is generally quite small compared to \(|(G_1,G_2,\ldots,G_m)\wr\wr H|\). Indeed, if one of the \(G_i\) is the trivial group, then we can obtain the same group with a smaller value of \(m\) (or we simply have \(H\) if \(m=1\)). Hence, we may assume that \(|G_i| \ge 2\) for all \(i\).
Also, if \(|\Omega_i| = 1\) for some \(i\), then the construction produces a direct product with \(G_i\). This is another special case.
Outside of the above special cases, \(s_i \ge 2\) and \(|G_i|\ge 2\) for every \(i=1,2,\ldots,m\). Thus, since \(n_{G_i} \le |G_i|\) for each \(i\), we have that \(nn_H \le |G_1||G_2|\ldots|G_m||H|.\) Hence, for any \(x\in (G_1,G_2,\ldots,G_m)\wr\wr H\), we have that the ratio \(\frac{\textrm{ord}(x)}{|H|}\) is not more than the square root of \(|(G_1,G_2,\ldots,G_m)\wr\wr H|/|H|\).
Proposition. Suppose that \(A\) is a finite group and that there are groups \(G_1,G_2,\ldots,G_m,\) and \(H\) such that \(A \cong (G_1,G_2,\ldots,G_m)\wr\wr H\) is a non-trivial wreath product. Furthermore, let \(m\ge1\) be minimal so that this is the case. If there exists \(a\in A\) with \(\textrm{ord}(a) = \frac{|A|}p\) for a prime \(p\) then one of the following holds:
(i) \(A \cong G_1\times G_2\times \ldots\times G_m \times H\), or
(ii) either \(m=1\) and \(A \cong \mathbb Z_p\wr H\) or \(A\cong (\mathbb Z_p\wr H)\times G_2\times\ldots\times G_m\).
Proof. Using \(n\) for the least common multiple of \(n_{G_i}\), for \(i=1,2,\ldots,m\), the previous lemma implies that \(\textrm{ord}(a)\ \big|\ nn_H\). For each \(i\), since \(n_{G_i}\) is the least common multiple of orders of elements in \(G_i\) and \(|G_i|\) is also a common multiple, we have \(n_{G_i}\ \Big|\ |G_i|\). Putting this together, we have that
\(\dfrac{|G_1|^{s_1}|G_2|^{s_2}\cdots|G_m|^{s_m}|H|}{p} = \dfrac{|A|}{p} = \textrm{ord}(a)\ \Big|\ nn_H\ \Big|\ |G_1||G_2|\cdots|G_m||H|,\)
which implies \(|G_1|^{s_1-1}|G_2|^{s_2-1}\ldots|G_m|^{s_m-1}\ \Big|\ p\). If \(|G_i|=1\) for some \(i\) then, having \(H\) act on \(\Omega\setminus\Omega_i\), we have
\((G_1,\ldots,G_m)\wr\wr H \cong (G_1,\ldots, G_{i-1},G_{i+1},\ldots,G_m)\wr\wr H\).
This is not possible as our choice of \(m\) was minimal. Thus, we have \(|G_i|\ge 2\) for all \(i=1,\ldots,m\). Hence, at most one of \(s_1,s_2,\ldots,s_m\) is larger than 1 and if \(s_i > 1\) then \(s_i=2\) and \(|G_i| = p\).
If \(s_i = 1\) for all indices, then we are in case (i): \(A \cong G_1\times G_2\times \ldots \times H\). On the other hand, suppose that \(s_1 = 2\) (WLOG), and \(s_i = 1\) for \(i > 1\). We then have that \(G_1 \cong\mathbb Z_p\). Since \(|\Omega_i| = 1\) for \(i > 1\), the operation defined on the wreath product is identical to the direct product for the other groups \(G_i\). An isomorphism \(\psi\) that realizes \(A\cong (\mathbb Z_p\wr H)\times G_2\times\ldots\times G_m\) may be defined by
\(\psi\left((\bar g_1, g_2, \ldots, g_m),\ h\right) = \left((\bar g_1, h),\ g_2, \ldots, g_m\right).\ \blacksquare\)
Numerical integration (sometimes called numerical quadrature) is a term referring to a variety of procedures (or algorithms) for approximating the numerical value of a definite integral. There are a number of scenarios when using numerical integration on a function \(f(x)\), rather than finding the value of the integral through an anti-derivative, can be advantageous.1 First, you may only have access to some sampled values of \(f(x)\), meaning that you do not have a functional expression for \(f(x)\). This is very common in applications. Second, even if you do have an expression in the variable that defines the function, it may be impossible to find an anti-derivative in terms of elementary functions. We saw examples where this was the case when discussing arclength and surface area. Finally, there are situations when you can compute an anti-derivative, but doing so is much more computationally costly than using numerical integration.
The Trapezoid Rule
One method used in numerical integration is the trapezoid rule. To understand the trapezoid rule, we recall how Riemann sums approximate the area under a curve.
When \(f(x) \ge 0\) for \(x\) within the interval \([a, b]\), then the integral \(\int_a^b f(x)\ dx\) can be interpreted as the area between the graph of \(f(x)\) and the \(x\)-axis. Riemann sums approximate that area by dividing the interval \([a,b]\) into some number of subintervals \([x_i, x_{i+1}]\), with \(0\le i\le n-1\), where \(x_0 = a\) and \(x_n = b\), and using areas of rectangles. In most introductions to Riemann sums, each subinterval has width \(\Delta x = (b-a)/n\), and the rectangle for \([x_i, x_{i+1}]\) has width \(\Delta x\) and height equal to either \(f(x_i)\) or \(f(x_{i+1})\).
In the trapezoid rule, we replace each rectangle with a trapezoid. The trapezoid for the interval \([x_i, x_{i+1}]\) has corners at the four points \((x_i,0)\), \((x_i, f(x_i))\), \((x_{i+1}, f(x_{i+1}))\), and \((x_{i+1}, 0)\). In other words, the top of the trapezoid region is on the secant line for \(f(x)\) between \(x=x_i\) and \(x=x_{i+1}\). The area of that trapezoid is
\(\dfrac{\Delta x}{2}\left(f(x_i) + f(x_{i+1})\right).\)
Adding the areas of the first two trapezoids gives \(\frac{\Delta x}{2}\left(f(x_0) + 2f(x_1) + f(x_2)\right)\). The function value \(f(x_1)\) appears twice since \(x_1\) is an endpoint in two of the intervals. In the approximation from a sum of \(n\) trapezoid areas, the same will occur for values \(f(x_i)\) with \(1\le i\le n-1\). So, to approximate the integral of \(f(x)\) over \([a,b]\), we compute
\(\dfrac{\Delta x}{2}\left(f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)\right).\)
In order for SageMath to make this computation, it needs to know the function \(f(x)\), the endpoints \(a\) and \(b\), and how many subintervals will be used — the value of \(n\). Hence, we should be able to write a custom procedure that takes these four items as input and then makes and returns the trapezoid rule computation. Within the procedure, we will need to determine each \(x_i = a + i\Delta x\).
Say that we want to call the procedure (or function) trap_rule. To make the definition of trap_rule in SageMath we should have code that looks like the following.2
def trap_rule(f, a, b, n):
    # lines that determine Deltax and the list of subinterval endpoints X[i]
    # lines assigning a variable (call it: trap_sum)
    # the value of the sum in the trapezoid rule, times Deltax/2
    return trap_sum
Recall that \(\Delta x = (b-a)/n\). I hinted above at how to get the list of \(x_i\) values: each \(x_i\) is equal to \(a + i\,\Delta x\). We want to get these into a list. The Python way to do this, once Deltax is assigned, is to put the following.
X = [a + i*Deltax for i in range(n+1)]
Then, we can reference \(x_i\) by typing X[i]. (An equivalent way to do this, with more lines of code, is the following for loop.)
X = []
for i in range(n+1):
    X.append(a + i*Deltax)
In SageMath and Python, say that a variable (let’s call it part_sum) has been assigned a numerical value. To add a number to part_sum, say 3 (so that its value is 3 larger the next time you use part_sum), you can type the following.
part_sum += 3
# this adds 3 to part_sum, then reassigns that larger number to part_sum
So, to get our value of trap_sum that will be the approximation to the integral we can put the following in the trap_rule definition (after assigning Deltax and X).
part_sum = f(X[0])
for i in range(1,n):
    part_sum += 2*f(X[i])   # 2*f(X[n-1]) is the last term in the loop
part_sum += f(X[n])
trap_sum = (Deltax/2)*part_sum
Exercise 1. Using what we have discussed in this section, fill out the definition of trap_rule (to get a working procedure) using 7 lines of code that are between the first line def trap_rule(f,a,b,n): and the last line return trap_sum. After running the code block containing the definition of trap_rule, evaluate the following two lines in a new code block (getting an approximation to the integral \(\int_0^3\frac{1}{1+x^2}\ dx\) that uses 10 trapezoids). When you do, you should get an approximation equal to 1.24859642022472.
f(x) = 1/(1+x^2)
trap_rule(f, 0.0, 3.0, 10)
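If you want to check your work on Exercise 1, one possible assembly of the pieces above is the following sketch (write your own version before peeking).

def trap_rule(f, a, b, n):
    Deltax = (b - a)/n
    X = [a + i*Deltax for i in range(n+1)]
    part_sum = f(X[0])
    for i in range(1, n):
        part_sum += 2*f(X[i])
    part_sum += f(X[n])
    trap_sum = (Deltax/2)*part_sum
    return trap_sum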
Exercise 2. With your procedure trap_rule, use the trapezoid rule with 50 trapezoids to approximate the following two integrals.
\(\int_0^2 e^{-x^2}\ dx\)
\(\int_1^3 (\sin(3\ln(x)) + 0.3)\ dx\)
Implementing the Midpoint Rule
The midpoint rule approximates the integral \(\int_a^b f(x)\ dx\) with areas of rectangles over the subintervals \([x_i, x_{i+1}]\), like a right-hand or left-hand Riemann sum. However, the heights of the rectangles do not use values of \(f(x)\) at one of the endpoints; instead we compute the midpoint \(\overline x_i = \frac{x_i+x_{i+1}}{2}\) and set the height of the rectangle over \([x_i, x_{i+1}]\) equal to \(f(\overline x_i)\).
Written out in full, for the midpoint rule we compute
\(\Delta x\left(f(\overline x_0) + f(\overline x_1) + \cdots + f(\overline x_{n-1})\right).\)
To implement this, we can define a procedure in a similar way to how we did with trap_rule. One difference is that, after assigning the list X as we did before, we will want a list of the midpoints. In other words, after the definition of X insert the line below to make a list of midpoints M.
M = [0.5*(X[i]+X[i+1]) for i in range(n)]
In the definition of trap_rule, we made it so that part_sum was equal to \(f(x_0)+2f(x_1)+\ldots+2f(x_{n-1})+f(x_n)\), before multiplying it by \(\Delta x/2\) and returning that as the approximation. In order to get the midpoint rule approximation, let’s make part_sum equal to \(f(\overline x_0) + f(\overline x_1) +\ldots+ f(\overline x_{n-1})\). Doing so only requires a slight adjustment to the lines that deal with part_sum:
part_sum = f(M[0])
for i in range(1,n):
    part_sum += f(M[i])
In addition to those changes, make sure that you do not add f(X[n]) to part_sum after the for loop. Also, for the midpoint approximation you multiply this part_sum by \(\Delta x\) rather than \(\Delta x/2\).
Exercise 3. Within SageMath, in a new code block, define a procedure mid_rule that computes the midpoint rule approximation. This can be done by adapting parts of the definition of trap_rule as explained above. After running the code block with your definition of mid_rule, evaluate the following two lines in a new code block to use the midpoint rule to approximate the integral \(\int_0^3\frac{1}{1+x^2}\ dx\) with 10 rectangles. When you do, you should get an approximation equal to 1.24927020548728.
f(x) = 1/(1+x^2)
mid_rule(f, 0.0, 3.0, 10)
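Again, for checking your work on Exercise 3, one possible version of mid_rule looks like this sketch.

def mid_rule(f, a, b, n):
    Deltax = (b - a)/n
    X = [a + i*Deltax for i in range(n+1)]
    M = [0.5*(X[i] + X[i+1]) for i in range(n)]
    part_sum = f(M[0])
    for i in range(1, n):
        part_sum += f(M[i])
    return Deltax*part_sum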
Exercise 4. With your procedure mid_rule, use the midpoint rule with 50 rectangles to approximate the following two integrals.
The indentation after the first line is an important part of the syntax. In SageMath and Python, after beginning a procedure’s definition, or a for loop, use indented lines until you want to finish that definition or for loop. ↩
This is an introduction to an approximation technique called Fourier approximation. Given a particular function \(y = f(x)\), a Fourier approximation of \(f(x)\) is defined in the following way.
First, for each \(k = 0, 1, 2, \ldots\) we compute the following numbers (“coefficients”)
\(a_k = \dfrac{1}{\pi}\displaystyle\int_{-\pi}^{\pi} f(x)\cos(kx)\ dx \quad\text{and}\quad b_k = \dfrac{1}{\pi}\displaystyle\int_{-\pi}^{\pi} f(x)\sin(kx)\ dx.\)
Notice that, with \(k = 0\), we get that \(\frac12 a_0\) is the average value of \(f(x)\) on the interval \([-\pi, \pi]\) since \(\cos(0)=1\). Also, \(b_0 = 0\) since \(\sin(0) = 0\). Now the \(n^{th}\) Fourier approximation is defined as
\(f_n(x) = \dfrac{a_0}{2} + \displaystyle\sum_{k=1}^{n}\left(a_k\cos(kx) + b_k\sin(kx)\right).\)
For example, the function \(f_2(x)\) will be \(\frac{a_0}{2} + a_1\cos(x) + b_1\sin(x) + a_2\cos(2x) + b_2\sin(2x)\).
Constructing the Fourier Approximation in SageMath
Let’s begin by constructing Fourier approximations for an example function, \(f(x) = x^2\). Once we’ve evaluated a line with the definition f(x) = x^2, we would store the coefficient \(a_2\), for example, in SageMath by setting a variable equal to
1/pi*integral( f(x)*cos(2*x), x, -pi, pi ) \(\left(1\right)\)
As you will see below, we assign an entire list of these coefficients to the variable name a. The coefficient \(a_2\) will be in position 2 of the list, and we can get that coefficient from the list by typing a[2].
Avoiding a pitfall. When \(f(x)\) is a polynomial of \(x\) or \(e^x\) then computing the exact integrals for Fourier coefficients (using anti-derivatives) is relatively quick. For other functions, like the one in Exercise 4, it is more difficult. In order to sidestep the issue, in this lab we’ll use approximate integration to compute Fourier coefficients. We replace the command from \((1)\) with the command below (the [0] makes the output be the approximate integral value only, ignoring the error bound).
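numerical_integral(1/pi*f(x)*cos(2*x), -pi, pi)[0]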
Exercise 1. For the function \(f(x) = x^2\), find the Fourier coefficients \(a_0, a_1\), and \(b_1\), displaying your output in your Sage worksheet.
Say we want to find the coefficients \(a_k\) and \(b_k\) for every \(k\) between \(0\) and \(20\). Then writing down one line at a time like this will be quite a pain! There is a nice succinct way to make a list of all the coefficients that you want. Say that you want to get these coefficients for \(k\) between \(0\) and \(6\). Then you can evaluate the following lines.
a=[ numerical_integral(1/pi*f(x)*cos(k*x), -pi, pi)[0] for k in range(7) ]
b=[ numerical_integral(1/pi*f(x)*sin(k*x), -pi, pi)[0] for k in range(7) ]
These lines make two lists of numbers, one called a and one called b. There are 7 numbers in each list; for example, list b has the numbers \(b_0, b_1, b_2, b_3, b_4, b_5, b_6\), in that order. After SageMath computes these, you can look at \(b_3\) by typing b[3]. (Note where the 7 appears in range(7). This means that values of k go through the list [0, 1, 2, 3, 4, 5, 6].)
Once we have these numbers, if we want to define the \(5^{th}\) Fourier approximation, we can do so with
f5(x) = a[0]/2 + sum( [a[k]*cos(k*x) + b[k]*sin(k*x) for k in range(1,6)] )
What is being done in this SageMath definition of f5(x)? Start by looking at it, and comparing to the definition of \(f_n(x)\) in the introduction.
There is a list of functions that equal coefficient a[k] times cos(k*x) plus b[k]*sin(k*x), one for each k; the range command in that list forces k to start at 1 and go up until 5 (instead of starting at 0). Then I put the list inside of the command sum( ) to add the functions together.1 This is added to a[0]/2.
After computing the Fourier approximation f5(x) for the function \(f(x) = x^2\), we can plot it together with f(x) over the interval \([-\pi, \pi]\). This lets us see how close its values are to those of \(f(x)\). Displaying a plot of the functions together, \(f_5(x)\) in red and the original \(f(x)\) in blue, can be done as follows.
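plot(f5(x), (x, -pi, pi), color='red') + plot(f(x), (x, -pi, pi), color='blue')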
Exercises 2 – 4. For each of the following functions, compute the \(20^{th}\) Fourier approximation \(f_{20}(x)\). Then plot the functions \(f(x)\) and \(f_{20}(x)\) in the same plot over the interval \([-\pi, \pi]\).
\(f(x) = x^3.\)
\(f(x) = e^x.\)
\(f(x) = \dfrac{x^2-x}{1+x^4}.\)
Interval of Approximation
When plotting the functions above, you may have noticed that, over \(x\) values near the endpoints of \([-\pi, \pi]\), the \(y\)-values of the approximation were not very close to the \(y\)-values of \(f(x)\). It gets worse if you compare the two functions outside of the interval \([-\pi, \pi]\) (the interval used for integration). Can this be fixed?
Consider a function \(f(x)\) defined on an interval \([-c, c]\). Recall that multiplying the input by a constant scales the graph horizontally: the graph of \(g(x) = f(2x)\) over \([-c/2, c/2]\) traces the same \(y\)-values that \(f(x)\) takes over all of \([-c, c]\), compressed into half the width.
If we can approximate the function \(f(2x)\) well on the interval \([-\pi, \pi]\), then we should be able to use that to get a good approximation to \(f(x)\) on the interval \([-2\pi, 2\pi]\).
Setup for Exercise 5. Let \(g(x) = e^x\) and let \(f(x) = g(2x)\). The function \(f_{50}(x)\), a Fourier approximation of \(f(x)\), approximates \(g(2x)\) on the interval \([-\pi, \pi]\). So, on the interval \([-2\pi, 2\pi]\), the function \(f_{50}(x/2)\) should be a good approximation to \(g(x)\).
Exercise 5. Compute the Fourier approximation \(f_{50}(x)\) of the function \(g(2x)\), as explained above. Then, display a plot of \(e^x\) along with the good approximation to it, on the interval \([-2\pi, 2\pi]\).
You can always add up a list of things this way in SageMath (or in Python). ↩
Given a function \(y=f(x),\ a \le x \le b\), we see how to find the length of the curve, along the graph, from \((a,f(a))\) to \((b,f(b))\). If this curve is revolved about an axis, it generates a surface of revolution, and we see how to find the resulting surface area.
Arclength
The arclength along \(y=f(x)\) from \((a,f(a))\) to \((b,f(b))\) is determined from the formula
\(\displaystyle\int_a^b \sqrt{1 + \left(f'(x)\right)^2}\ dx.\)
Except in carefully chosen cases, integrals of this form are impossible to evaluate exactly, and one must use numerical methods (approximations). This makes computational software especially useful for computing arclength. We’ll begin with an example where the function \(f(x)\) is “carefully chosen,” so that \(\sqrt{1 + \left(f'(x)\right)^2}\) has an antiderivative in terms of elementary functions.
Example 1. Find the arclength of \(f(x) = \frac23 x^{3/2}\) from the point on the graph where \(x=0\) to the point where \(x=1\). Since \(f'(x) = x^{1/2}\), the integral to be calculated is
\(\int_0^1 \sqrt{1 + x}\ dx.\)
Within SageMath the antiderivative can be calculated by the command integral(sqrt(1+x), x). In order to get the definite integral that is the arclength1 the SageMath command is integral(sqrt(1+x), x, 0, 1), which produces the answer \(\dfrac{4}{3}\sqrt{2} - \dfrac{2}{3}\).
Example 2. Say that you want to find the arclength along the sine curve \(y=\sin(x)\) from \((0,0)\) to \((\pi,0)\). In SageMath, once a function f(x) has been defined (by putting a line f(x)=... that defines it), you can find the derivative by the command diff(f, x). This means that the following would give the arclength of the sine curve above:
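f(x) = sin(x)
integral(sqrt(1 + diff(f, x)^2), x, 0, pi)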
If you’ve gone and attempted to compute the above, you may be confused when the output is not a number at all. Instead, it looks a lot like the line you typed in (especially if you determined the derivative, \(\cos(x)\), yourself; your output should be something like integrate(sqrt( (cos(x))^2 + 1 ), x, 0, pi)).
So…what is going on? It is not just an issue in SageMath; if you were working in another computational software system, like Mathematica, you would also get a weird answer: \(\sqrt{2}\ \texttt{EllipticE}\left[\texttt{x}, \frac12\right]\). The reason for these answers is that the function \(\sqrt{\cos^2(x) + 1}\) does not have an anti-derivative that is in terms of elementary functions; one cannot use an antiderivative in the Fundamental Theorem of Calculus to find the definite integral.
However, you can get an approximate numerical value of the integral. To do so in SageMath, you could put the following2
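numerical_integral(sqrt(cos(x)^2 + 1), 0, pi)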
and get an ordered pair (3.820197789027712, 4.241271544000682e-14). The first number is the approximate value of the integral. The second number is an upper bound on the error; that is, a maximum amount by which the true value of the integral might differ from the approximation.
Exercises
For each of the following curves \(y=f(x)\), if the function being integrated in the formula for arclength has an anti-derivative in terms of elementary functions then write down that anti-derivative. Otherwise, say it does not exist. In both cases, have a computation in your notebook that supports this. Finally, find a numerical value for the arclength.
In this post I will discuss expectation of random variables, as well as the variance of a random variable. For notation and conventions, please refer to my earlier post on random variables.
The expected value of discrete random variables
Consider a discrete random variable \(X:\Omega \to \mathcal T\), with an associated probability function \(p_X\). By definition, the expected value of \(X\) (or expectation, denoted \(\mathbb E[X]\)) is given by1
\(\mathbb E[X] = \displaystyle\sum_{x\in\mathcal T} x\,p_X(x).\)
From one perspective, this is simply the definition and so there is no need for justifying it. However, in order to have some intuition it is helpful to connect it with the more familiar notion of an average, and this is especially clear for a special case (when \(p_X\) is particularly simple).
Example 0. Let \(X\) be a uniformly distributed discrete random variable: suppose that \(|\mathcal T| = m\) and \(p_X(x) = \frac{1}{m}\) for all \(x\in\mathcal T\) (e.g. one roll of a “fair” die, or the flip of a “fair” coin, or what people running polls are theoretically supposed to do when taking a sample). Then \(\mathbb E[X]\) is the average of the numbers in \(\mathcal T\).
Example 1. Recall the example discussed in my initial post on random variables, considering a random variable \(X\) equal to the maximum number showing upon rolling two 6-sided dice. The probability mass function (pmf) for this \(X\) is given by:
\(x\)          1       2       3       4       5       6
\(p_X(x)\)     1/36    3/36    5/36    7/36    9/36    11/36
And so, \(\displaystyle\mathbb E[X] = \sum_{i=1}^6 i\ p_X(i) = \dfrac{161}{36} \approx 4.47.\)
In Example 1, the expected value is not the average of the possible values of \(X\) (average of \(\{1,2,3,4,5,6\}\) is \(3.5\)). An often-used phrase is that this is a weighted average, where \(p_X\) is determining the weight of each value of \(X\). 2 Considering the origin of the probabilities \(p_X(i)\), the sample space \(\Omega\) had 36 outcomes, and the numerator in \(p_X(i)\) is the number of outcomes with maximum number \(= i\). Each outcome produces a max: for example, \(\text{max}=6\) comes from 11 different outcomes, and \(\text{max}=5\) from 9 different outcomes, etc. The simple average of these is
\(\dfrac{1\cdot 1 + 2\cdot 3 + 3\cdot 5 + 4\cdot 7 + 5\cdot 9 + 6\cdot 11}{36} = \dfrac{161}{36},\)
which is precisely \(\mathbb E[X]\). Takeaway: when the probability on \(\Omega\) is defined by \({\bf P}(A) = |A|/|\Omega|\), the expected value of \(X\) is nothing more than the simple average of the numbers \(X(\omega)\), letting \(\omega\) range over \(\Omega\).
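As a quick sanity check (a Python sketch; the variable names are mine), one can enumerate all 36 outcomes and take the simple average of the maximums.

from itertools import product

# All 36 outcomes of rolling two 6-sided dice; average the maximum
maxes = [max(roll) for roll in product(range(1, 7), repeat=2)]
print(sum(maxes) / len(maxes))   # 4.4722... = 161/36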
The expected value of continuous random variables
Let \(X:\Omega \to \mathcal T\) be a continuous random variable with an associated probability density function \(p_X\). As in the previous post, we want to consider only some “nice” subsets of \(\mathbb R\) for the target space \(\mathcal T\) of \(X\). For now, it is enough to assume that either \(\mathcal T = [a,b]\), an interval with \(a < b\), or \(\mathcal T = \mathbb R\).
The expected value of \(X:\Omega \to [a,b]\) is the value of \(\mathbb E[X] = \int_a^b x\,p_X(x)\ dx\). 3 Like in the discrete case, it is appropriate to think of this number as the weighted average of the values that \(X\) might take.
Exercise 1. Compute the necessary integral to verify that if \(X\) is uniformly distributed on \([a,b]\), then \(\mathbb E[X] = (a+b)/2\).
Exercise 2. Compute the necessary integral to verify that if \(X \sim \mathcal N(\mu, \sigma^2)\) is normal, with \(\mathcal T = \mathbb R\), then \(\mathbb E[X] = \mu\). [Two separate substitutions, or one substitution followed by thinking about odd functions, will succeed.]
Exercise 3. Given a number \(\mu > 0\), define \(p(x) = \frac{1}{\mu}e^{-x/\mu}\) for all \(x\ge 0\). Let \(\mathcal T = [0,\infty)\). Then, for some sample space \(\Omega\), a random variable \(X:\Omega \to\mathcal T\) is called exponentially distributed if there is some \(\mu > 0\) so that \(p_X = p\). 4 Find the expected value of such a random variable.
New random variables from old
Say you have a random variable \(X:\Omega\to\mathcal T\), either discrete or continuous (but sticking to target spaces that are subsets of \(\mathbb R\)). Say you also have a function \(f: \mathbb R\to\mathbb R\). 5 By defining \(Y(\omega) = f(X(\omega))\), for every \(\omega\in\Omega\), you get a new random variable. The target space of \(Y\) is \(f(\mathcal T)\).
Even if you don’t have an explicit understanding of \(p_Y\), you can still figure out \(\mathbb E[Y]\) by using the distribution of \(X\). To begin, say that \(X\) is a discrete random variable (with finite \(\mathcal T\)). Then \(f(\mathcal T)\) is finite and you have \(\mathbb E[Y] = \sum_{y_i\in f(\mathcal T)}y_ip_Y(y_i)\). However, note that
\(Y^{-1}(\{y_i\}) = \{\omega\in \Omega\ |\ \text{there is }x \text{ with } X(\omega)=x \text{ and } f(x)=y_i\}.\)
Let \(y_i\in f(\mathcal T)\) and \(\mathcal T = \{x_1,x_2,\ldots,x_n\}\). If \(x_j\ne x_k\) then \(X^{-1}(\{x_j\})\) and \(X^{-1}(\{x_k\})\) are mutually exclusive, and so \(\displaystyle p_Y(y_i) = \sum_{j: f(x_j)=y_i}p_X(x_j)\). Using this equation, and using that an element in \(\mathcal T\) can only be in the preimage of a single element of \(f(\mathcal T)\), and also rearranging summands, we can rewrite the expected value of \(Y\):
\(\mathbb E[Y] = \displaystyle\sum_{y_i\in f(\mathcal T)}\ \sum_{j:\ f(x_j)=y_i} y_i\, p_X(x_j) = \sum_{j=1}^{n} f(x_j)\, p_X(x_j).\)
When \(X\) is instead a continuous random variable with pdf \(p_X\), the analogous formula would be \(\mathbb E[Y] = \int_{\mathcal T} f(x)\,p_X(x)\ dx\).
In fact this does work. 6 I will give an explanation for why in a later section. The slogan is that using continuous variables involves a limit of what happens with discrete random variables.
Exercise 4. Let \(X\) be a continuous random variable, uniformly distributed over \([a,b]\). Find the expected value of \(Y = X^2\). [Your answer should tell you that \(\mathbb E[Y]\) has the following values for the corresponding intervals.]
\([a,b]\)             \([-1,1]\)    \([1,2]\)    \([1,10]\)
\(\mathbb E[X^2]\)    1/3           7/3          37
Variance
Fix some number \(\mu\). Given random variable \(X\), define a new random variable \(V = (X - \mu)^2\). If you choose \(\mu = \mathbb E[X]\), then the variance of \(X\) is given by \(\mathbb E[V]\), which is equal to (both discrete and continuous cases):
\(\displaystyle\sum_{x\in\mathcal T}(x-\mu)^2\,p_X(x) \qquad\text{or}\qquad \int_{\mathcal T}(x-\mu)^2\,p_X(x)\ dx.\)
When \(X\) is a random variable that has a normal distribution \(\mathcal N(\mu, \sigma^2)\), then the number \(\sigma^2\) in the notation is the variance of \(X\). Variance is a way to measure how diffuse (or “spread out”) the random variable \(X\) is. The larger the variance, the more it is spread out. Since it uses squared differences from the mean \(\mu=\mathbb E[X]\), it is not equal to the expected (average) distance from \(X\) to its mean. 7 The average distance to the mean would be equal to the expected value of the random variable \(|X - \mu|\).
Exercise 5. Let \(X\) be a uniformly distributed random variable over the interval \([a, b]\). For the length of the interval, write \(\ell = b - a\), and notice that \(b - \mathbb E[X] = \mathbb E[X] - a = \ell/2\). Compute the variance of \(X\); note that it can be written simply as a function of \(\ell\).
Expected value of function of a random variable: Rationale
🐉🐉 Given a continuous random variable \(X\) with target space \(\mathcal T\), and function \(f\) that has \(\mathcal T\) in its domain, define \(Y = f(X)\). For simplicity assume that \(\mathcal T\) equals an interval \([a,b]\), though the reasoning here would extend to other cases. In any case, I also assume that \(f(x)p_X(x)\) is integrable over \(\mathcal T\).
Given a number \(n\in\mathbb N\), choose a subdivision \(P_n\) of the interval: \(a = x_0 < x_1 < \ldots < x_n = b\), with a set of points \(x_i^*\), where \(x_{i-1} \le x_i^* \le x_i\) for \(i = 1,2,\ldots, n\). Define \(\mathcal T(P_n) = \{x_i^*\ |\ i = 1,2,\ldots, n\}\).
With this choice of points and subdivision, a discrete random variable \(X_n:\Omega\to\mathcal T(P_n)\) can be defined as follows. For \(\omega\in\Omega\), there is a unique \(i\) such that \(x_{i-1}\le X(\omega) < x_i\) (in the case \(X(\omega) = x_n = b\), let this \(i = n\)). Using that unique \(i\) for \(\omega\), define \(X_n(\omega) = x_i^*\). Then, define \(Y_n(\omega) = f(X_n(\omega))\).
Note that \(\displaystyle p_{X_n}(x_i^*) = {\bf P}(x_{i-1}\le X <x_i) = F_X(x_i) - F_X(x_{i-1})\). And so, using the formula we have for expectation of a discrete random variable,
\(\mathbb E[Y_n] = \displaystyle\sum_{i=1}^{n} f(x_i^*)\left(F_X(x_i) - F_X(x_{i-1})\right).\)
If you replace \((F_X(x_i) - F_X(x_{i-1}))\) in the above sum with \(\Delta_i F_X\), then this sum becomes recognizable as the Riemann sum for \(P_n\) that would have limit equal to \(\int_{\mathcal T}f(x)\ dF_X.\) The differential \(dF_X\) is equal to \((F_X)'(x)\ dx = p_X(x)\ dx\). And so, our assumptions about integrability imply that 8
\(\displaystyle\lim_{n\to\infty}\mathbb E[Y_n] = \int_{\mathcal T} f(x)\,p_X(x)\ dx.\)
Some details remain. If a slightly stronger assumption is imposed on \(f\), namely that it is piecewise continuous on \([a,b]\), then we can be done. Indeed, if \(f\) is piecewise continuous, then the limit of \(Y_n\) must agree with \(Y\) almost everywhere (at all but, possibly, finitely many points, in fact). Then it must be that the expected value of \(Y\) is the same as the above limit.
While this is easiest to think about when \(\mathcal T\) is finite (and recall that we had said in the last post that our discrete variables would be just those with \(\mathcal T\) being finite), this definition will work if \(\mathcal T\) is a countably infinite set as long as the infinite sum is absolutely convergent. For example, it could be that \(\mathcal T = \mathbb N\) and that \(p_X(n)\) is exponentially decaying as \(n\to\infty\). For an example of an honest probability mass function \(p_X\) that does not give \(X\) a well-defined expectation, consider \(p_X(n) = \dfrac{6}{\pi^2n^2}\) (here \(\mathcal T = \mathbb N\)). ↩
Formally, using the phrase weighted average is simply giving another name to the expected value. But giving something a helpful name can be surprisingly useful. ↩
Provided that the function \(x\ p_X(x)\) is integrable over \([a,b]\). ↩
Note that the only possible values of \(X\) are in \([0,\infty)\). Alternatively, one could have \(\mathcal T = \mathbb R\) by setting \(p_X(x) = 0\) for any \(x < 0\). ↩
In earnest, you only need the domain of \(f\) to contain \(\mathcal T\). ↩
Provided that the function \(f(x)p_X(x)\) is integrable over \(\mathcal T\). ↩
Nor is the square root of the variance (the standard deviation). ↩
The limit involving \(P_n\) such that the maximum width of a subinterval \((x_{i-1}, x_i)\) in \(P_n\) approaches \(0\) as \(n\to\infty\). ↩
In this post I will go over the basics of random variables and their associated distributions. I will begin with discrete random variables and then will discuss continuous random variables. My focus will mainly be on random variables with values in \(\mathbb R\), though I will briefly mention other types.
Afterwards, I will discuss two different senses that you can think of random variables arising when analyzing data. In the post, some basic knowledge of probability will be assumed. If you feel that you need a review, or don’t remember the jargon, work through the post Probability Refresher.
Before I jump into the details, informally a random variable is just a way of assigning numbers to outcomes (from a sample space).1 Don’t get too hung up on the word ‘random’ that is used here. The idea is that there is an element of chance to what the value of the variable will be (the chance coming from which outcome occurs), and that’s it. Here are a few examples, trying to illustrate the breadth of subjects where random variables arise.
Example 1. Say that each review on a website must choose a rating of 1 through 5 stars (no fractions of stars being allowed). Call each review an outcome, and make the rating given be the value of the random variable on that outcome.
Example 2. Based on observations, it is estimated that about 1 in 5 “Sun-like” stars have an Earth-sized planet in their habitable zone. Make an outcome be a search for such a planet around 5 Sun-like stars, and the random variable equal to the number of stars in the search where such a planet is found.
Example 3. The height of a 2-year-old child who was selected from a group of 2-year-olds. The child that you select is the outcome and their height is the value of the random variable.
Example 4. How many dollars a customer that visits a particular store/site will spend, given that they do spend money. The outcome is the visit that results in a purchase, the value is the dollar amount.
More formally, let \((\Omega, {\bf P})\) be a probability space (a sample space with a given probability function \({\bf P}\)). A random variable \(X\) is a function that has domain \(\Omega\), and takes values in some target space \(\mathcal T\) (writing \(X:\Omega \to \mathcal T\)). I will use the convention that \(X\) is onto. So, every \(x\in\mathcal T\) equals \(X(\omega)\) for some \(\omega\in\Omega\).2 For most of this post, the target space is a subset of the reals, \(\mathcal T \subseteq \mathbb R\), and often a “small” subset of \(\mathbb R\).3
Discrete random variables
A discrete random variable \(X\) is a random variable where \(\mathcal T\) is finite.4 The variables in Examples 1 and 2 above are discrete random variables, as are the examples just after this paragraph. Whenever \(\Omega\) is a finite set, \(\mathcal T\) must be finite, making \(X\) discrete. The target space can also be finite when \(\Omega\) is infinite. Until the section on continuous random variables, all random variables should be assumed to be discrete.
Example 5. The maximum number showing after rolling two 4-sided dice. This is a random variable where \(\Omega\) is the set of results you can get by rolling two dice. So, if \((1,4)\) represents the outcome of rolling and having one die show a 1 and the other show a 4, then the value of the random variable is \(4\).
Example 6. The Blackjack score, upon drawing two cards. (For simplicity, assume you are the only one being dealt two cards from the deck.) In this example, \(\Omega\) is the set of two-card hands. For example, if you get a hand of \(\{Q\heartsuit, 9\clubsuit\}\) then the numerical value is \(19\).
In Example 5, \(\mathcal T = \{1,2,3,4\}\). In Example 6, an Ace can be interpreted as 11 points or 1 point. If you decide that the score is the largest possible (without going over 21), then \(\mathcal T = \{4, 5, 6, 7, \ldots, 20, 21\}\).
Every discrete random variable \(X:\Omega\to\mathcal T\) has an associated function \(p_X\) that assigns a probability: \(p_X(x) = {\bf P}(X = x)\) for each \(x\in\mathcal T\). In terms of events, \(p_X(x) = {\bf P}(\{\omega\in\Omega\ |\ X(\omega) = x\})\).
Following others, when the random variable is unambiguous, I will simply write \(p(x)\). This function \(p\) is called the probability mass function of \(X\) (pmf for short). 5 The figure displayed represents the graph of the pmf of the “Blackjack score” random variable of Example 6.6 Each bar at \(x\) has height equal to \(p(x)\).
Returning to Example 2. Let \(X\) be the number of stars, from five searches, where a habitable, Earth-sized planet is found. What is the pmf \(p(x)\) of this random variable? Its target space will be \(\{0,1,2,3,4,5\}\).
First, we work from the estimate that 1 in 5 of the Sun-like stars have such a planet. So, for each searched star there is a \(20\%\) chance of finding one (and \(80\%\) chance of not finding one).
For none to be found, the \(80\%\) probability event happens 5 times, so \(p(0) = (0.8)^5 = 0.32768\). When 1 planet is found, then exactly one of the searches resulted in the \(20\%\) probability event; since there are \(5=\binom{5}{1}\) possibilities for which search that occurred in, \(p(1) = 5\cdot(0.2)\cdot(0.8)^4 = 0.4096\). When 2 planets are found, then two of the searches resulted in the \(20\%\) probability event; accounting for the possible ways this can occur, we get \(p(2) = \binom{5}{2}\cdot(0.2)^2\cdot(0.8)^3 = 0.2048\). The full pmf is given by the following table.7
\(x\)        0          1         2         3         4         5
\(p(x)\)     0.32768    0.4096    0.2048    0.0512    0.0064    0.00032
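The table can be checked with a few lines of Python (a sketch using only the standard library), since these are binomial probabilities with 5 trials and success probability 0.2.

from math import comb

# p(x) = C(5, x) * (0.2)^x * (0.8)^(5 - x), for x = 0, 1, ..., 5
pmf = [comb(5, x) * 0.2**x * 0.8**(5 - x) for x in range(6)]
print(pmf)   # approximately [0.32768, 0.4096, 0.2048, 0.0512, 0.0064, 0.00032]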
Exercise 1. Find the four values of the pmf of the random variable in Example 5.
I will also sometimes write \(X\sim p(x)\), which is read as either \(X\) is distributed according to \(p(x)\), or as \(X\) is sampled from the distribution \(p(x)\).
Exercise 2. Show: If \(p(x)\) is a pmf then \(p(x)\geq 0\), for every \(x\in\mathcal T\), and \(\sum_{x\in\mathcal T}p(x) = 1\).
You can think of \(\mathcal T\) as a sample space. Then the function \(p_X:\mathcal T \to [0,1]\) for a random variable \(X\) determines a probability function on \(\mathcal T\), where the probability of a subset is the sum of values of \(p_X\) on its elements. That is, for \(A\subseteq \mathcal T\), a probability function is given by
\({\bf P}_X(A) = \sum_{x\in A} p_X(x)\).
Exercise 3. Check the claim that \({\bf P}_X\) defines a probability function (recalling the three properties required for probability functions).
It is common to use \({\bf P}(X \leq x)\) for the probability of the outcomes \(\omega\) where \(X(\omega) \leq x\). The cumulative distribution function (or cdf), which I will write as \(F_X:\mathcal T \to [0,1]\), records these probabilities. That is, for each \(x\in\mathcal T\),
\(F_X(x) = {\bf P}(X \leq x)\).
Since our discrete random variables have finite \(\mathcal T\), there is a smallest number in \(\mathcal T\), and just finitely many between that smallest and \(x\). And so, \(F_X(x)\) can be found by summing \(p_X(t)\) over those \(t\in\mathcal T\) where \(t\leq x\):
\(F_X(x) = \sum_{t\leq x} p_X(t)\).
Like with the pmf of \(X\), I will drop the subscript and use \(F(x)\) when it does not create ambiguity.
Example 7. Change the random variable of Example 5 above to the maximum showing after rolling two 6-sided dice (assume each side has equal probability — a “fair” die). Let values of \(X\) be this maximum. Let’s find \(F_X(2)\). Think of there being a first roll and a second roll, the rolls being independent; so there are 36 outcomes in \(\Omega\). There is only one way for \(X = 1\): when both rolls are 1, and so \(p(1) = 1/36\). To get \(X=2\) requires either rolling one 1 and one 2, or rolling two 2’s. The first one happens in two ways, 1 then 2, or 2 then 1. The second possibility only happens in one way. Therefore, \(p(2) = 3/36\) and so,
\(F_X(2) = p(1) + p(2) = \dfrac{1}{36} + \dfrac{3}{36} = \dfrac{1}{9}.\)
Exercise 4. Using the \(X\) defined in Example 6, find \(F(20)\).
[Hint: It is easier to find \(p(21) = {\bf P}(X = 21)\). How does this help to find the probability that you want?]
Joint distributions of two random variables
Given two random variables \(X\) and \(Y\), with the same sample space, conditional probabilities have a familiar notation. Specifically, \({\bf P}(X=x\ |\ Y=y)\) is the probability that \(X(\omega)=x\) when \(\omega\) is drawn from \(Y^{-1}(\{y\})\). The joint probability \({\bf P}(X=x, Y=y)\) is the probability of \(X^{-1}(\{x\})\cap Y^{-1}(\{y\})\); that is, the probability in \(\Omega\) of the subset of \(\omega\) such that both \(X(\omega)=x\) and \(Y(\omega) = y\) hold.
Because of the definition of conditional probability, \({\bf P}(X=x\ |\ Y=y) = \frac{{\bf P}(X=x, Y=y)}{{\bf P}(Y=y)}\).
Returning to Example 6. Let \(X\) be the score of the two-card hand, and let \(Y\) be the number of face cards in the two-card hand. This means that the target space of \(Y\) is \(\{0,1,2\}\). Since there are \(\binom{12}{2}\) ways of getting 2 face cards, we have that
\({\bf P}(Y=2) = \dbinom{12}{2}\Big/\dbinom{52}{2} = \dfrac{66}{1326} = \dfrac{11}{221}.\)
What are \({\bf P}(X=20, Y=2)\) and \({\bf P}(X=20\ |\ Y=2)\) equal to? First, given that you have 2 face cards, your Blackjack score will be 20. This means that \({\bf P}(X = 20\ |\ Y=2)\) should equal \(1\). To check that this works from the definition, the set of outcomes where \(Y=2\) is a subset of the outcomes where \(X = 20\), so the intersection is all outcomes where \(Y=2\), and so \({\bf P}(X=20, Y=2) = {\bf P}(Y=2)\). This means that
\({\bf P}(X=20\ |\ Y=2) = \dfrac{{\bf P}(X=20,\ Y=2)}{{\bf P}(Y=2)} = \dfrac{{\bf P}(Y=2)}{{\bf P}(Y=2)} = 1.\)
You can view the pair \((X, Y)\) as a single random variable; remember, they have the same sample space. However, the target space will have to be in \(\mathbb R^2\), instead of just being a set of real numbers. So, your random variable is \((X,Y): \Omega \to \mathcal T\), where \(\mathcal T\subseteq \mathbb R^2\). As both \(X\) and \(Y\) are discrete (in this section), the \(\mathcal T\) here is a finite set of points in the plane.
The random variable \((X, Y)\) has a pmf, defined by the joint probability, \(p(x, y) = {\bf P}(X=x, Y=y)\). It is also not uncommon to write \(p(x\ |\ y)\) to mean the conditional probability \({\bf P}(X=x\ |\ Y=y)\).8 The cdf for \((X, Y)\) is defined by setting
\(F_{(X,Y)}(x, y) = {\bf P}(X\le x,\ Y\le y) = \displaystyle\sum_{t\le x}\ \sum_{u\le y} p(t, u)\)
(where each \(t\) is in the target of \(X\), and \(u\) in target of \(Y\)).9
Joint probability table. With two discrete random variables \(X, Y\), you can write a probability table of the joint probabilities \(p(x,y)\) (this describes the pmf of \((X, Y)\)). Of course, once the size of \(\Omega\) gets to be too big, it makes less sense to do this.
Exercise 5. Given the joint probability table below, 10 find \(F_{(X,Y)}(2, 1) = {\bf P}(X\le 2, Y\le 1)\).
\(Y\ \big\backslash\ X\)    1       2       3       4
3                           1/8     0       0       0
2                           1/8     1/8     0       0
1                           1/8     1/8     1/8     0
0                           1/16    1/16    1/16    1/16
The Sum Rule. Given a multivariate random variable \((X, Y)\), distributed according to \(p(x, y)\), it is immediate from definitions that \(p(x\ |\ y)p_Y(y) = p(x, y)\). There is another identity, allowing you to find \(p_X\) from the joint distribution.11 It can be quite useful and involves summing over possible values of \(Y\). Suppose that \(\mathcal T_Y = \{y_1, y_2, \ldots, y_k\}\) is the target space of \(Y\). Then we get the
Sum Rule
\(p_X(x) = \sum_{i=1}^kp(x, y_i)\), which also equals \(\sum_{i=1}^kp(x\ |\ y_i)p_Y(y_i)\).
Exercise 6. From the previous joint probability table, use the sum rule to find \(p_X(1)\) and \(p_Y(0)\). Your answers should satisfy \(p_X(1)p_Y(0) = \frac{7}{64}\); note this is different than the joint probability \(p(1, 0) = \frac{1}{16}\).
If it were the case that the joint probability equals the product of the individual probabilities, then the sum rule would follow simply from the fact that \(\sum_{i=1}^kp_Y(y_i) = 1\). But, this is usually not the case (see Exercise 6 above, and the section on independent variables below).
The reason the identity holds is that (1) preimages of different values are mutually exclusive (if \(y_1\ne y_2\) then \(Y^{-1}(\{y_1\})\cap Y^{-1}(\{y_2\}) = \emptyset\)), and (2) intersections “distribute” over unions. In more detail, let \(S = X^{-1}(\{x\})\). Then, we get
\(p_X(x) = {\bf P}(S) = {\bf P}\left(\displaystyle\bigcup_{i=1}^k \left(S\cap Y^{-1}(\{y_i\})\right)\right) = \sum_{i=1}^k {\bf P}\left(S\cap Y^{-1}(\{y_i\})\right) = \sum_{i=1}^k p(x, y_i).\)
Independent random variables
Recall that events \(A\) and \(B\) are independent if \({\bf P}(A\cap B) = {\bf P}(A){\bf P}(B)\). Two random variables \(X, Y\) are called independent if one has \(F_{(X,Y)}(x, y) = F_X(x)F_Y(y)\) for every \((x, y)\) in the target space. 12 As we are working with discrete random variables, this is the same as having \(p(x, y) = p_X(x)p_Y(y)\) for every \((x, y)\). 13
The two random variables determined by the probability table in Exercise 5 are not independent. While random variables (that one tends to consider together) are often not truly independent, it is, nevertheless, a common assumption that is made in order to make computations more manageable (or even feasible, sometimes). In the habitable planet example, one could think of the result of each of the five searches as a random variable, and the computations that were done there assumed that any two of them were independent variables.
Continuous random variables
When the target space of a random variable is infinite, 14 then probabilities become more subtle. A continuous random variable is a random variable whose target space is some “nice” subset of \(\mathbb R\). To describe what is meant by “nice” is a long road,15 but it includes intervals of \(\mathbb R\), and a union of finitely many intervals. In this section, all random variables are continuous.
To illustrate the subtlety of an infinite target space, say that \(\mathcal T = [0,2]\), and that I am interested in having every number in the interval have equal probability. It quickly becomes clear that I cannot do things the same as in the discrete case. For example, if the probability \(p(x)\) is constant and positive for every \(x\in[0,2]\), then simply adding these won’t have a total of 1 (whether you think of that sum as not defined, or as infinite, it’s not what we want). On the other hand, \(p(x)=0\) for each \(x\) would seem to make the total probability be zero (if sums are being used).
The point is that simple sums don’t work for continuous random variables. They must be replaced with integrals. When one does this, the probability functions are a little different. Specifically, limits are in play (in the background, as derivatives and integrals).16
However, a continuous random variable is still distributed according to a function, its probability density function (or pdf). I will cover a couple of common classes of pdf’s. For a function \(f\) to be eligible as a pdf of a random variable (with target space \(\mathcal T\)) it must satisfy
\(f(x) \geq 0\) for all \(x\in\mathcal T\)
\(\int_{\mathcal T}f(t)\ dt = 1\).
Exercise 7. Find a function \(f\), with \(\mathcal T\) being some interval, that satisfies these two conditions, but where \(f(x) > 1\) for some \(x\in\mathcal T\).
You might not be familiar with the form of the integral in the second condition. If \(\mathcal T\) is an interval, let’s say \(\mathcal T = [a,b]\), then the integral \(\int_{\mathcal T}f(t)\ dt\) is the same as \(\int_a^bf(t)\ dt\).
Rather than the values of \(f(x)\) representing the probability of \(X\) taking value \(x\), you should instead think of the probability of \(X \sim f(x)\) taking a value in some range, and this is handled with an integral:
\({\bf P}(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2}f(t)\ dt\).
Uniform Distributions. Let \([a, b]\) be a closed interval of \(\mathbb R\), with \(a < b\). Define a constant function \(f:[a, b] \to \mathbb R\) by setting \(f(x) = \frac{1}{b-a}\). A random variable \(X:\Omega\to [a, b]\) is called uniformly distributed (over \([a,b]\)) if \(X\) is distributed according to this constant function \(f\). Equivalently, for any positive length \(\ell\), and for any \(x\in[a, b]\) that has \(x + \ell \leq b\), we get that
\({\bf P}(x \leq X \leq x+\ell) = \int_x^{x+\ell}\frac1{b-a}dt\).
Note that the value of this integral does not depend on \(x\). This means that for every subinterval of length \(\ell\), regardless of where it appears in \([a, b]\), the probability of \(X\) being in it is \(\frac{\ell}{b - a}\). In this sense, numbers in different parts of the interval have equal likelihood of being a value of \(X\).
The target space of this random variable is \([a, b]\), making \(a\) the smallest possible value, and so
\(F(x) = {\bf P}(X \leq x) = {\bf P}(a \leq X \leq x) = \displaystyle\int_a^x \frac{1}{b-a}\ dt = \frac{x-a}{b-a}.\)
Thus, the cdf of a variable that is uniformly distributed over \([a,b]\) is \(F(x) = \frac{x-a}{b-a}\).
Exercise 8. For a uniformly distributed \(X\), use the definitions of \({\bf P}(X\leq x)\) and \({\bf P}(x_1\leq X\leq x_2)\) found above, and properties of integrals, to verify the following.
For any \(x\in [a, b]\), \({\bf P}(X = x) = 0\).
For any \(x\in [a,b]\), \({\bf P}(X \geq x) = 1 – {\bf P}(X \leq x)\).
I’ll call your attention to the fact that, for this uniformly distributed variable, \(\frac{d}{dx}F(x) = \frac{1}{b-a} = f(x)\). The Fundamental Theorem of Calculus tells us that this holds generally, for any continuous random variable on \([a, b]\):
\(\dfrac{d}{dx}F_X(x) = p_X(x).\)
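Normal Distributions. Given numbers \(\mu\) and \(\sigma > 0\), define the function \(f:\mathbb R\to\mathbb R\) by
\(f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/(2\sigma^2)}.\)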
A random variable \(X\) which has this \(f(x)\) for its pdf is called normally distributed (with mean \(\mu\) and standard deviation \(\sigma\)). It is common to see people write \(X \sim \mathcal N(\mu, \sigma^2)\).
To be sure, it is more of a challenge to see that the integral over \(\mathbb R\) of this function is equal to 1,18 but it is. The most probable ranges of values of \(X\) are centered around \(\mu\).
As the general construction prescribes, the cumulative distribution function is \(F(x) = \int_{-\infty}^x f(t)\ dt\). However, unlike with uniformly distributed variables, this cdf does not have a “closed form.” To make computations for it, it is necessary to use approximate (numerical) computations of the integral. It can be worthwhile to be familiar with its value at \(x = \mu + k\sigma\) for some different multiples of \(\sigma\).19
Exercise 9. If \(X \sim \mathcal N(\mu, \sigma^2)\), and \(F(x)\) is the cdf of \(X\), explain why \(F(\mu) = 0.5\). [The explanation can be made without actually carrying out the calculation of an integral (assuming it is known that the integral over \(\mathbb R\) of \(f(x)\) is equal to 1). To do this, do a substitution \(u = x – \mu\), and think about the function \(f(u)\). Does it have a symmetry you can exploit?]
There are a number of random variables that occur “in nature,” which seem to be modeled well by assuming they have a normal distribution. Some examples might include: the strength of an observed hydrogen bond between two water molecules; the temperature reading from a thermometer, on a given individual; the weight of a 30-year old female from a given region.
Despite what was said in the previous paragraph, many naturally occurring random variables are not modeled well with an underlying normal distribution, e.g. incomes, time to the next earthquake, the population density of a locale, and more. One should be careful about making an assumption that a random variable is normally distributed.
Sampling Values of a Random Variable
For the random variables that I have discussed to this point (both discrete and continuous), I had complete knowledge of the probability distribution (the pmf for discrete variables, and the pdf for continuous variables). For example, when \(X\) was the Blackjack score of a two-card hand, knowing the draw came from a (well-shuffled) 52-card playing deck, one can determine, by counting outcomes, the probability of any value of \(X\). When discussing a normally, or uniformly, distributed continuous variable, I essentially defined the random variable by the probability density function.
For real world examples of random variables, like those that I have brought up, you almost never have that complete knowledge. At the end of the last section, I mentioned a random variable with values being the weights of 30-year old females from a given region. Let’s consider it.
To be concrete, say that you give “30 years old” its calendar meaning: anyone whose 30th birthday has arrived, but whose 31st birthday hasn’t yet (even if it is the next day!). Also, say the sample space is 30-year-old individuals, identifying as female, who live in Spain. For an \(\omega\) in this set of people, \(X(\omega)\) is the exact weight of that person.
Now, there is an exact weight of each of the people mentioned, at a given time at least. So, \(X\) is truly distributed according to some function at that time. That distribution function is called the population distribution. But…think about the difficulties of determining this function. How do you get exact weights, for one? Even if you use some approximate weight, it changes a bit during the course of a day. Since there are millions of people in the set, can you realistically ever find the population distribution?
The distributions of most random variables similarly suffer from being impossible to know completely (think of Example 4, where some of the outcomes you care about will occur in the future). So, we instead work with an empirical distribution for \(X\), which is the distribution of some observations of values of that variable. In other words, after someone makes observations of values of those variables and records them, that data is the empirical distribution.
Example 9. The set of all petal lengths of Iris flowers have a population distribution. Using the Iris data set, we have an empirical distribution. For example, we can find its mean, getting: average_petal_length = 3.758.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data into a DataFrame with named feature columns
iris = load_iris()
iris_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# The mean of the observed petal lengths (the empirical mean)
average_petal_length = np.mean(iris_data['petal length (cm)'].values)
When the data collected is “representative,” then the empirical distribution is a good approximation to the population distribution. In that case, quantities like the empirical mean, computed above for Iris petal lengths, will be good approximations of the population mean.
There are situations, when you have a large data set, where you want to simulate the process of making observations and getting data. Numpy is very useful for this, by using its random package of functions. For example, if your data set is in a list called data that has length L, one data point in each index, you can pick a random index by calling np.random.randint(L). Using this, you can pull up a random data point by calling data[np.random.randint(L)]. Repeating this some number of times, you can simulate collecting data. This can be a very useful technique when determining model parameters to make it less sensitive to noise.
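For instance, a short sketch of that simulation idea (the list data here is hypothetical):

import numpy as np

data = [5.1, 4.9, 4.7, 4.6, 5.0]   # hypothetical list of observations
L = len(data)
# Pick one data point uniformly at random
point = data[np.random.randint(L)]
# Simulate collecting 3 observations (sampling with replacement)
sample = [data[np.random.randint(L)] for _ in range(3)]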
In fact, you don’t need to restrict yourself to numbers only. Doing so just simplifies the discussion, so it is done this way most of the time. For an example of a random variable that does not have numerical values: the value being the manufacturer of the next car seen traveling down a highway. ↩
One could, instead, choose to have \(\mathcal T\) be the codomain, not requiring the function to be onto. There is no meaningful difference in the choice, but it is important to remember which choice was made. ↩
In full generality, there are some technicalities on the target space that must be set. There is a collection of its subsets that have a measure, turning \(\mathcal T\) into something called a measurable space, and probabilities all come from subsets that have a measure. Such concerns essentially can be avoided when dealing only with discrete random variables. When I get to continuous random variables, I will try to keep it as simple as possible. ↩
Or countably infinite, but we will completely ignore that setting. ↩
While \(p\) does not make reference to the variable \(X\), I am trying to walk a fine line between consistency with the MML book‘s notation, and choosing helpful notation that doesn’t collide with other notation in use. Whenever it seems needed, I will point out that I am discussing the pmf of a random variable. ↩
The figure is not really a histogram, just bars to show the value of the pmf. ↩
We have made the assumption that the searches are independent, pairwise (see the section on independent random variables). ↩
This is pretty bad notation, since the underlying random variable is different (its distribution uses conditional probabilities, given \(Y=y\) for that particular \(y\); so it does not have the same sample space). The authors of the MML book admit as much. However, it is commonly used, so we will use it. ↩
In some sense, one could argue that this definition of the cdf is the “natural” generalization to higher dimensional random variables. It should be noted, though, that using it means that you are really thinking of \(X\) and \(Y\) individually, as separate random variables. To some degree, you are treating the coordinates (or features) as more essential than the points themselves. This will heavily influence the results of doing analysis on sample points from that random variable. ↩
In this probability table, \(X\) is the minimum of rolling two 4-sided dice (equal chance of each number 1 through 4), and \(Y\) is the result of subtracting the smaller roll from the larger (so \(Y = 0\) if the rolled numbers are equal). ↩
When you start with \((X, Y)\) and its joint distribution, the distribution \(p_X\) is sometimes called the marginal distribution on \(X\). ↩
When the events involved have positive probability, then this is the same as saying that, for every \(x, y\), the probability of the event \(X\le x\) is the same as its conditional probability given that \(Y\le y\). ↩
Beginning with the smallest \(x\) (and smallest \(y\)) in the respective target spaces, a double induction argument shows that these are the same. ↩
Especially when uncountably infinite, like an interval of \(\mathbb R\). ↩
To fully treat the notion of probabilities on infinite sets requires a discussion of a mathematical construction called a measure. I will avoid this and try to keep it simple, defining probabilities on \(\mathbb R\), or certain “nice” subsets of \(\mathbb R\). ↩
Recall that an integral, by definition, is the limit of a sum of (signed) areas of rectangles, the rectangles thinning as you have more of them. If you take care with the heights of the rectangles, then this will allow you to generalize a finite sum of probabilities to the infinite setting. ↩
If you happen to have a physics background, a density is always some type of derivative. ↩
For \(\mu = 0\) and \(\sigma = 1/\sqrt{2}\), there is a nice way to use polar coordinates to show the integral equals 1. Some easy alterations to the calculation then show how it generalizes to arbitrary \(\mu\) and \(\sigma\). ↩
In this post, I will address a few topics that are considered in the realm of geometry, since they are concerned with angles and lengths. However, they also have a close relationship to linear algebra, and they are often discussed in a first course on linear algebra.
The dot product and angles.
If \(\bf u\) and \(\bf v\) are vectors in \(\mathbb R^n\), their dot product \({\bf u}\cdot{\bf v}\) is the number that is equal to \({\bf u}^{\top}{\bf v}\) (thinking of this as matrix multiplication). More explicitly, if \({\bf u} = (u_1,u_2,\ldots,u_n)\) and \({\bf v} = (v_1,v_2,\ldots,v_n)\) then
\({\bf u}\cdot{\bf v} = u_1v_1 + u_2v_2 + \ldots + u_nv_n.\)
The dot product is:
symmetric: \({\bf u}\cdot{\bf v} = {\bf v}\cdot{\bf u}\).
linear on each side — on the right side this means, \({\bf u}\cdot({\bf v}+{\bf w}) = {\bf u}\cdot{\bf v}+{\bf u}\cdot{\bf w}\) and also \({\bf u}\cdot(\alpha{\bf v}) = \alpha ({\bf u}\cdot{\bf v})\). On the left side it works the same.
positive definite: the inequality \({\bf u}\cdot{\bf u} \geq 0\) is true, and it equals 0 only if \({\bf u} = \vec{\bf 0}\).
These three properties are all direct consequences of how multiplication of real numbers works, and how the dot product is defined. For example, the positive definite property holds because, for a real number \(u_i\), it must be that \(u_i^2\geq 0\), and the only way to get 0 when adding non-negative numbers together is if each of the numbers is zero, i.e. each \(u_i^2=0\), and so each \(u_i=0\).
Example. Let \({\bf a} = (4, 1, 0)\), \({\bf b} = (1,-4,1)\), and \({\bf c} = (0, 2, 2)\). Then, since \({\bf a}\cdot{\bf b} = 4 – 4 = 0\), you can quickly see that
The length (also called the Euclidean norm, or \(\ell_2\)-norm) of a vector \(\bf u\) in \(\mathbb R^n\) is equal to \(\sqrt{{\bf u}\cdot{\bf u}}\). To help this make sense, if the vector is in the plane (\(\mathbb R^2\)), and say that \({\bf u} = (u_1, u_2)\), then by moving horizontally along the \(x\)-axis to get to \(u_1\), and then moving vertically to get to \(u_2\), you arrive at the end of the arrow representing \(\bf u\) (see the figure). The Pythagorean theorem says that the length of that arrow must be \(\sqrt{u_1^2 + u_2^2} = \sqrt{{\bf u}\cdot{\bf u}}\). When the vector is in \(\mathbb R^n\), you can do this “two coordinates at a time,” applying the Pythagorean theorem each time. 1
Some properties of the norm.
For a scalar \(\alpha\), you get \(|\alpha{\bf u}| = |\alpha||{\bf u}|\). 2
An inequality that relates the dot product to the length of vectors is the Cauchy-Schwarz inequality. It says
\(|{\bf u}\cdot{\bf v}| \leq |{\bf u}||{\bf v}|\)
for any vectors \({\bf u}, {\bf v}\) in \(\mathbb R^n\). You can remember this inequality by remembering the relationship of the dot product to the angle \(\theta\) between two vectors, likely encountered in a first linear algebra course or in Calculus III:
\({\bf u}\cdot{\bf v} = |{\bf u}||{\bf v}|\cos\theta.\)
Since \(-1\leq \cos\theta\leq 1\), this equation implies the Cauchy-Schwarz inequality. 3
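As a quick numerical check of these facts (a minimal NumPy sketch; I reuse the vectors \(\bf a\) and \(\bf b\) from the example above):
import numpy as np
u = np.array([4.0, 1.0, 0.0])   # the vector a from the example
v = np.array([1.0, -4.0, 1.0])  # the vector b from the example
dot = np.dot(u, v)
norms = np.linalg.norm(u) * np.linalg.norm(v)
# Cauchy-Schwarz: |u.v| <= |u||v|
assert abs(dot) <= norms
# the angle between u and v, from u.v = |u||v|cos(theta)
theta = np.arccos(dot / norms)
print(theta)  # pi/2 here, since u.v = 0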
In addition, the norm satisfies the triangle inequality. That is, for any \({\bf u}, {\bf v}\), it must be that \(|{\bf u}+{\bf v}| \leq |{\bf u}| + |{\bf v}|\).
Proof. If \({\bf u}\cdot{\bf v} \geq 0\) then \({\bf u}\cdot{\bf v} = |{\bf u}\cdot{\bf v}|\). And so, using the Cauchy-Schwarz inequality,
\(|{\bf u}+{\bf v}|^2 = ({\bf u}+{\bf v})\cdot({\bf u}+{\bf v}) = |{\bf u}|^2 + 2\,{\bf u}\cdot{\bf v} + |{\bf v}|^2 \leq |{\bf u}|^2 + 2|{\bf u}||{\bf v}| + |{\bf v}|^2 = (|{\bf u}| + |{\bf v}|)^2.\)
Taking the square root of both sides gives the desired inequality.
If \({\bf u}\cdot{\bf v} < 0\) instead, then you don’t need to use \(|{\bf u}\cdot{\bf v}| \leq |{\bf u}||{\bf v}|\). A negative dot product gives
\(|{\bf u}+{\bf v}|^2 = |{\bf u}|^2 + 2\,{\bf u}\cdot{\bf v} + |{\bf v}|^2 < |{\bf u}|^2 + |{\bf v}|^2 \leq (|{\bf u}| + |{\bf v}|)^2,\)
and taking square roots again gives the inequality.
The distance between two points \(\bf x = (x_1, x_2,\ldots, x_n)\) and \(\bf y = (y_1, y_2,\ldots, y_n)\) is defined to be the length of the vector beginning at one point and ending at the other. So, the distance equals \(|{\bf x} – {\bf y}|\).
Angles and Orthogonality.
In the section above, we saw that the angle between two vectors \(\bf u\) and \(\bf v\) is defined through the dot product. When \({\bf u}\cdot{\bf v}\) is positive, then the angle is acute (less than \(\pi/2\)), since those are the angles where the cosine is positive. When \({\bf u}\cdot{\bf v}\) is negative, then the angle is obtuse (more than \(\pi/2\)).
Vectors \(\bf u\) and \(\bf v\) are called orthogonal if \({\bf u}\cdot{\bf v} = 0\) (so, the angle is \(\pi/2\)). Something nice happens if you have a basis where each pair of basis vectors is orthogonal, and each one has length one (an orthonormal basis):
Say that \(\{\bf{q_1, q_2,\ldots, q_k}\}\) is an orthonormal basis and that \({\bf x} = \alpha_1{\bf q_1}+\alpha_2{\bf q_2}+\ldots+\alpha_k{\bf q_k}\). Then \({\bf x}\cdot{\bf q_i} = \alpha_i\) for each \(i\).
Example. Let \(\mathsf Q = \{{\bf q_1}, {\bf q_2}, {\bf q_3}\}\), where
Then find \({\bf x} = \begin{bmatrix}1\\ 0\\ 1\end{bmatrix}\) as a linear combination of the vectors in \(\mathsf Q\).
Note that \(|{\bf q_1}| = |{\bf q_2}| = |{\bf q_3}| = 1\). Also, \({\bf q_1}\cdot{\bf q_2} = {\bf q_1}\cdot{\bf q_3} = {\bf q_2}\cdot{\bf q_3} = 0\). And so \(\mathsf Q\) is orthonormal. Thus, since
\({\bf x}\cdot{\bf q_1} = \dfrac{1+\sqrt{2}}{2},\ {\bf x}\cdot{\bf q_2} = \dfrac{-1+\sqrt{2}}{2},\ \textrm{and }{\bf x}\cdot{\bf q_3} = \dfrac{1}{\sqrt{2}}\)
it must be that \({\bf x} = \dfrac{1+\sqrt{2}}{2}{\bf q_1} + \dfrac{-1+\sqrt{2}}{2}{\bf q_2}+\dfrac{1}{\sqrt{2}}{\bf q_3}\).
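Here is a quick numerical sanity check of this fact (a minimal NumPy sketch; the orthonormal basis below is one I chose for illustration, not the basis \(\mathsf Q\) from the example):
import numpy as np
# an orthonormal basis of R^3, chosen for illustration
q1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
q2 = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
q3 = np.array([0.0, 0.0, 1.0])
x = np.array([1.0, 0.0, 1.0])
# the components of x in this basis are just dot products
alphas = [np.dot(x, q) for q in (q1, q2, q3)]
reconstructed = sum(a * q for a, q in zip(alphas, (q1, q2, q3)))
print(np.allclose(reconstructed, x))  # True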
Every subspace has an orthonormal basis.
If \(\mathsf V\) is a (non-zero) subspace of \(\mathbb R^n\) and has a basis \(\{{\bf b_1},{\bf b_2},\ldots,{\bf b_k}\}\), where \(1\leq k\leq n\), then there is a process to find an orthonormal basis of \(\mathsf V\), called the Gram-Schmidt process. It proceeds as follows, first getting a set \(\mathsf Q\) of vectors that are pairwise orthogonal.
The Gram-Schmidt process. First, set \({\bf q_1} = {\bf b_1}\). Now, instead of using \({\bf b_2}\) for the next vector, we subtract a vector from it in order to get a vector that is orthogonal to \({\bf q_1}\). More precisely, we define
\({\bf q_2} = {\bf b_2} - \left(\dfrac{{\bf b_2}\cdot{\bf q_1}}{{\bf q_1}\cdot{\bf q_1}}\right){\bf q_1}.\)
The vector \(\left(\dfrac{{\bf b_2}\cdot{\bf q_1}}{{\bf q_1}\cdot{\bf q_1}}\right){\bf q_1}\) is called the projection of \(\bf b_2\) to \(\bf q_1\). So, to get \(\bf q_2\) you subtract the projection of \(\bf b_2\) to \(\bf q_1\) from the vector \(\bf b_2\). We can directly check that \({\bf q_1}\cdot{\bf q_2} = 0\):
\({\bf q_1}\cdot{\bf q_2} = {\bf q_1}\cdot{\bf b_2} - \left(\dfrac{{\bf b_2}\cdot{\bf q_1}}{{\bf q_1}\cdot{\bf q_1}}\right)({\bf q_1}\cdot{\bf q_1}) = {\bf q_1}\cdot{\bf b_2} - {\bf b_2}\cdot{\bf q_1} = 0.\)
The vectors \(\bf q_1\) and \(\bf q_2\) will be elements of \(\mathsf Q\). Define the other elements, \({\bf q_3},\ldots,{\bf q_k}\), by starting with the corresponding \(\bf b_i\) vector and, for each \(j < i\), subtract its projection to \(\bf q_j\). For example,
\({\bf q_3} = {\bf b_3} - \left(\dfrac{{\bf b_3}\cdot{\bf q_1}}{{\bf q_1}\cdot{\bf q_1}}\right){\bf q_1} - \left(\dfrac{{\bf b_3}\cdot{\bf q_2}}{{\bf q_2}\cdot{\bf q_2}}\right){\bf q_2}.\)
This gives a set \(\mathsf Q = \{{\bf q_1}, {\bf q_2}, \ldots, {\bf q_k}\}\) of pairwise orthogonal vectors (\({\bf q_i}\cdot{\bf q_j} = 0\) if \(i \ne j\)). 4 We know that \(\mathsf Q \subset \mathsf V\) since each of the vectors in \(\mathsf Q\) is defined as a linear combination of vectors known to be in \(\text{Span}({\bf b_1},\ldots,{\bf b_k})\).
Exercise. Explain why \(\{{\bf q_1}, {\bf q_2}, \ldots, {\bf q_k}\}\) is an independent set of vectors.
Finally, to get an orthonormal basis one needs to multiply each \(\bf q_i\) by the reciprocal of its length. So, define
\({\bf u_i} = \dfrac{1}{|{\bf q_i}|}{\bf q_i}\)
for each \(1\le i\le k\). Then \(\{{\bf u_1}, {\bf u_2}, \ldots, {\bf u_k}\}\) is an orthonormal basis of \(\mathsf V\).
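The whole process is short to code. Below is a minimal NumPy sketch (the function name gram_schmidt is mine), assuming the input vectors are linearly independent:
import numpy as np
def gram_schmidt(basis):
    """Return an orthonormal basis spanning the same subspace."""
    us = []
    for b in basis:
        q = b.astype(float)
        # subtract the projection of b onto each earlier direction
        for u in us:
            q = q - np.dot(b, u) * u
        us.append(q / np.linalg.norm(q))  # normalize to length one
    return us
# example: the plane used in the projection example below
u1, u2 = gram_schmidt([np.array([1, -1, 1]), np.array([4, 2, 0])])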
We noticed before that, for the purpose of writing \(\bf x\) as a linear combination of basis vectors, it is very nice to have the basis be orthonormal (since, in that case, the components of \(\bf x\) are its dot products with the basis elements). This is quite useful when you want the vector in a given subspace that is closest to \(\bf x\).
Orthogonal Projection.
The goal here is the following. You have a vector \(\bf x\), and a subspace \(\mathsf V\), in \(\mathbb R^n\). Thinking of \(\bf x\) as a point in \(\mathbb R^n\), there is a point in \(\mathsf V\) that is closest to it. Call the vector of this point the orthogonal projection of \(\bf x\) to \(\mathsf V\), or \(\text{proj}_{\mathsf V}({\bf x})\) for short. (The name is because the vector that starts from \(\bf x\) and ends at \(\text{proj}_{\mathsf V}({\bf x})\) is orthogonal to every vector in \(\mathsf V\). 5) You want to be able to find \(\text{proj}_{\mathsf V}({\bf x})\) from \(\bf x\).
Say that we have an orthonormal basis \(\mathsf B_{ON}\) of \(\mathsf V\). Then:
(1) the set of vectors orthogonal to every vector of \(\mathsf V\) is a subspace of \(\mathbb R^n\) (it equals the null space of the matrix whose rows are the vectors in \(\mathsf B_{ON}\));
(2) if you take a basis of that orthogonal subspace and combine (take a union of) all of its vectors with those in \(\mathsf B_{ON}\), then this is a basis \(\mathsf B\) of \(\mathbb R^n\).
So, our vector \(\bf x\) can be written as a linear combination of vectors in \(\mathsf B\). If you take the part of the linear combination that only uses the basis vectors that came from \(\mathsf B_{ON}\), then this is the projection \(\text{proj}_{\mathsf V}({\bf x})\). In other words, say that \(\mathsf B_{ON} = \{{\bf u_1}, {\bf u_2}, \ldots, {\bf u_k}\}\) (remember, this is an orthonormal basis of \(\mathsf V\)) and that \(\mathsf B = \{{\bf u_1}, {\bf u_2}, \ldots, {\bf u_k}, {\bf b_{k+1}}, \ldots, {\bf b_n}\}\). Then, writing \(\bf x\) in this basis, say we get
\({\bf x} = \alpha_1{\bf u_1}+\ldots+\alpha_k{\bf u_k} + \alpha_{k+1}{\bf b_{k+1}} + \ldots + \alpha_n{\bf b_n}.\)
It must be that \(\text{proj}_{\mathsf V}({\bf x}) = \alpha_1{\bf u_1}+\ldots + \alpha_k{\bf u_k}\). Wonderfully, we don’t even have to worry about the other vectors \({\bf b_{k+1}},\ldots,{\bf b_n}\), or their scalar coefficients, because:
\({\bf x}\cdot{\bf u_i} = \alpha_i,\quad\text{for each } 1\le i\le k,\)
since each \({\bf b_j}\), \(j > k\), is orthogonal to every \({\bf u_i}\) and so contributes nothing to the dot product.
Example. Let \({\bf a} = (4,2,0)\) and \({\bf b} = (1,-1,1)\). The subspace \(\text{Span}({\bf a}, {\bf b})\) is a plane in \(\mathbb R^3\). Find the orthogonal projection of \({\bf x} = (0, 8, 4)\) to this plane.
First, find an orthonormal basis for the plane. Let \({\bf q_1}={\bf b}\) and, following Gram-Schmidt, define
\({\bf q_2} = {\bf a} - \left(\dfrac{{\bf a}\cdot{\bf q_1}}{{\bf q_1}\cdot{\bf q_1}}\right){\bf q_1} = (4,2,0) - \dfrac{2}{3}(1,-1,1) = \left(\dfrac{10}{3}, \dfrac{8}{3}, -\dfrac{2}{3}\right).\)
Normalizing, \({\bf u_1} = \dfrac{1}{\sqrt 3}(1,-1,1)\) and \({\bf u_2} = \dfrac{1}{\sqrt{168}}(10, 8, -2)\).
Now we can get the projection. Call the plane that we are projecting to \(\mathsf P\). Calculate that \({\bf x}\cdot{\bf u_1} = \dfrac{-4}{\sqrt{3}}\) and \({\bf x}\cdot{\bf u_2} = \dfrac{\sqrt{56}}{\sqrt{3}}\). And so,
\(\text{proj}_{\mathsf P}({\bf x}) = \dfrac{-4}{\sqrt 3}{\bf u_1} + \dfrac{\sqrt{56}}{\sqrt 3}{\bf u_2} = \left(-\dfrac43, \dfrac43, -\dfrac43\right) + \left(\dfrac{10}{3}, \dfrac{8}{3}, -\dfrac{2}{3}\right) = (2, 4, -2).\)
Takeaway: to get the orthogonal projection of a vector \(\bf x\) to a subspace \(\mathsf V\), find an orthonormal basis of \(\mathsf V\), take all the dot products of \(\bf x\) with the vectors in the orthonormal basis (this gives you scalars \(\alpha_1,\ldots,\alpha_k\)), then the projection of \(\bf x\) is the linear combination of the orthonormal basis vectors that uses those scalars.
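That recipe translates directly into code; a minimal sketch (the function name proj is mine) that reproduces the example:
import numpy as np
def proj(x, onb):
    # orthogonal projection of x onto the span of an orthonormal basis
    return sum(np.dot(x, u) * u for u in onb)
# the orthonormal basis of the plane found above
u1 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3)
u2 = np.array([10.0, 8.0, -2.0]) / np.sqrt(168)
x = np.array([0.0, 8.0, 4.0])
print(proj(x, [u1, u2]))  # [ 2.  4. -2.]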
First, use \(u_1\) and \(u_2\) in their coordinate plane, getting an arrow of length \(\sqrt{u_1^2+u_2^2}\). Then, use that length and a perpendicular segment of length \(u_3\), and fill these out to a right triangle. By the Pythagorean theorem, the hypotenuse, with coordinates \((u_1,u_2,u_3)\), has length
\(\sqrt{\left(\sqrt{u_1^2+u_2^2}\right)^2+u_3^2} = \sqrt{u_1^2+u_2^2+u_3^2}\).
This can be extended to any number of coordinates, giving length \(|{\bf u}| = \sqrt{u_1^2+\ldots+u_n^2}\). ↩
The \(|\ |\) on the \(\alpha\) is absolute value and on the \(\bf u\) it is the norm, or length, in \(\mathbb R^n\). But note, when \(n=1\) the norm is the same thing as the absolute value. You can think of real numbers as vectors in \(\mathbb R^1\). ↩
More accurately, the Cauchy-Schwarz inequality is the more fundamental fact; it can be proved (even for more general inner products) with no reference to angles. In fact, typically the (cosine of the) angle between the vectors is said to be defined by the ratio \({\scriptsize\dfrac{{\bf u}\cdot{\bf v}}{|{\bf u}||{\bf v}|}}\). Cauchy-Schwarz guarantees this definition makes sense. ↩
The verification that \({\bf q_j}\cdot{\bf q_i} = 0\) for all \(j < i\) is almost entirely the same as the verification above that \({\bf q_1}\cdot{\bf q_2} = 0\), by noting that we already know that the various \(\bf q_j\), \(j < i\), are pairwise orthogonal. ↩
You can check this using the dot product — essentially, it’s the Pythagorean theorem. Pick a vector from \(\bf x\) to a point in \(\mathsf V\) that is not orthogonal to \(\mathsf V\); the dot product and the Pythagorean theorem show that it must be longer than the one starting at \(\bf x\) which is orthogonal to all vectors in \(\mathsf V\). ↩
This post is the first in a series of posts on basic mathematical background that is important in Machine Learning (ML). While the posts are original (created by me), much of my decision-making regarding the order of topics will come from the book Mathematics for Machine Learning (by Deisenroth, Faisal, and Ong), which is available online in its entirety (in early 2020).
I am writing this, and the other posts, as refreshers on needed mathematics. They are not meant, by any means, to be a first introduction to the content. In this post, I will assume that the reader is familiar with matrix operations (matrix multiplication and the transpose, for example) and with Gaussian elimination.
Vector spaces and notation.
A (real) vector space is a set \(\mathsf V\) of objects (“vectors”) that have an addition operation and a scalar multiplication operation. An addition operation means that when you have two vectors, called \({\bf u}\) and \(\bf v\) say, then there is another vector that you call \({\bf u}+{\bf v}\). Importantly, \({\bf u}+{\bf v}\) must be in the set \(\mathsf V\). A scalar multiplication operation means that when you have a vector \({\bf v}\) and a real number \(\alpha\), then there is another vector called \(\alpha{\bf v}\) that is determined by these (it is also in \(\mathsf V\)). The addition and scalar multiplication operations have to satisfy some reasonable conditions (see the Axioms in the table here, think of the F in the table as \(\mathbb R\), the real numbers).
Example. The example that everyone encounters in their first linear algebra class: the vectors are ordered tuples, i.e. lists, of real numbers (each number in the list is called a component of the vector), the addition is defined by adding component-wise, and the scalar multiplication is defined by multiplying every component in the vector by that real number.
Here is an example of the addition and scalar multiplication in action, using vectors with 3 components: let \({\bf u} = (-2,0,1.5)\) and \({\bf v} = (1,3,1)\), and say the scalar \(\alpha = 1.2\). Then
\({\bf u}+{\bf v} = \begin{bmatrix}-2\\ 0\\ 1.5\end{bmatrix}+\begin{bmatrix}1\\ 3\\ 1\end{bmatrix} = \begin{bmatrix}-1\\ 3\\ 2.5\end{bmatrix},\qquad \alpha{\bf v} = 1.2\begin{bmatrix}1\\ 3\\ 1\end{bmatrix} = \begin{bmatrix}1.2\\ 3.6\\ 1.2\end{bmatrix}.\)
I did not specify it above, but the intended vector space was \(\mathbb R^3\), containing all ordered triples of real numbers. For any integer \(n\geq 1\) there is a vector space called \(\mathbb R^n\) which is defined in this manner, but with vectors that have \(n\) components.
There are many other examples of vector spaces that appear both within mathematics and in applications (game theory in economics, for example). For the purposes of ML, however, we will mostly, though not always, stick to subspaces (defined below) of \(\mathbb R^n\). Even so, it is quite useful to be able to write and reason about these things with notation that does not require writing out the components.
Notation Ground Rules. Some conventions are in order. Lowercase, bold-face characters will denote vectors (like I did with \({\bf u}\) and \({\bf v}\) above). Scalars will also be lowercase, but not in bold; I will use Greek letters for the scalars, much of the time. Also, \(\mathsf V\) will denote the vector space containing all the vectors (for this post, at least). Other named sets of vectors (e.g. a basis that has been given a symbol) will also be in a sans serif font (e.g. \(\mathsf U\) or \(\mathsf B\)). Matrices will be uppercase and italicized (e.g. \(A\)).
By default, a vector will be a column vector (like those with brackets above). In Other Words (IOW), if it has \(n\) components, then it is essentially the same as an \(n\times 1\) matrix. This works nicely with matrix multiplication.
If I need to write out the components of a vector horizontally, within a line, I’ll use parentheses: \({\bf u} = (-2,0,1.5)\). However, you should still think of the vector as a column vector. To refer to a row vector, I’ll use either transpose notation, \((-2,0,1.5)^\top\), or will use \([-2,0,1.5]\).
If I write \([{\bf u_1}, {\bf u_2}, \ldots, {\bf u_k}]\), then this is a horizontal array of \(k\) column vectors. IOW, it is a matrix with \(k\) columns, the columns coming from each \({\bf u_i}\). 1 (Notice the square brackets, for matrix.)
Linear combinations. Say that \(\mathsf{U}=\{{\bf u_1}, {\bf u_2},\ldots,{\bf u_k}\}\) is a set of vectors in \(\mathsf V\). A linear combination of vectors in \(\mathsf U\) is a vector which equals
\(\alpha_1{\bf u_1}+\alpha_2{\bf u_2}+\ldots+\alpha_k{\bf u_k},\)
where each \(\alpha_i\) is some scalar (maybe even 0). The span of \(\mathsf U\) is written \(\text{Span}(\mathsf U)\), or \(\text{Span}({\bf u_1},{\bf u_2},\ldots,{\bf u_k})\), and is the set of all linear combinations of those vectors — that is, the vectors that can be found to be equal to a linear combination of the ones in \(\mathsf U\). The vector \(\vec{\bf 0}\), which has 0 for every component, is always in the span — choose every \(\alpha_i\) to be 0.
Example. Like in the first example, let \({\bf u} = (-2,0,1.5)\) and \({\bf v} = (1,3,1)\) be vectors in \(\mathbb R^3\). Then \((-5, 9, 9)\) is a linear combination of \({\bf u}\) and \({\bf v}\) since
\((-5,9,9) = 4{\bf u} + 3{\bf v}.\)
Linear combinations can be packaged as matrix multiplication:
\(A{\bf v} = \nu_1{\bf a_1}+\nu_2{\bf a_2}+\ldots+\nu_k{\bf a_k},\)
where \(A\) is the matrix \([{\bf a_1},{\bf a_2},\ldots,{\bf a_k}]\) and \({\bf v}=(\nu_1,\nu_2,\ldots,\nu_k)\). In terms of the example that was just done,
\(\begin{bmatrix}-5\\ 9\\ 9\end{bmatrix} = \begin{bmatrix}-2 & 1\\ 0 & 3\\ 1.5 & 1\end{bmatrix}\begin{bmatrix}4\\ 3\end{bmatrix}.\)
A set of vectors \(\mathsf U\) is called linearly independent if the only way to get \(\vec{\bf 0}\) in \(\text{Span}(\mathsf U)\) is by setting all the scalars to 0:
\(\sum_{i=1}^k\alpha_i{\bf u_i}=\vec{\bf 0}\qquad\Rightarrow\qquad\alpha_i=0\text{ for all }i.\)
Otherwise (if there is a different way to get \(\vec{\bf 0}\)), the set is called linearly dependent. There are a number of equivalent ways to think about linearly (in)dependent sets of vectors. One of them is that the set is linearly dependent if and only if one of the vectors in the set is contained in the span of the others.
Exercise. Explain why saying that “the only solution to \(A{\bf v} = \vec{\bf 0}\) is \({\bf v}=\vec{\bf 0}\)” is the same as saying “the columns of \(A\) make a linearly independent set.”
When \(\mathsf V=\mathbb R^n\), then a linearly independent set can have at most \(n\) vectors in it. For example, any set of three vectors in \(\mathbb R^2\), such as \(\{(1,0),\ (0,1),\ (1,1)\}\),
is linearly dependent. It cannot be independent since the vectors are in \(\mathbb R^2\), but there are 3 of them.
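A numerical way to test for independence (a sketch; the three vectors are the illustrative ones just mentioned) is to compare the rank of the matrix having those vectors as columns to the number of columns:
import numpy as np
# columns are the three vectors in R^2 from above
A = np.array([[1, 0, 1],
              [0, 1, 1]])
# independent exactly when the rank equals the number of columns
print(np.linalg.matrix_rank(A), A.shape[1])  # 2 and 3, so dependent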
Basis of a space. Using \(\mathsf U\) again to denote a set of vectors in a vector space \(\mathsf V\), \(\text{Span}(\mathsf U)\) is a vector space on its own (using the operations that you had already from \(\mathsf V\)): sums and scalar multiples stay within the span, and \(\vec{\bf 0}\) is in the span. This is called a subspace of \(\mathsf V\). Usually, even when working with vectors in \(\mathbb R^n\) for some \(n\), you really care about working in a subspace.
If \(A\) is an \(m\times n\) matrix, then the set of all vectors in \(\mathbb R^m\) which are equal to \(A{\bf v}\) for some \({\bf v}\), is a subspace of \(\mathbb R^m\). It is the span of the columns of \(A\) (remember the comment above about linear combinations and multiplying by a matrix), and it’s often called the column space of \(A\).
Recall the null space of \(A\), which is the set of all \({\bf v}\) so that \(A{\bf v} = \vec{\bf 0}\). This is a subspace of \(\mathbb R^n\) (when \(A\) is \(m\times n\)), though at this point in the discussion it is not readily apparent how to find a set of vectors whose span equals it.
If \(\text{Span}(\mathsf U)\) is the entire vector space \(\mathsf V\), then call \(\mathsf U\) a generating set (or spanning set) of vectors for \(\mathsf V\). The set \(\mathsf U\) is a basis if it is both linearly independent and a generating set. Informally, a basis is a generating set that has no “redundancy,” meaning that if you took out any vector from that basis, it would no longer be a generating set for \(\mathsf V\).
Example. Let \({\bf b_1}=(1,-1)\) and \({\bf b_2}=(1,2)\). Then \(\{{\bf b_1}, {\bf b_2}\}\) is a basis for \(\mathbb R^2\).
If you remember, an \(n\times n\) matrix having \(n\) pivot columns means that the equation \(A{\bf x} = {\bf y}\) has a solution, regardless of which \(n\)-vector \({\bf y}\) is on the right. This also means the null space just contains the zero vector (the only solution to \(A{\bf x} = \vec{\bf 0}\) is \({\bf x} = \vec{\bf 0}\)). The statement about \(A{\bf x}={\bf y}\) having a solution means the columns generate \(\mathbb R^n\). The statement about the null space means that the columns are linearly independent, and so if an \(n\times n\) matrix has \(n\) pivots, then its columns are a basis for \(\mathbb R^n\).
Doing elimination on the matrix \([{\bf b_1}, {\bf b_2}]\) gives
\(\begin{bmatrix}1 & 1\\ -1 & 2\end{bmatrix} \to \begin{bmatrix}1 & 1\\ 0 & 3\end{bmatrix},\)
which has 2 pivots, and so the columns are a basis of \(\mathbb R^2\).
In \(\mathbb R^n\), the vector \({\bf e_i}\) is the one whose components are all zero except in position \(i\), and the component in position \(i\) is 1. The set \(\{{\bf e_1}, {\bf e_2},\ldots,{\bf e_n}\}\) is called the standard basis of \(\mathbb R^n\). 2 This is the basis with respect to which we typically work in \(\mathbb R^n\). However, for each \(n\), there are many different choices of basis (take the columns of any invertible \(n\times n\) matrix). Using a different, not standard, basis can often be quite helpful.
Every vector space has a basis (and so does every one of its subspaces — remember, they are a vector space also). Moreover, if for some number \(d\) there is a basis of \(\mathsf V\) that has \(d\) vectors in it, then every basis of \(\mathsf V\) has exactly \(d\) vectors in it. 3 This number is called the dimension of \(\mathsf V\).
Remember that the null space of a matrix \(A\) is the set of all \(\bf v\) so that \(A{\bf v} = \vec{\bf 0}\). Remind yourself how to find a basis of the null space of a matrix. (Some help. 4)
Exercises. Find a basis for the null space of \(\begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 4 \\ 1 & 3 & 7 \\\end{bmatrix}\).
Find a basis for the plane \(3x – y + 6z = 0\) in \(\mathbb R^3\) (think of how the equation defines a null space).
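For a numerical check of exercises like these, scipy can produce an orthonormal basis of a null space (a sketch; scipy.linalg.null_space returns an orthonormal basis, so expect a scalar multiple of a hand-computed answer):
import numpy as np
from scipy.linalg import null_space
A = np.array([[1, 1, 1],
              [1, 2, 4],
              [1, 3, 7]])
N = null_space(A)  # columns form an orthonormal basis of the null space
print(N)
print(np.allclose(A @ N, 0))  # True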
Components in a given basis. Let \(\mathsf B=\{{\bf b_1}, {\bf b_2},\ldots,{\bf b_k}\}\) be a basis for \(\mathsf V\). Remember that \(\mathsf B\) is a generating set, and so every \({\bf v}\) in \(\mathsf V\) is in \(\text{Span}(\mathsf B)\). That is, fix a vector \({\bf v}\). There is a choice of \(\alpha_i\), for \(1\le i\le k\), so that
\({\bf v} = \sum_{i=1}^k\alpha_i{\bf b_i}\).
This would be true of any generating set, but since \(\mathsf B\) is linearly independent, there is only one choice for those \(\alpha_i\) scalars. Once \({\bf v}\) is chosen (as well as the basis), there is no ambiguity as to what each \(\alpha_i\) will be, and we call these the components (or coordinates) of \({\bf v}\) in the basis \(\mathsf B\).
Exercise. Let \({\bf e_1}=(1,0)\) and \({\bf e_2}=(0,1)\) be the standard basis vectors in \(\mathbb R^2\). Let \(\mathsf B=\{(7,-1), (3,1)\}\), which is a basis of \(\mathbb R^2\). Find the components of \({\bf e_1}\) in the basis \(\mathsf B\), and the components of \({\bf e_2}\) in the basis \(\mathsf B\).
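Finding components in a basis amounts to solving a linear system: put the basis vectors into the columns of a matrix \(B\) and solve \(B\boldsymbol{\alpha} = {\bf v}\). A minimal NumPy sketch for this exercise:
import numpy as np
# columns are the basis vectors of B
B = np.array([[7, 3],
              [-1, 1]])
e1 = np.array([1, 0])
e2 = np.array([0, 1])
# solve B @ alpha = e_i for the components of e_i in the basis B
print(np.linalg.solve(B, e1))  # [0.1 0.1]
print(np.linalg.solve(B, e2))  # [-0.3  0.7]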
Linear mappings and change of basis.
Remember that linear transformations (linear mappings) are functions from one vector space to another (or possibly the same vector space) which “respect” linear combinations. For example, say \(f\) is a linear mapping from \(\mathbb R^{3}\) to \(\mathbb R^{10}\).5 Then for any \(\bf u\), \(\bf v\) \(\in\mathbb R^3\), and any scalars \(\alpha, \beta\), it is true that
\(f(\alpha{\bf u} + \beta{\bf v}) = \alpha f({\bf u}) + \beta f({\bf v}).\)
For example, \({\bf x}=(3,2,-3)\) is a linear combination of \({\bf u}=(1,0,-1)\) and \({\bf v}=(0,1,0)\). So, if you know what \(f({\bf u})\) and \(f({\bf v})\) are, then
\(f({\bf x}) = f(3{\bf u} + 2{\bf v}) = 3f({\bf u}) + 2f({\bf v}),\)
and so you know where \(f\) must send \(\bf x\) based on what you already know.
Linear mappings make a very important class of functions on vector spaces; they get used a lot. The fact that they respect linear combinations means that the entire linear mapping is determined by where it sends vectors in a basis. Let \(f:\mathsf{V}\to\mathsf{W}\) be a linear mapping from \(\mathsf V\) to \(\mathsf W\). Choose a basis \(\mathsf B\) of \(\mathsf V\), and a basis \(\mathsf C\) of \(\mathsf W\). For each vector \(\bf b_j\) in \(\mathsf B\), \(f({\bf b_j})\) has components in the basis \(\mathsf C\). Record these: say that \(\mathsf C = \{{\bf c_1}, \ldots, {\bf c_m}\}\) and that \(\mathsf B=\{{\bf b_1},{\bf b_2},\ldots,{\bf b_n}\}\). For each \(j=1,\ldots,n\), suppose that
\(f({\bf b_j}) = a_{1,j}{\bf c_1} + a_{2,j}{\bf c_2} + \ldots + a_{m,j}{\bf c_m}.\)
IOW, use \(a_{i,j}\) to denote the component of \(f({\bf b_j})\) corresponding to \(\bf c_i\). Let \(A^{CB}_f\) be the \(m\times n\) matrix with entries \(a_{i,j}\). The mapping \(f\) is completely described by taking, for each \({\bf v}\) in \(\mathsf V\), the tuple of components for \(\bf v\) in the basis \(\mathsf B\), and then multiplying this tuple, as a coordinate vector, by \(A^{CB}_f\). The coordinate vector you get as a result is the tuple of components of \(f({\bf v})\) in the basis \(\mathsf C\). This description is unambiguous because, as was pointed out, the components, once a basis is fixed, are unambiguous.
When the linear mapping is from \(\mathbb R^n\) to \(\mathbb R^m\), for some \(m, n\), and the bases are the standard bases on each side, then we will simply write \(A_f\) for the matrix.
Exercise. Using \(\bf e_1, e_2, e_3\) for the standard basis vectors in \(\mathbb R^3\), write down the matrix \(A_p\) associated to the linear mapping that satisfies:
\(p({\bf e_1}+{\bf e_3}) = {\bf e_1}+{\bf e_3},\qquad p({\bf e_2}) = {\bf e_2},\qquad p(-{\bf e_1}+{\bf e_3}) = \vec{\bf 0}.\)
The mapping here is \(p:\mathbb R^3\to\mathbb R^3\), and you should use the standard basis for the “\(\mathsf B\)” and “\(\mathsf C\)” from above. 6
To get you started, note that the first column of the matrix should be the vector \((0.5,0,0.5)\).
Change of basis. You often want to be able to change the basis being used to represent the matrix. Say that \(\mathsf B_1\) is a basis of \(\mathsf V\) and \(\mathsf C_1\) a basis of \(\mathsf W\), and \(f:\mathsf V\to\mathsf W\) a linear mapping. If \(\mathsf B_2\) and \(\mathsf C_2\) are other bases (for \(\mathsf V\), \(\mathsf W\) respectively), then you can get the matrix \(A^{C_2B_2}_f\) from the matrix \(A^{C_1B_1}_f\) as follows.
The identity map \(id\) (taking every vector in \(\mathsf V\) to itself) has a matrix that switches basis representations, from \(\mathsf B_2\) to \(\mathsf B_1\). Note that this does not mean it takes a vector \(\bf b_j\) in \(\mathsf B_2\) to some vector in \(\mathsf B_1\). The vector \(\bf b_j\) is taken to itself, like all other vectors, but you figure out the components of \(\bf b_j\) in the basis \(\mathsf B_1\), and that is the column of the matrix corresponding to \(\bf b_j\). The notation would be \(A^{B_1B_2}_{id}\), but abbreviate this notation to \(I_B\). Likewise, for shorthand, write \(I_C\) for the matrix corresponding to the identity function on \(\mathsf W\) which switches from basis \(\mathsf C_2\) to basis \(\mathsf C_1\). Then
\(A^{C_2B_2}_f = I_C^{-1}A^{C_1B_1}_fI_B\).
Exercise. Let \(p\) be the mapping discussed in the previous exercise. Use the formula just found to get the matrix \(A^{B_2B_2}_p\), where
\(\mathsf B_2 = \{(1,0,1),\ (0,1,0),\ (-1,0,1)\}.\)
Hint: Writing \(\mathsf B_1\) to mean the standard basis, the matrix \(I_B\) in this case would be equal to \(\begin{bmatrix}1&0&-1\\ 0&1&0\\ 1&0&1\end{bmatrix}\).
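As a numerical check, the change of basis formula is a couple of lines in NumPy. A sketch, where the matrix A_p below is the projection consistent with the hints above (so treat its entries beyond the first column as my assumption):
import numpy as np
# A_p in the standard basis: projection onto the plane spanned by
# (1,0,1) and (0,1,0); its first column matches the hint (0.5, 0, 0.5)
A_p = np.array([[0.5, 0.0, 0.5],
                [0.0, 1.0, 0.0],
                [0.5, 0.0, 0.5]])
# columns of I_B are the B_2 vectors written in the standard basis
I_B = np.array([[1.0, 0.0, -1.0],
                [0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0]])
# A^{B2 B2}_p = I_B^{-1} A_p I_B (here the "C" basis equals the "B" basis)
print(np.linalg.inv(I_B) @ A_p @ I_B)  # diag(1, 1, 0)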
I’ll only do this when there are understood components to the vectors \({\bf u_i}\). ↩
This makes the number \(d\) a property of the vector space, rather than the particular basis. ↩
Let \(A’\) be the reduced row echelon form of the matrix. For each non-pivot column of \(A’\), make a vector with 1 in this column’s position, a variable in the position of each pivot column to the left of this non-pivot column, and 0 in all other components. Start with the closest pivot: use the row of \(A’\) containing its 1 and solve for the variable so that the product with this row is zero. Then move up a row to the next pivot and do the same, finding values of all the variables this way. These vectors, one for each non-pivot column, form a basis for the null space. ↩
In function notation, \(f:\mathbb R^3\to\mathbb R^{10}\) is linear. ↩
This mapping is a projection to a plane in \(\mathbb R^3\). ↩
In this post, I review some of the basics in probability. If you are in one of my courses, you have likely seen this before. If you haven’t though, my goal is to be thorough. I will define important terms and, at times, provide links to further reading. The need to understand some probability is essentially inescapable, even for a modest understanding of machine learning.
The Basics
Probability textbooks often use terms like “experiments” and “trials.” The authors of these texts likely choose such words for lack of a better choice. An “experiment” in probability does not necessarily mean what you would think of as a scientific experiment. For example, it could be drawing a card from a deck, or asking a person a question. It is simply the process of observing and, perhaps, recording (as data) the outcome of something that happens — usually, the thing happening is uncertain to some degree.
The thing that happens is the observation (at times, a random draw or observed value). As an initial example, suppose you were to go around asking people what US state they live in (giving them the option to say ‘Not in a US state’). In this terminology, an answer from a person is an observation. The sample space \(\Omega\) is the set of all possibilities for the outcome. So in this example, \(\Omega\) is the set containing the 51 possible answers: the 50 states and ‘Not a US state.’
An event is a subset of \(\Omega\). When \(\Omega\) is a finite set, and \(A\) is a subset of \(\Omega\) (written as \(A\subseteq \Omega\) ), then the probability of the event \(A\) is the size of \(A\) divided by the size of \(\Omega\). 1 In the example above, the subset \(A = \lbrace\)Washington, Oregon, California\(\rbrace\) is an event (which you maybe describe as the respondent “living on the west coast”). The size of \(A\) is \(3\) and the size of \(\Omega\) is \(51\). So, with the given definition of probability of an event, \(A\) has probability \(3/51 \approx 0.059\) (in other words, about \(5.9\%\)). Does this probability seem true to life? Maybe not — there are a lot of people living in California, almost 1/8 of the entire country’s population. What if, instead, the question was being posed to US senators, and the question asked what state they represent? (Throw out the ‘Not a US state’ option, so now \(\Omega\) has size \(50\).) Then the probability is \(3/50\), and this is the proportion of senators from those states.
Write \({\bf P}(A)\) for the probability of an event \(A\subseteq\Omega\). The size of a set \(A\) (the number of elements in it) is often written as \(|A|\), so we have \({\bf P}(A) = |A|/|\Omega|\).
Example 1. Suppose someone is dealt one card from the top of a shuffled deck of playing cards (to be clear, the deck has 52 cards with only one of each card). The observation is the card that is dealt. The subset representing the event that “an Ace is dealt” has four elements: \(\lbrace A\clubsuit, A\diamondsuit, A\spadesuit, A\heartsuit\rbrace\). Its probability is \(4/52 = 1/13\).
Example 2. Someone is dealt two cards, in order, from the top of a shuffled 52-card deck. Now, the observation is the ordered pair of cards dealt. Find the probability that the first card is a Queen (\(Q\)).
The sample space \(\Omega\) is all ordered pairs of cards coming from the one deck. There are just 51 cards left when the second card is drawn, so \(|\Omega| = 52\cdot 51 = 2652\). 2 An ordered pair belongs to our event (call the event \(A\)) as long as the first card in the pair is a Queen. The second card is unimportant for the event. So we should look at those pairs of cards which look like:
\((Q\clubsuit,\ ?),\quad (Q\diamondsuit,\ ?),\quad (Q\spadesuit,\ ?),\quad (Q\heartsuit,\ ?).\)
There are \(51\) possibilities for each. Also, whenever the first cards are different, then the two observations (ordered pairs) cannot be the same observation, so \(4\cdot 51\) does not double count anything and \(|A| = 204\). Thus \({\bf P}(A) = 204/2652 = 1/13 \approx 0.077\).
Exercise. Four cards were already drawn from a 52-card deck and none of them were face cards (a card that is either a King, a Queen, or a Jack). What is the probability that the next card is a face card?
Counting.
As already seen in the previous example, some facts about determining the size of a set are needed. You likely encountered these in an early math course.
Suppose an ordered pair has \(m\) possibilities for its first object and, regardless of what the first object is, there are \(n\) possibilities for the second object. Then there are \(mn\) different ordered pairs. 3 In this general setting, the number of possibilities for the first and second object might have nothing to do with each other. For example, you roll a six-sided die (6 possibilities) and then you draw a playing card (52 possibilities).
If it is the case that the second object comes from what remains after the first object was observed (like in the example of drawing two cards in succession), then \(n = m – 1\). This thinking can be applied multiple times. For example, if you were to draw \(r=4\) cards in succession from a 52-card deck (with each card distinct, \(m=52\)), then there are \(m(m-1)(m-2)(m-3) = (52)(51)(50)(49)\) possible outcomes. The series of four draws is an ordered tuple: \((card1, card2, card3, card4)\). In terms of factorials, the number of possibilities is \(\dfrac{52!}{48!}\); generally, drawing \(r\) items in order from \(m\) distinct items gives \(\dfrac{m!}{(m-r)!}\) possibilities.
Finally, there are experiments where you do not care about the order. To count the possibilities in this case, you divide the count in the previous paragraph by the number of different orders. For example, you might want to know the probability of being dealt a certain set of 5 cards from a deck of playing cards, and only care about which cards are in your hand. For each set of 5 cards, there are \(5!\) different ways to order it. 4 Since there are that many ways to order, the number of different sets of 5 cards is (num. of ordered 5-tuples) divided by \(5!\).
In general, the number of possibilities for choosing \(r\) items (ignoring order), from a set of \(m\) items, is
\(\dfrac{m!}{(m-r)!r!} = \binom{m}{r}\).
In that formula, it is essential that the items are distinct from each other. If you had a deck of cards with two Aces of hearts, and no Ace of diamonds, then the number of possible 5-card hands would not be \(\binom{52}{5}\). Using a smaller size deck, this is quite easy to see. Say a deck has three cards, two of which have a 1 printed on them (and are indistinguishable) and the third card has a 2 printed on it. The only two-card hands that can be drawn are: \(\lbrace 1,1\rbrace\) and \(\lbrace 1,2\rbrace\). So, in this non-unique item case, there are not \(\binom{3}{2} = 3\) possibilities.
Here is an example of these counting techniques at work when computing a probability.
Example 3. Suppose two cards are dealt randomly from a single 52-card playing deck. Compute the probability that at least one of the cards is a Queen.
It doesn’t matter whether a Queen is dealt as the first or the second card (or even as both). The sample space \(\Omega\) is the collection of “hands” of two playing cards (which cannot be identical); the event \(A\) contains those hands with at least one Queen.
A Queen in the hand means one of \(Q\clubsuit, Q\diamondsuit, Q\spadesuit, Q\heartsuit\). If there is exactly one Queen, then there are only 48 possibilities for the other card. There are \(4\cdot 48\) such outcomes (the suit of the Queen changes, so this doesn’t double count). If both cards in the hand are Queens, they must come from the set of 4 Queens. There are \(\binom{4}{2}=6\) such possibilities.
Since \(|\Omega| = \binom{52}{2} = 26\cdot 51 = 1326\),
\({\bf P}(A) = \dfrac{4\cdot 48 + 6}{1326} = \dfrac{198}{1326} \approx 0.149.\)
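If you'd like to double-check a count like this one, a quick Monte Carlo simulation works well (a sketch; the Queens are just four designated values among 52):
import numpy as np
rng = np.random.default_rng(0)
queens = {0, 1, 2, 3}  # designate 4 of the 52 cards as the Queens
trials = 200_000
hits = 0
for _ in range(trials):
    hand = rng.choice(52, size=2, replace=False)  # a random 2-card hand
    if queens & set(hand):
        hits += 1
print(hits / trials)  # approximately 0.149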
Given any sample space \(\Omega\) (possibly infinite), a probability function \({\bf P}\) always assigns a number to any event so that 3 conditions are met.
\({\bf P}(A) \geq 0\) for every \(A\subseteq\Omega\)
\({\bf P}(\Omega) = 1\).
If \(A_1,A_2,\ldots,A_n\) are subsets of \(\Omega\) so that \(A_i\cap A_j = \emptyset\) whenever \(i\neq j\), then
\({\bf P}(A_1\cup A_2\cup\ldots\cup A_n) = \sum_{i=1}^n {\bf P}(A_i). \)
When a collection of sets is like \(A_1,A_2,\ldots,A_n\) in the third bullet, so that \(A_i\cap A_j = \emptyset\) when \(i\neq j\), it is common to call them either mutually exclusive or pairwise disjoint.
The finite case we have been discussing, where we define \({\bf P}(A) = |A|/|\Omega|\), satisfies the conditions. Checking this claim is straightforward, at least for the first two bullet points. For the third bullet point, if \(n=2\) (so you only have \(A_1, A_2\)) then the fact that \(A_1\) and \(A_2\) have no common points means that \(|A_1\cup A_2| = |A_1| + |A_2|\). The identity in the third bullet follows from this and an induction argument. Note, the equation in the third bullet will not hold if any pair \(A_i, A_j\) have a common point. 5
For a subset \(A\subseteq \Omega \), write \(\bar A\) for the complement – the set of points in \(\Omega\) but not in \(A\). Then, since \(A \cap \bar A = \emptyset\) and \(A\cup\bar A = \Omega\), you get that
\(1 = {\bf P}(\Omega) = {\bf P}(A\cup\bar A) = {\bf P}(A) + {\bf P}(\bar A).\)
And so it must be that \({\bf P}(\bar A) = 1 – {\bf P}(A)\), regardless of the probability function. This observation can be really helpful for simplifying computations.
For example, recall the question of Example 3. We could consider \(\bar A\) instead — not getting any Queens — and get \({\bf P}(A)\) by finding \(1-{\bf P}(\bar A)\). If the hand does not have any Queens, then both cards come from a set of 48 cards, meaning that \(|\bar A| = \binom{48}{2}\). And so
\({\bf P}(A) = 1 - \dfrac{\binom{48}{2}}{\binom{52}{2}} = 1 - \dfrac{1128}{1326} = \dfrac{198}{1326} \approx 0.149.\)
Sometimes you want to know the probability of one event OR another event occurring (that is, the probability of the union). If the events are mutually exclusive (as defined above), then the probability of the union is the sum of the individual probabilities. However, in general, if \(A_1\) and \(A_2\) are two events then
\({\bf P}(A_1\cup A_2) = {\bf P}(A_1) + {\bf P}(A_2) - {\bf P}(A_1\cap A_2).\)
For the given definition of probability, when \(\Omega\) is finite, deriving the equation above is the same as checking that \(|A_1\cup A_2| = |A_1| + |A_2| – |A_1\cap A_2|\). The intuitive reason this is true is that each point that is in both subsets will only get counted once in \(|A_1\cup A_2|\), but will get counted twice when finding \(|A_1|+|A_2|\). A more careful proof could be given by induction on the size of \(A_1\cap A_2\).
Example 4. Suppose two cards are dealt randomly from a single 52-card playing deck. Compute the probability that at least one of the cards is either a Queen or a King.
Let \(A\) be the event that at least one card is a Queen, and let \(B\) be the event that at least one is a King. In Example 3 above, we computed that \({\bf P}(A) \approx 0.149\). Since the number of Kings in the deck is the same as the number of Queens, it is obvious that \({\bf P}(B) = {\bf P}(A)\).
Consider the set \(A\cap B\). This is the set of two-card hands where there is at least one Queen AND at least one King. So the hand has the Queen of some suit and the King of some suit (4 choices for each one). The number of such hands is \(4\cdot 4 = 16\). This means that
\({\bf P}(A\cup B) = {\bf P}(A) + {\bf P}(B) - {\bf P}(A\cap B) = \dfrac{198}{1326} + \dfrac{198}{1326} - \dfrac{16}{1326} = \dfrac{380}{1326} \approx 0.287.\)
It is so, so, so important to remember that the probability of an event represents its likelihood before the observation. Said a bit differently: if an event has already been observed, then its probability is always the same: it has probability 1. Period. Because, well, it happened. From the point of view of the sample space, if you have the outcome already, it is the only possible outcome, so it is the only thing in \(\Omega\).
This makes it natural to think about the following situation: what if you already know that some other event occurred? Does it change the probability of the event you were originally interested in? Yes, sometimes!
If you know event \(B\) happened, your observation must come from within \(B\) because, with that knowledge, the only possible outcomes are those within \(B\).
You are still interested in \(A\) though. Since your observation must come from \(B\), any outcomes in \(\Omega\) that are not in \(B\) are not possible. Therefore, you should just divide by \(|B|\). Moreover, you should only consider the possibilities in \(A\) that are also in \(B\), i.e. the outcomes in \(A\cap B\). So, the probability of \(A\) given \(B\) is:
\({\bf P}(A | B) = \dfrac{|A\cap B|}{|B|} = \dfrac{{\bf P}(A\cap B)}{{\bf P}(B)}.\)
Note how this makes it so that if every outcome in \(B\) is also in \(A\), then \({\bf P}(A | B) = 1\) (since, in this case, \(A\cap B = B\)).
Exercise. From a shuffled 52-card deck, you deal someone a card, so that you cannot see what it is. That person tells you the card is a face card (remember, this means a King, Queen, or Jack). Given that the card is a face card, find the probability that
the card is the Queen of hearts
the card is a red King (red meaning it is the heart suit or diamond suit)
the card is of the heart suit.
Two events \(A\) and \(B\) are called independent if \({\bf P}(A | B) = {\bf P}(A)\). In words, this says that whether \(B\) occurs or not, the probability of \(A\) would be unchanged. By the way, this implies that the probability of \(B\) is also unchanged by knowing whether \(A\) occurs, since
\({\bf P}(A) = {\bf P}(A | B) = \dfrac{{\bf P}(A\cap B)}{{\bf P}(B)}\)
means that \({\bf P}(B) = \dfrac{{\bf P}(A\cap B)}{{\bf P}(A)} = {\bf P}(B | A)\). Note, that events \(A\) and \(B\) are independent if and only if 6 \({\bf P}(A\cap B) = {\bf P}(A){\bf P}(B)\) also.
Exercise. One card is dealt from a (shuffled) 52-card playing card deck. Which of the following events are independent?
\(A = \) the card is a heart; \(B = \) the card is a face card.
\(A = \) the card is a 9; \(B = \) the card is an Ace.
Exercise. Two cards are dealt from a (shuffled) 52-card playing card deck. Which of the following events are independent? (Read “one card is…” as meaning exactly one card.)
\(A = \) one card is a 2; \(B = \) one card is a Queen.
\(A = \) one card is a heart; \(B = \) one card is an Ace.
There is an identity that goes by the name of Bayes’ Theorem. While it is a rather simple rearrangement of the definition of conditional probabilities, it is a very useful identity to keep in mind when using data to inform your belief (more precisely, to update your belief) of how likely something is. This is called using Bayesian inference, and I will write more on this later. The identity says that
\({\bf P}(A | B) = \dfrac{{\bf P}(B | A)\,{\bf P}(A)}{{\bf P}(B)}.\)
To see that the identity is true, you can simply write the definition of \({\bf P}(B | A) = {\bf P}(A\cap B)/{\bf P}(A)\) and the definition of \({\bf P}(A | B) = {\bf P}(A\cap B)/{\bf P}(B)\).
As an example of using Bayes’ theorem, let’s see how it works in a game.
Someone tells you they have three coins. There is a “heads” coin, which has heads on both sides. There is a “tails” coin with tails on both sides. Finally, the third coin is normal, one side heads and one side tails. You check and see they are telling the truth, and that there is no way to distinguish any particular heads side from another (and similarly with tails).
After letting you examine the coins, they mix up the three coins in a bag. Then they pick one of the coins and slam it down on a table. It is showing a heads side.
“Let’s make this interesting,” they say. “If this is the “heads” coin, with heads on the other side, then you pay me $20. If it is the normal coin, though, then I’ll pay you $30. Deal?”
Do you take the deal?
The answer comes down to the conditional probability \({\bf P}(\textrm{“heads” coin} | \textrm{heads shows})\). That is, let \(HH\) be the event that the coin is the heads coin and let \(H\) be the (given) event, that there is a head side showing on the table. Then we want to know \({\bf P}(HH | H)\). If the probability is 1/2, then you should take the deal, since, in that case, on average you would win half the time and the reward is larger than the loss. However, …
Note the total probabilities \({\bf P}(HH) = 1/3\) and \({\bf P}(H) = 1/2\), the latter since the total number of sides of coins is 6, and 3 of them are heads. Also, we know that \({\bf P}(H | HH) = 1\) since, if the coin has heads on both sides, it has to show heads. By Bayes’ Theorem,
\({\bf P}(HH | H) = \dfrac{{\bf P}(H | HH)\,{\bf P}(HH)}{{\bf P}(H)} = \dfrac{1\cdot(1/3)}{1/2} = \dfrac{2}{3}.\)
In other words, given that a heads side is showing, it is a 2/3 chance that the coin is the “heads” coin. But then, on average, your reward is \((2/3)(-20)+(1/3)(30) = -10/3\) dollars. You’ll lose money on average, so you don’t take the deal.
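If the 2/3 seems surprising, a simulation is convincing (a minimal sketch; each coin is encoded by its pair of sides):
import numpy as np
rng = np.random.default_rng(0)
coins = [("H", "H"), ("T", "T"), ("H", "T")]
heads_showing = 0
heads_coin = 0
for _ in range(100_000):
    coin = coins[rng.integers(3)]  # draw a coin from the bag
    up = rng.integers(2)           # which side lands face-up
    if coin[up] == "H":            # condition on a heads side showing
        heads_showing += 1
        if coin == ("H", "H"):
            heads_coin += 1
print(heads_coin / heads_showing)  # approximately 2/3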
At least, this is very often how probability is defined when the sample space is finite. It is, however, not necessary (and, in cases, not the best choice) to define the probability this way. ↩
You don’t put the first card back before the second draw! ↩
In the notation of sets, if \(U, V\) are finite sets with \(|U|=m\) and \(|V|=n\), then \(|U\times V|=mn\). ↩
Each way is an ordered tuple of 5 cards, from just 5 possible cards. So there are \(5!/(5-5)! = 5!\) possibilities. ↩
For the probability function given on finite sets. ↩
Implicit to our discussion of independent events is the requirement that both events have non-zero probability. ↩
Decision trees are another example of a powerful model function for machine learning. They are often used for binary classification, or for multi-classification (a classification problem where there are multiple categorical labels, rather than just two). Similar to how neural networks are capable of producing any function, if the “size” of decision trees is unrestrained 1 then they can produce an arbitrarily close approximation to any desired classification of points in the input space.
1. And computation time considerations are ignored.
Construction of a decision tree. Decision trees are strongly related to decision stumps. In fact, a decision stump is a decision tree with depth 1 (the depth of a tree is defined below). In some circumstances, it is possible to think of a decision tree as a collection of nested decision stumps — on smaller and smaller subsets of the data.
Often when decision trees are discussed, the data is converted to points which have binary coordinates (either 0 or 1); however, it is not necessary. Here I will discuss data points with \( d\) numerical coordinates, in \( \mathbb R\), as I have before. If you wish, you can translate to the binary setting by using 0 or 1 to represent whether or not the given data is larger than some threshold in each coordinate.
You begin by splitting the data along some coordinate. In other words, you find a hyperplane \( x_j = b\), where \( 1\le j\le d\) and \(b\) is some number and each data point is on one side: for each \({\bf x}_i \in S\), either \( x_{i,j} < b\) or \( x_{i,j} > b\). This partitions the data into two subsets.
The goal of the process is to recursively partition each side, and ultimately to have points in the same partition have the same label.2 At each split some criterion is used to try to make that happen. Once the split is chosen, you restrict to the two subsets of data points on each side and repeat the process on each subset. Unless other initial choices are made, you continue until a subset (in some region in the original space) has only one label. This region then corresponds to a leaf of the tree, labeled with that one label of the points.3
Below is a potential set of data for binary classification, with only 4 negatively-labeled points (the red points). If the first split made is along the horizontal line \( x_2 = -1.8\) (so \( j = 2\) and \(b = -1.8\)), then all but one point above the line is blue and 3 of the 12 points below the line are red.
Call the top region above the line the positive region of the split; below the line is the negative region. Continuing with nested splits into each side, one could imagine using two more splits in the positive region to isolate the one red point. A way to isolate the three red points in the negative side is with three additional nested splits on that side (see the figure below).
I have drawn the corresponding decision tree to the right, where the dashed lines in the figure correspond to branchings of the tree. The branches of the tree are labeled with + or -, corresponding to the side of the line one is following (in each split for this tree, the negative side is the one side that contains red points).
Each leaf of the tree is a node where there are no further splittings beyond it, labeled by the labels of points in that region. The depth of a decision tree is defined to be the maximal number of branches that can be followed from the top node to a leaf. So the depth of the example decision tree is 4.
The resulting model function. Start by determining a decision tree on training data, as above. Then, given test data (not yet “seen” by the model), the decision tree model will check in which partition the test point resides. Then, it labels the test point with the label of the corresponding leaf. Often, in an effort to avoid overfitting, you decide beforehand on the depth of the tree. Since this will result in having more than one label in some of the partitions (and possibly all), the label given to a test point in each leaf is some function of the training labels in the corresponding partition — if the goal is classification, with some categorical labels, the mode could be used; if the goal is regression, with numerical labels, the mean, or the median, could be used.
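In scikit-learn (already used above for the Iris data), this train-then-label procedure is only a few lines. A minimal sketch, with the depth of the tree fixed beforehand as discussed:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X, y = iris['data'], iris['target']
# fix the depth beforehand, to help avoid overfitting
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y)
# a test point is labeled by the leaf (region) in which it lands
print(tree.predict(X[:1]))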
2. In the unrestricted case, at least.
3. This describes the basic approach in a classification problem. Notice that it is not at all necessary to have only two labels (more labels will cause the leaves to have more than two possible labels). One can also use decision trees for modeling/regression. However, for it to be sensible in that case, you should stop splitting at some point (probably before there is just one label in each partition) and use the average, perhaps, of the labels in a partition in order to give a leaf a label.
How to decide about the splits. Deciding which split (decision stump) to use at each step in building the tree can be done in a number of ways.
One method, in the setting of binary classification, uses ideas that we have already built up. You can use the ERM procedure to find the decision stump with least empirical risk at each step (on the subset of training points in that region). The example set of points above is a bad one for this approach, since the proportion of red points is so small compared to blue points.
Another method is to calculate a quantity called the Information Gain of each possible split (the quantity \( IG\) defined below). You then choose the split that has the maximal \( IG\). Information Gain is calculated as follows.
First, define a function \( e\), called the entropy function, where
\( e(r) = -r\log(r) – (1-r)\log(1-r)\)
for any number \( 0 < r < 1\). When \( r = 0\) or \( r = 1\), define \( e(r) = 0\). 4 5
Now, to define \( IG\) of a split, first let \( r = {\bf P}(y=1)\), the probability that the label is 1 (in the particular subset you have restricted to before the split). Let \( m\) be the number of points under consideration. In our example above, and after the first split along \( x_2 = -1.8\) was made, say we are in the positive side where there is 1 red point (negatively labeled), and are trying to figure out where to make the second splitting. Then \( m = 38\) (there are \( 38\) points above the horizontal line), and \(r = 37/38\).
Next, for a potential splitting, let \(r_+ = {\bf P}(y=1\ |\ x\text{ on the + side})\) be the fraction of blue (+1 labeled) points on the positive side of that split. Similarly, define \(r_- = {\bf P}(y=1\ |\ x\text{ on the – side})\). Also, let \( m_{\pm}\) be the total number of points on each side. In the continuing example, say the potential split is along the line \( x_1 =-3.9\) (one of the second level splits depicted in the figure above). Then \(r_+ = 35/35 = 1\) and \(r_- = 2/3\).
Now, with \(r, r_+,\) and \(r_-\) defined, the information gain of a split \( s\) is \( IG(s) = e(r) - \left(\frac{m_+}{m}e(r_+) + \frac{m_-}{m}e(r_-)\right)\). Note that it actually doesn’t matter which side of the line is the positive or negative side, just where the splitting line is. For the continuing example, the information gain is
\( IG = e(37/38) - \left(\dfrac{35}{38}e(1) + \dfrac{3}{38}e(2/3)\right) = e(37/38) - \dfrac{3}{38}e(2/3) \approx 0.103,\)
using base 2 logarithms.
If you were to use \( IG\) to build a decision tree, then at each step you must find the split, on the subset of the data under consideration, that has maximal information gain.
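Here is a sketch of these formulas in Python (the function names entropy and info_gain are mine), reproducing the numbers from the continuing example:
import numpy as np
def entropy(r):
    # base-2 entropy of a proportion; defined to be 0 at r = 0 or r = 1
    if r == 0 or r == 1:
        return 0.0
    return -r * np.log2(r) - (1 - r) * np.log2(1 - r)
def info_gain(r, r_plus, r_minus, m_plus, m_minus):
    m = m_plus + m_minus
    return entropy(r) - (m_plus / m * entropy(r_plus) + m_minus / m * entropy(r_minus))
# 38 points, 37 of them blue; the split leaves 35 (all blue) on the
# positive side and 3 (two blue, one red) on the negative side
print(info_gain(37/38, 1.0, 2/3, 35, 3))  # approximately 0.103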
4. It is important to consider this a definition, not a consequence of the logarithmic expression given above. While you can get \( 0\) from a limit of that expression as \( r\to 0^+\), it is not equal to its value. Thinking this way might not lead to trouble if you only do so for the purpose of calculating a single entropy value. However, if you attempt to do any arithmetic and use properties of logarithms, thinking of \( 0\log 0\) as being equal to \( 0\) will easily lead to statements like \( 1 = 2\).
5. The logarithm function can have any base. Most of the time, in discussions of Information Gain, the base is 2 due to the area of study that the idea comes from.