In this post I will discuss expectation of random variables, as well as the variance of a random variable. For notation and conventions, please refer to my earlier post on random variables.
The expected value of discrete random variables
Consider a discrete random variable \(X:\Omega \to \mathcal T\), with an associated probability function \(p_X\). By definition, the expected value of \(X\) (or expectation, denoted \(\mathbb E[X]\)) is given by the following sum. 1
\(\displaystyle\mathbb E[X] = \sum_{x\in\mathcal T}x\ p_X(x).\)
From one perspective, this is simply the definition and so there is no need for justifying it. However, in order to have some intuition it is helpful to connect it with the more familiar notion of an average, and this is especially clear for a special case (when \(p_X\) is particularly simple).
Example 0. Let \(X\) be a uniformly distributed discrete random variable: suppose that \(|\mathcal T| = m\) and \(p_X(x) = \frac{1}{m}\) for all \(x\in\mathcal T\) (e.g. one roll of a “fair” die, or the flip of a “fair” coin, or what people running polls are theoretically supposed to do when taking a sample). Then \(\mathbb E[X]\) is the average of the numbers in \(\mathcal T\).
Example 1. Recall the example discussed in my initial post on random variables, considering a random variable \(X\) equal to the maximum number showing upon rolling two 6-sided dice. The probability mass function (pmf) for this \(X\) is given by:
\(x\) | 1 | 2 | 3 | 4 | 5 | 6 |
\(p_X(x)\) | 1/36 | 3/36 | 5/36 | 7/36 | 9/36 | 11/36 |
And so, \(\displaystyle\mathbb E[X] = \sum_{i=1}^6 i\ p_X(i) = \dfrac{161}{36} \approx 4.47.\)
In Example 1, the expected value is not the average of the possible values of \(X\) (average of \(\{1,2,3,4,5,6\}\) is \(3.5\)). An often-used phrase is that this is a weighted average, where \(p_X\) is determining the weight of each value of \(X\). 2 Considering the origin of the probabilities \(p_X(i)\), the sample space \(\Omega\) had 36 outcomes, and the numerator in \(p_X(i)\) is the number of outcomes with maximum number \(= i\). Each outcome produces a max: for example, \(\text{max}=6\) comes from 11 different outcomes, and \(\text{max}=5\) from 9 different outcomes, etc. The simple average of these is
\(\dfrac{11\cdot 6 + 9\cdot 5 + \cdots + 1\cdot 1}{36} = \dfrac{161}{36},\)
which is precisely \(\mathbb E[X]\). Takeaway: when the probability on \(\Omega\) is defined by \({\bf P}(A) = |A|/|\Omega|\), the expected value of \(X\) is nothing more than the simple average of the numbers \(X(\omega)\), letting \(\omega\) range over \(\Omega\).
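The takeaway above can be checked by brute force. The following is a minimal sketch in Python (the variable names `omega`, `pmf`, `weighted` and `simple` are my own choices for illustration): it enumerates the 36 outcomes, builds \(p_X\) by counting, and compares the weighted average over \(\mathcal T\) with the simple average over \(\Omega\).

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two six-sided dice.
omega = list(product(range(1, 7), repeat=2))

# pmf of X = max of the two dice, built by counting outcomes.
pmf = {}
for a, b in omega:
    x = max(a, b)
    pmf[x] = pmf.get(x, 0) + Fraction(1, len(omega))

# Weighted average over the target space T = {1, ..., 6}.
weighted = sum(x * p for x, p in pmf.items())

# Simple average of X(omega), letting omega range over the sample space.
simple = Fraction(sum(max(a, b) for a, b in omega), len(omega))

print(weighted, simple)  # both are 161/36
```

Using exact fractions rather than floats makes the equality of the two averages an exact statement instead of an approximate one.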
The expected value of continuous random variables
Let \(X:\Omega \to \mathcal T\) be a continuous random variable with an associated probability density function \(p_X\). As in the previous post, we want to consider only some “nice” subsets of \(\mathbb R\) for the target space \(\mathcal T\) of \(X\). For now, it is enough to assume that either \(\mathcal T = [a,b]\), an interval with \(a < b\), or \(\mathcal T = \mathbb R\).
The expected value of \(X:\Omega \to [a,b]\) is \(\mathbb E[X] = \int_a^b x\,p_X(x)\ dx\). 3 As in the discrete case, it is appropriate to think of this number as the weighted average of the values that \(X\) might take.
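To make the integral concrete, here is a rough numerical sketch in Python. The helper `expectation` and the density \(p(x) = 2x\) on \([0,1]\) are assumptions chosen only for illustration (they do not come from the post); the integral is approximated with a plain midpoint rule.

```python
def expectation(density, a, b, n=100_000):
    """Approximate the integral of x * density(x) over [a, b] by the midpoint rule."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width  # midpoint of the i-th subinterval
        total += x * density(x) * width
    return total


# Illustrative density (an assumption for this sketch): p(x) = 2x on [0, 1].
def density(x):
    return 2.0 * x


mean = expectation(density, 0.0, 1.0)
print(mean)  # close to 2/3, the exact value of the integral of 2x^2 over [0, 1]
```

The same helper can be pointed at other densities on a bounded interval to sanity-check a hand computation.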
Exercise 1. Compute the necessary integral to verify that if \(X\) is uniformly distributed on \([a,b]\), then \(\mathbb E[X] = (a+b)/2\).
Exercise 2. Compute the necessary integral to verify that if \(X \sim \mathcal N(\mu, \sigma^2)\) is normal, with \(\mathcal T = \mathbb R\), then \(\mathbb E[X] = \mu\). [Two separate substitutions, or one substitution followed by thinking about odd functions, will succeed.]
Exercise 3. Given a number \(\mu > 0\), define \(p(x) = \frac{1}{\mu}e^{-x/\mu}\) for all \(x\ge 0\). Let \(\mathcal T = [0,\infty)\). Then, for some sample space \(\Omega\), a random variable \(X:\Omega \to\mathcal T\) is called exponentially distributed if there is some \(\mu > 0\) so that \(p_X = p\). 4 Find the expected value of such a random variable.
New random variables from old
Say you have a random variable \(X:\Omega\to\mathcal T\), either discrete or continuous (but sticking to target spaces that are subsets of \(\mathbb R\)). Say you also have a function \(f: \mathbb R\to\mathbb R\). 5 By defining \(Y(\omega) = f(X(\omega))\), for every \(\omega\in\Omega\), you get a new random variable. The target space of \(Y\) is \(f(\mathcal T)\).
Even if you don’t have an explicit understanding of \(p_Y\), you can still figure out \(\mathbb E[Y]\) by using the distribution of \(X\). To begin, say that \(X\) is a discrete random variable (with finite \(\mathcal T\)). Then \(f(\mathcal T)\) is finite and you have \(\mathbb E[Y] = \sum_{y_i\in f(\mathcal T)}y_ip_Y(y_i)\). However, note that
\(\{\omega\in \Omega\ |\ Y(\omega) = y_i\} = \{\omega\in \Omega\ |\ f(X(\omega)) = y_i\}\)
\(= \{\omega\in \Omega\ |\ \text{there is }x \text{ with } X(\omega)=x \text{ and } f(x)=y_i\}.\)
Let \(y_i\in f(\mathcal T)\) and \(\mathcal T = \{x_1,x_2,\ldots,x_n\}\). If \(x_j\ne x_k\) then \(X^{-1}(\{x_j\})\) and \(X^{-1}(\{x_k\})\) are mutually exclusive, and so \(\displaystyle p_Y(y_i) = \sum_{j: f(x_j)=y_i}p_X(x_j)\). Using this equation, and using that an element in \(\mathcal T\) can only be in the preimage of a single element of \(f(\mathcal T)\), and also rearranging summands, we can rewrite the expected value of \(Y\):
\(\displaystyle\mathbb E[Y] = \sum_{y_i\in f(\mathcal T)}y_ip_Y(y_i) = \sum_{y_i\in f(\mathcal T)}y_i\sum_{j: f(x_j) = y_i}p_X(x_j) = \sum_{x_j\in \mathcal T}f(x_j)p_X(x_j).\)
Let’s summarize what has been done. If \(X\) is a discrete random variable and \(Y = f(X)\) for a function \(f\), then we have
\(\mathbb E[X] = \sum_{x_j\in \mathcal T}x_jp_X(x_j) \qquad\text{and}\qquad \mathbb E[Y] = \sum_{x_j\in \mathcal T}f(x_j)p_X(x_j).\)
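As a small check of this summary, the sketch below computes \(\mathbb E[Y]\) both ways for the dice example. The function \(f(x) = (x-3)^2\) is a hypothetical choice of mine, picked because it is not injective, so that several \(x_j\) share the same \(y_i\).

```python
from fractions import Fraction

# pmf of X = max of two dice, from Example 1: p_X(x) = (2x - 1)/36.
p_X = {x: Fraction(2 * x - 1, 36) for x in range(1, 7)}


def f(x):
    # Illustrative, non-injective choice of f (an assumption for this sketch).
    return (x - 3) ** 2


# Route 1: build p_Y by summing p_X over each preimage, then use the definition.
p_Y = {}
for x, p in p_X.items():
    p_Y[f(x)] = p_Y.get(f(x), 0) + p
route_1 = sum(y * p for y, p in p_Y.items())

# Route 2: sum f(x) p_X(x) directly over the target space of X.
route_2 = sum(f(x) * p for x, p in p_X.items())

print(route_1, route_2)  # both are 149/36
```

Route 2 never needs \(p_Y\) at all, which is the practical point of the formula.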
Very nice. Now consider a continuous random variable \(X\) and another one \(Y = f(X)\) for some function \(f\). It would be desirable to have
\(\displaystyle\mathbb E[Y] = \int_\mathcal Tf(x)p_X(x)\ dx.\)
In fact this does work. 6 I will give an explanation for why in a later section. The slogan is that using continuous variables involves a limit of what happens with discrete random variables.
Exercise 4. Let \(X\) be a continuous random variable, uniformly distributed over \([a,b]\). Find the expected value of \(Y = X^2\). [Your answer should tell you that \(\mathbb E[Y]\) has the following values for the corresponding intervals.]
\([a,b]\) | \([-1,1]\) | \([1,2]\) | \([1,10]\) |
\(\mathbb E[X^2]\) | 1/3 | 7/3 | 37 |
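If you want to check the table without doing the integral by hand, a numerical approximation is enough. This is a sketch only, using a midpoint rule and a helper name (`expected_square_uniform`) that I made up for the purpose.

```python
def expected_square_uniform(a, b, n=100_000):
    """Midpoint-rule approximation of the integral of x^2 / (b - a) over [a, b]."""
    width = (b - a) / n
    density = 1.0 / (b - a)  # uniform density on [a, b]
    return sum((a + (i + 0.5) * width) ** 2 * density * width for i in range(n))


for a, b in [(-1, 1), (1, 2), (1, 10)]:
    print((a, b), round(expected_square_uniform(a, b), 6))
```

The printed values should match the table entries \(1/3\), \(7/3\) and \(37\) up to rounding.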
Variance
Fix some number \(\mu\). Given a random variable \(X\), define a new random variable \(V = (X - \mu)^2\). If you choose \(\mu = \mathbb E[X]\), then the variance of \(X\) is given by \(\mathbb E[V]\), which is equal to the following in the discrete and continuous cases, respectively:
\(\displaystyle(\text{Discrete }X):\ \sum_{x\in\mathcal T}(x - \mu)^2\ p_X(x) \qquad (\text{Continuous }X):\ \int (x - \mu)^2\ p_X(x)\ dx.\)
When \(X\) is a random variable that has a normal distribution \(\mathcal N(\mu, \sigma^2)\), then the number \(\sigma^2\) in the notation is the variance of \(X\). Variance is a way to measure how diffuse (or “spread out”) the random variable \(X\) is. The larger the variance, the more it is spread out. Since it uses squared differences from the mean \(\mu=\mathbb E[X]\), it is not equal to the expected (average) distance from \(X\) to its mean. 7 The average distance to the mean would be equal to the expected value of the random variable \(|X - \mu|\).
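To see that these really are three different numbers, here is a short sketch using the dice example from Example 1. It computes the variance, the standard deviation and the average distance to the mean; the variable names are mine.

```python
from fractions import Fraction
from math import sqrt

# pmf of X = max of two dice, from Example 1: p_X(x) = (2x - 1)/36.
p_X = {x: Fraction(2 * x - 1, 36) for x in range(1, 7)}

mu = sum(x * p for x, p in p_X.items())                      # E[X]
variance = sum((x - mu) ** 2 * p for x, p in p_X.items())    # E[(X - mu)^2]
mean_abs_dev = sum(abs(x - mu) * p for x, p in p_X.items())  # E[|X - mu|]

print(variance)      # 2555/1296
print(mean_abs_dev)  # 97/81
print(sqrt(variance), float(mean_abs_dev))
```

Here the average distance to the mean, \(97/81\), is smaller than the standard deviation, and neither equals the variance.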
Exercise 5. Let \(X\) be a uniformly distributed random variable over the interval \([a, b]\). For the length of the interval, write \(\ell = b - a\), and notice that \(b - \mathbb E[X] = \mathbb E[X] - a = \ell/2\). Compute the variance of \(X\); note that it can be written simply as a function of \(\ell\).
Expected value of function of a random variable: Rationale
🐉🐉 Given a continuous random variable \(X\) with target space \(\mathcal T\), and function \(f\) that has \(\mathcal T\) in its domain, define \(Y = f(X)\). For simplicity assume that \(\mathcal T\) equals an interval \([a,b]\), though the reasoning here would extend to other cases. In any case, I also assume that \(f(x)p_X(x)\) is integrable over \(\mathcal T\).
Given a number \(n\in\mathbb N\), choose a subdivision \(P_n\) of the interval: \(a = x_0 < x_1 < \ldots < x_n = b\), with a set of points \(x_i^*\), where \(x_{i-1} \le x_i^* \le x_i\) for \(i = 1,2,\ldots, n\). Define \(\mathcal T(P_n) = \{x_i^*\ |\ i = 1,2,\ldots, n\}\).
With this choice of points and subdivision, a discrete random variable \(X_n:\Omega\to\mathcal T(P_n)\) can be defined as follows. For \(\omega\in\Omega\), there is a unique \(i\) such that \(x_{i-1}\le X(\omega) < x_i\) (in the case \(X(\omega) = x_n = b\), let this \(i = n\)). Using that unique \(i\) for \(\omega\), define \(X_n(\omega) = x_i^*\). Then, define \(Y_n(\omega) = f(X_n(\omega))\).
Note that \(\displaystyle p_{X_n}(x_i^*) = {\bf P}(x_{i-1}\le X < x_i) = F_X(x_i) - F_X(x_{i-1})\). And so, using the formula we have for expectation of a discrete random variable,
\(\displaystyle\mathbb E[Y_n] = \sum_{i=1}^n f(x_i^*)p_{X_n}(x_i^*) = \sum_{i=1}^n f(x_i^*)(F_X(x_i) - F_X(x_{i-1}))\).
If you replace \((F_X(x_i) - F_X(x_{i-1}))\) in the above sum with \(\Delta_i F_X\), then this sum becomes recognizable as the Riemann sum for \(P_n\) that would have limit equal to \(\int_{\mathcal T}f(x)\ dF_X.\) The differential \(dF_X\) is equal to \((F_X)'(x)\ dx = p_X(x)\ dx\). And so, our assumptions about integrability imply that 8
\(\displaystyle \lim_n \mathbb E[Y_n] = \int_{\mathcal T} f(x)p_X(x)\ dx.\)
Some details remain. If a slightly stronger assumption is imposed on \(f\), namely that it is piecewise continuous on \([a,b]\), then the argument can be completed. Indeed, if \(f\) is piecewise continuous, then the limit of \(Y_n\) must agree with \(Y\) almost everywhere (at all but, possibly, finitely many points, in fact). Then it must be that the expected value of \(Y\) is the same as the above limit.
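The convergence of \(\mathbb E[Y_n]\) can also be watched numerically. The sketch below makes several assumptions that are mine and not part of the argument: a uniform subdivision, midpoints for the points \(x_i^*\), the density \(p_X(x) = 2x\) on \([0,1]\) (so \(F_X(x) = x^2\)), and \(f(x) = x^2\). With these choices the limit should be \(\int_0^1 x^2\cdot 2x\ dx = 1/2\).

```python
def expected_Y_n(f, F, a, b, n):
    """E[Y_n] = sum over i of f(x_i*) * (F(x_i) - F(x_{i-1})),
    on a uniform subdivision of [a, b] with midpoints as the points x_i*."""
    width = (b - a) / n
    total = 0.0
    for i in range(1, n + 1):
        left = a + (i - 1) * width
        right = a + i * width
        sample = 0.5 * (left + right)  # the chosen point x_i*
        total += f(sample) * (F(right) - F(left))
    return total


# Illustrative choices (assumptions for this sketch).
def F(x):
    return x * x  # cdf of the density p_X(x) = 2x on [0, 1]


def f(x):
    return x * x


for n in (10, 100, 1000):
    print(n, expected_Y_n(f, F, 0.0, 1.0, n))
```

The printed values approach \(1/2\) as \(n\) grows, which is what the Riemann sum argument predicts.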
- While this is easiest to think about when \(\mathcal T\) is finite (and recall that we had said in the last post that our discrete variables would be just those with \(\mathcal T\) being finite), this definition will work if \(\mathcal T\) is a countably infinite set as long as the infinite sum is absolutely convergent. For example, it could be that \(\mathcal T = \mathbb N\) and that \(p_X(n)\) is exponentially decaying as \(n\to\infty\). Want an example of an honest probability mass function \(p_X\) that does not give \(X\) a well-defined expectation? Consider \(p_X(n) = \dfrac{6}{\pi^2n^2}\) (here \(\mathcal T = \mathbb N\)): the sum \(\sum_n n\ p_X(n)\) is a constant multiple of the harmonic series, which diverges. ↩
- Formally, using the phrase weighted average is simply giving another name to the expected value. But giving something a helpful name can be surprisingly useful. ↩
- Provided that the function \(x\ p_X(x)\) is integrable over \([a,b]\). ↩
- Note that the only possible values of \(X\) are in \([0,\infty)\). Alternatively, one could have \(\mathcal T = \mathbb R\) by setting \(p_X(x) = 0\) for any \(x < 0\). ↩
- In earnest, you only need the domain of \(f\) to contain \(\mathcal T\). ↩
- Provided that the function \(f(x)p_X(x)\) is integrable over \(\mathcal T\). ↩
- Nor is the square root of the variance (the standard deviation). ↩
- The limit involving \(P_n\) such that the maximum width of a subinterval \((x_{i-1}, x_i)\) in \(P_n\) approaches \(0\) as \(n\to\infty\). ↩