Random variables, sampling data

In this post I will go over the basics of random variables and their associated distributions. I will begin with discrete random variables and then will discuss continuous random variables. My focus will mainly be on random variables with values in \(\mathbb R\), though I will briefly mention other types.

Afterwards, I will discuss two different senses in which random variables arise when analyzing data. Some basic knowledge of probability is assumed in this post. If you feel that you need a review, or don’t remember the jargon, work through the post Probability Refresher.

Before I jump into the details, informally a random variable is just a way of assigning numbers to outcomes (from a sample space).¹ Don’t get too hung up on the word ‘random’ that is used here. The idea is that there is an element of chance to what the value of the variable will be (the chance coming from which outcome occurs), and that’s it. Here are a few examples, trying to illustrate the breadth of subjects where random variables arise.

Example 1. Say that each review on a website must choose a rating of 1 through 5 stars (no fractions of stars being allowed). Call each review an outcome, and make the rating given be the value of the random variable on that outcome.

Example 2. Based on observations, it is estimated that about 1 in 5 “Sun-like” stars have an Earth-sized planet in their habitable zone. Make an outcome be a search for such a planet around 5 Sun-like stars, and the random variable equal to the number of stars in the search where such a planet is found.

Example 3. The height of a 2-year-old child who was selected from a group of 2-year-olds. The child that you select is the outcome and their height is the value of the random variable.

Example 4. How many dollars a customer that visits a particular store/site will spend, given that they do spend money. The outcome is the visit that results in a purchase, the value is the dollar amount.

More formally, let \((\Omega, {\bf P})\) be a probability space (a sample space with a given probability function \({\bf P}\)). A random variable \(X\) is a function that has domain \(\Omega\), and takes values in some target space \(\mathcal T\) (writing \(X:\Omega \to \mathcal T\)). I will use the convention that \(X\) is onto. So, every \(x\in\mathcal T\) equals \(X(\omega)\) for some \(\omega\in\Omega\).² For most of this post, the target space is a subset of the reals, \(\mathcal T \subseteq \mathbb R\), and often a “small” subset of \(\mathbb R\).³

Contents

1 Discrete random variables
2 Joint distributions of two random variables
3 Independent random variables
4 Continuous random variables
5 Sampling Values of a Random Variable

Discrete random variables

A discrete random variable \(X\) is a random variable where \(\mathcal T\) is finite.⁴ The variables in Examples 1 and 2 above are discrete random variables, as are the examples just after this paragraph. Whenever \(\Omega\) is a finite set, \(\mathcal T\) must be finite, making \(X\) discrete. The target space can also be finite when \(\Omega\) is infinite. Until the section on continuous random variables, all random variables should be assumed to be discrete.

Example 5. The maximum number showing after rolling two 4-sided dice. This is a random variable where \(\Omega\) is the set of results you can get by rolling two dice. So, if \((1,4)\) represents the outcome of rolling and having one die show a 1 and the other show a 4, then the value of the random variable is \(4\).

Example 6. The Blackjack score, upon drawing two cards. (For simplicity, assume you are the only one being dealt two cards from the deck.) In this example, \(\Omega\) is the set of two-card hands. For example, if you get a hand of \(\{Q\heartsuit, 9\clubsuit\}\) then the numerical value is \(19\).

In Example 5, \(\mathcal T = \{1,2,3,4\}\). In Example 6, an Ace can be interpreted as 11 points or 1 point. If you decide that the score is the largest possible (without going over 21), then \(\mathcal T = \{4, 5, 6, 7, \ldots, 20, 21\}\).

Every discrete random variable \(X:\Omega\to\mathcal T\) has an associated function \(p_X\) that assigns a probability: \(p_X(x) = {\bf P}(X = x)\) for each \(x\in\mathcal T\). In terms of events, \(p_X(x) = {\bf P}(\{\omega\in\Omega\ |\ X(\omega) = x\})\).

Figure. The probability mass function of the Blackjack score of a single two-card hand drawn from a 52-card playing deck.

Following others, when the random variable is unambiguous, I will simply write \(p(x)\). This function \(p\) is called the probability mass function of \(X\) (pmf for short). ⁵ The figure displayed represents the graph of the pmf of the “Blackjack score” random variable of Example 6.⁶ Each bar at \(x\) has height equal to \(p(x)\).
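As a rough sketch of how such a pmf can be tabulated, the following Python snippet enumerates every two-card hand and counts how often each score occurs. The names RANKS, SUITS, DECK, card_points and blackjack_score are made up for this illustration, and the scoring follows the convention (used just above) that an Ace counts as 11 unless that would push the score over 21.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical encoding of a 52-card deck as (rank, suit) pairs.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def card_points(rank):
    """Points for one card, provisionally counting an Ace as 11."""
    if rank == "A":
        return 11
    if rank in ("J", "Q", "K"):
        return 10
    return int(rank)


def blackjack_score(hand):
    """Largest score not exceeding 21 (an Ace may drop from 11 to 1)."""
    total = sum(card_points(rank) for rank, _ in hand)
    aces = sum(1 for rank, _ in hand if rank == "A")
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total


# Omega: all unordered two-card hands, assumed equally likely.
hands = list(combinations(DECK, 2))
counts = Counter(blackjack_score(hand) for hand in hands)

# p(x) = (number of hands with score x) / (number of hands)
pmf = {x: Fraction(n, len(hands)) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, float(p))
```

Using Fraction keeps the probabilities exact, so checking that they sum to 1 involves no rounding.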

Returning to Example 2. Let \(X\) be the number of stars, from five searches, where a habitable, Earth-sized planet is found. What is the pmf \(p(x)\) of this random variable? Its target space will be \(\{0,1,2,3,4,5\}\).

Figure. The probability mass function of successes in 5 searches for a habitable planet around Sun-like stars.

First, we work from the estimate that 1 in 5 of the Sun-like stars have such a planet. So, for each searched star there is a \(20\%\) chance of finding one (and \(80\%\) chance of not finding one).
For none to be found, the \(80\%\) probability event happens 5 times, so \(p(0) = (0.8)^5 = 0.32768\). When 1 planet is found, then exactly one of the searches resulted in the \(20\%\) probability event; since there are \(5=\binom{5}{1}\) possibilities for which search that occurred in, \(p(1) = 5\cdot(0.2)\cdot(0.8)^4 = 0.4096\). When 2 planets are found, then exactly two of the searches resulted in the \(20\%\) probability event; accounting for the possible ways this can occur, we get \(p(2) = \binom{5}{2}\cdot(0.2)^2\cdot(0.8)^3 = 0.2048\). The full pmf is given by the following table.⁷

\(x\)	0	1	2	3	4	5
\(p(x)\)	0.32768	0.4096	0.2048	0.0512	0.0064	0.00032
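The table can be reproduced with a few lines of Python. This is only a sketch of the computation described above: the names n_searches, p_success and planet_pmf are my own, and the formula assumes the five searches are independent with the same 20% chance of success.

```python
from math import comb

n_searches = 5    # number of Sun-like stars searched
p_success = 0.2   # estimated chance of finding a planet at one star


def planet_pmf(x):
    """P(exactly x searches succeed), assuming independent searches."""
    ways = comb(n_searches, x)  # which x of the searches succeeded
    return ways * p_success**x * (1 - p_success) ** (n_searches - x)


table = {x: planet_pmf(x) for x in range(n_searches + 1)}
for x, p in table.items():
    print(x, round(p, 5))
```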

Exercise 1. Find the four values of the pmf of the random variable in Example 5.

I will also sometimes write \(X\sim p(x)\), which is read as either \(X\) is distributed according to \(p(x)\), or as \(X\) is sampled from the distribution \(p(x)\).

Exercise 2. Show: If \(p(x)\) is a pmf then \(p(x)\geq 0\), for every \(x\in\mathcal T\), and \(\sum_{x\in\mathcal T}p(x) = 1\).

You can think of \(\mathcal T\) as a sample space. Then the function \(p_X:\mathcal T \to [0,1]\) for a random variable \(X\) determines a probability function on \(\mathcal T\), where the probability of a subset is the sum of values of \(p_X\) on its elements. That is, for \(A\subseteq \mathcal T\), a probability function is given by

\({\bf P}_X(A) = \sum_{x\in A} p_X(x)\).

Exercise 3. Check the claim that \({\bf P}_X\) defines a probability function (recalling the three properties required for probability functions).

It is common to use \({\bf P}(X \leq x)\) for the probability of the outcomes \(\omega\) where \(X(\omega) \leq x\). The cumulative distribution function (or cdf), which I will write as \(F_X:\mathcal T \to [0,1]\), records these probabilities. That is, for each \(x\in\mathcal T\),

\(F_X(x) = {\bf P}(X \leq x)\).

Since our discrete random variables have finite \(\mathcal T\), there is a smallest number in \(\mathcal T\), and just finitely many between that smallest and \(x\). And so, \(F_X(x)\) can be found by summing \(p_X(t)\) over those \(t\in\mathcal T\) where \(t\leq x\):

\(F_X(x) = \sum_{t\leq x} p_X(t)\).

Like with the pmf of \(X\), I will drop the subscript and use \(F(x)\) when it does not create ambiguity.

Example 7. Change the random variable of Example 5 above to the maximum showing after rolling two 6-sided dice (assume each side has equal probability — a “fair” die). Let values of \(X\) be this maximum. Let’s find \(F_X(2)\). Think of there being a first roll and a second roll, the rolls being independent; so there are 36 outcomes in \(\Omega\). There is only one way for \(X = 1\): when both rolls are 1, and so \(p(1) = 1/36\). To get \(X=2\) requires either rolling one 1 and one 2, or rolling two 2’s. The first one happens in two ways, 1 then 2, or 2 then 1. The second possibility only happens in one way. Therefore, \(p(2) = 3/36\) and so,

\(F_X(2) = p(1) + p(2) = \frac{1}{36} + \frac{3}{36} = \frac{1}{9}\).
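The same count can be checked by brute force. The sketch below lists the 36 ordered rolls and sums the pmf to get the cdf; the function names pmf_max and cdf_max are just labels chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

sides = range(1, 7)
# Omega: 36 ordered pairs (first roll, second roll), all equally likely.
outcomes = list(product(sides, repeat=2))


def pmf_max(x):
    """p(x) = P(X = x), where X is the larger of the two rolls."""
    favorable = sum(1 for rolls in outcomes if max(rolls) == x)
    return Fraction(favorable, len(outcomes))


def cdf_max(x):
    """F(x) = sum of p(t) over the values t <= x."""
    return sum((pmf_max(t) for t in sides if t <= x), Fraction(0))


print(pmf_max(1), pmf_max(2), cdf_max(2))
```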

Exercise 4. Using the \(X\) defined in Example 6, find \(F(20)\).
[Hint: It is easier to find \(p(21) = {\bf P}(X = 21)\). How does this help to find the probability that you want?]

Joint distributions of two random variables

Given two random variables \(X\) and \(Y\), with the same sample space, conditional probabilities have a familiar notation. Specifically, \({\bf P}(X=x\ |\ Y=y)\) is the probability that \(X(\omega)=x\) when \(\omega\) is drawn from \(Y^{-1}(\{y\})\). The joint probability \({\bf P}(X=x, Y=y)\) is the probability of \(X^{-1}(\{x\})\cap Y^{-1}(\{y\})\); that is, the probability in \(\Omega\) of the subset of \(\omega\) such that both \(X(\omega)=x\) and \(Y(\omega) = y\) hold.

Because of the definition of conditional probability, \({\bf P}(X=x\ |\ Y=y) = \frac{{\bf P}(X=x, Y=y)}{{\bf P}(Y=y)}\).

Returning to Example 6. Let \(X\) be the score of the two-card hand, and let \(Y\) be the number of face cards in the two-card hand. This means that the target space of \(Y\) is \(\{0,1,2\}\). Since there are \(\binom{12}{2}\) ways of getting 2 face cards, we have that

\(p_Y(2) = \frac{\binom{12}{2}}{\binom{52}{2}} = \frac{11}{13\cdot 17} \approx 0.05\).

What are \({\bf P}(X=20, Y=2)\) and \({\bf P}(X=20\ |\ Y=2)\) equal to? First, given that you have 2 face cards, your Blackjack score will be 20. This means that \({\bf P}(X = 20\ |\ Y=2)\) should equal \(1\). To check that this works from the definition, the set of outcomes where \(Y=2\) is a subset of the outcomes where \(X = 20\), so the intersection is all outcomes where \(Y=2\), and so \({\bf P}(X=20, Y=2) = {\bf P}(Y=2)\). This means that

\({\bf P}(X=20\ |\ Y=2) = \frac{{\bf P}(X=20, Y=2)}{{\bf P}(Y=2)} = 1\).
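To see the definition at work numerically, here is a small sketch that computes both probabilities by enumerating hands. The deck encoding and the helper names (score, num_faces, prob) are assumptions made for this example; suits are represented by the integers 0 to 3 since they play no role in the score.

```python
from fractions import Fraction
from itertools import combinations

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
FACES = {"J", "Q", "K"}
DECK = [(rank, suit) for rank in RANKS for suit in range(4)]
hands = list(combinations(DECK, 2))  # all two-card hands, equally likely


def score(hand):
    """X: Blackjack score of a two-card hand (Ace is 11 unless that exceeds 21)."""
    points = [11 if r == "A" else 10 if r in FACES else int(r) for r, _ in hand]
    total = sum(points)
    # With two cards, only a pair of Aces exceeds 21; count one Ace as 1.
    return total - 10 if total > 21 else total


def num_faces(hand):
    """Y: number of face cards in the hand."""
    return sum(1 for r, _ in hand if r in FACES)


def prob(event):
    """Probability of the set of hands on which event(hand) is True."""
    return Fraction(sum(1 for h in hands if event(h)), len(hands))


p_y2 = prob(lambda h: num_faces(h) == 2)
p_joint = prob(lambda h: score(h) == 20 and num_faces(h) == 2)
p_cond = p_joint / p_y2  # definition of conditional probability

print(p_y2, p_joint, p_cond)
```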

You can view the pair \((X, Y)\) as a single random variable; remember, they have the same sample space. However, the target space will have to be in \(\mathbb R^2\), instead of just being a set of real numbers. So, your random variable is \((X,Y): \Omega \to \mathcal T\), where \(\mathcal T\subseteq \mathbb R^2\). As both \(X\) and \(Y\) are discrete (in this section), the \(\mathcal T\) here is a finite set of points in the plane.

The random variable \((X, Y)\) has a pmf, defined by the joint probability, \(p(x, y) = {\bf P}(X=x, Y=y)\). It is also not uncommon to write \(p(x\ |\ y)\) to mean the conditional probability \({\bf P}(X=x\ |\ Y=y)\).⁸ The cdf for \((X, Y)\) is defined by setting

\(F_{(X,Y)}(x, y) = {\bf P}(X \leq x, Y \leq y) = \sum_{t\leq x}\sum_{u\leq y} p(t, u)\)

(where each \(t\) is in the target of \(X\), and \(u\) in target of \(Y\)).⁹

Joint probability table. With two discrete random variables \(X, Y\), you can write a probability table of the joint probabilities \(p(x,y)\) (this describes the pmf of \((X, Y)\)). Of course, once the size of \(\Omega\) gets to be too big, it makes less sense to do this.

Exercise 5. Given the joint probability table below, ¹⁰ find \(F_{(X,Y)}(2, 1) = {\bf P}(X\le 2, Y\le 1)\).

\(Y\ \backslash\ X\)	1	2	3	4
3	1/8	0	0	0
2	1/8	1/8	0	0
1	1/8	1/8	1/8	0
0	1/16	1/16	1/16	1/16

The Sum Rule. Given a multivariate random variable \((X, Y)\), distributed according to \(p(x, y)\), it is immediate from definitions that \(p(x\ |\ y)p_Y(y) = p(x, y)\). There is another identity, allowing you to find \(p_X\) from the joint distribution.¹¹ It can be quite useful and involves summing over possible values of \(Y\). Suppose that \(\mathcal T_Y = \{y_1, y_2, \ldots, y_k\}\) is the target space of \(Y\). Then we get the

Sum Rule
\(p_X(x) = \sum_{i=1}^kp(x, y_i)\), which also equals \(\sum_{i=1}^kp(x\ |\ y_i)p_Y(y_i)\).

Exercise 6. From the previous joint probability table, use the sum rule to find \(p_X(1)\) and \(p_Y(0)\). Your answers should satisfy \(p_X(1)p_Y(0) = \frac{7}{64}\); note this is different than the joint probability \(p(1, 0) = \frac{1}{16}\).

If it were the case that the joint probability equals the product of the individual probabilities, then the sum rule would follow simply from the fact that \(\sum_{i=1}^kp_Y(y_i) = 1\). But this is usually not the case (see Exercise 6 above, and the section on independent random variables below).

The identity holds for two reasons: (1) preimages of different values are mutually exclusive (if \(y_1\ne y_2\) then \(Y^{-1}(\{y_1\})\cap Y^{-1}(\{y_2\}) = \emptyset\)), and (2) intersections “distribute” over unions. In more detail, let \(S = X^{-1}(\{x\})\). Then, we get

\({\bf P}(S) = {\bf P}(S\cap\Omega) = {\bf P}(S\cap(\bigcup_{1\le i\le k}Y^{-1}(\{y_i\}))) = {\bf P}(\bigcup_{1\le i\le k} S\cap Y^{-1}(\{y_i\}))\).

Since the preimages of each of the \(y_i\) are disjoint, this equals

\(\sum_{i=1}^k{\bf P}(S\cap Y^{-1}(\{y_i\})) = \sum_{i=1}^k{\bf P}(X=x, Y=y_i)\),

and so, the sum rule holds.
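A quick numerical check of the sum rule, as a sketch: take two fair 4-sided dice, let \(X\) be the smaller roll and \(Y\) the gap between the rolls, and compare the marginal obtained from the joint pmf with \(p_X\) computed directly from \(\Omega\). The variable names (joint, p_X_direct, marginal_from_joint) are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Omega: 16 equally likely ordered rolls of two 4-sided dice.
outcomes = list(product(range(1, 5), repeat=2))
weight = Fraction(1, len(outcomes))

joint = {}        # p(x, y)
p_X_direct = {}   # p_X(x), computed without reference to Y
for a, b in outcomes:
    x, y = min(a, b), abs(a - b)
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + weight
    p_X_direct[x] = p_X_direct.get(x, Fraction(0)) + weight

y_values = sorted({y for _, y in joint})  # target space of Y


def marginal_from_joint(x):
    """Sum rule: p_X(x) = sum over y_i of p(x, y_i)."""
    return sum((joint.get((x, y), Fraction(0)) for y in y_values), Fraction(0))


sum_rule_holds = all(marginal_from_joint(x) == p_X_direct[x] for x in p_X_direct)
print(sum_rule_holds)
```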

Independent random variables

Recall that events \(A\) and \(B\) are independent if \({\bf P}(A\cap B) = {\bf P}(A){\bf P}(B)\). Two random variables \(X, Y\) are called independent if one has \(F_{(X,Y)}(x, y) = F_X(x)F_Y(y)\) for every \((x, y)\) in the target space. ¹² As we are working with discrete random variables, this is the same as having \(p(x, y) = p_X(x)p_Y(y)\) for every \((x, y)\). ¹³

The two random variables determined by the probability table in Exercise 5 are not independent. While random variables (that one tends to consider together) are often not truly independent, it is, nevertheless, a common assumption that is made in order to make computations more manageable (or even feasible, sometimes). In the habitable planet example, one could think of the result of each of the five searches as a random variable, and the computations that were done there assumed that any two of them were independent variables.
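The product condition \(p(x, y) = p_X(x)p_Y(y)\) is easy to test by enumeration on a small sample space. The sketch below uses two fair 4-sided dice; the helper is_independent and the four example variables are names invented for the illustration. The two individual rolls pass the test, while the smaller roll and the gap between the rolls do not.

```python
from fractions import Fraction
from itertools import product

# Omega: 16 equally likely ordered rolls of two 4-sided dice.
outcomes = list(product(range(1, 5), repeat=2))


def prob(event):
    """Probability of the set of outcomes on which event(w) is True."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))


def is_independent(X, Y):
    """Check p(x, y) == p_X(x) * p_Y(y) for every pair of values."""
    xs = {X(w) for w in outcomes}
    ys = {Y(w) for w in outcomes}
    return all(
        prob(lambda w: X(w) == x and Y(w) == y)
        == prob(lambda w: X(w) == x) * prob(lambda w: Y(w) == y)
        for x in xs
        for y in ys
    )


def first_roll(w):
    return w[0]


def second_roll(w):
    return w[1]


def smallest(w):
    return min(w)


def gap(w):
    return max(w) - min(w)


print(is_independent(first_roll, second_roll))  # the two rolls
print(is_independent(smallest, gap))            # minimum and gap
```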

Continuous random variables

When the target space of a random variable is infinite, ¹⁴ then probabilities become more subtle. A continuous random variable is a random variable whose target space is some “nice” subset of \(\mathbb R\). To describe what is meant by “nice” is a long road,¹⁵ but it includes intervals of \(\mathbb R\), and a union of finitely many intervals. In this section, all random variables are continuous.

To illustrate the subtlety of an infinite target space, say that \(\mathcal T = [0,2]\), and that I am interested in having every number in the interval have equal probability. It quickly becomes clear that I cannot do things the same as in the discrete case. For example, if the probability \(p(x)\) is constant and positive for every \(x\in[0,2]\), then simply adding these won’t have a total of 1 (whether you think of that sum as not defined, or as infinite, it’s not what we want). On the other hand, \(p(x)=0\) for each \(x\) would seem to make the total probability be zero (if sums are being used).

The point is that simple sums don’t work for continuous random variables. They must be replaced with integrals. When one does this, the probability functions are a little different. Specifically, limits are in play (in the background, as derivatives and integrals).¹⁶

However, a continuous random variable is still distributed according to a function, its probability density function (or pdf). I will cover a couple of common classes of pdf’s. For a function \(f\) to be eligible as a pdf of a random variable (with target space \(\mathcal T\)) it must satisfy

- \(f(x) \geq 0\) for all \(x\in\mathcal T\)
- \(\int_{\mathcal T}f(t)\ dt = 1\).

Exercise 7. Find a function \(f\), with \(\mathcal T\) being some interval, that satisfies these two conditions, but where \(f(x) > 1\) for some \(x\in\mathcal T\).

You might not be familiar with the form of the integral in the second condition. If \(\mathcal T\) is an interval, let’s say \(\mathcal T = [a,b]\), then the integral \(\int_{\mathcal T}f(t)\ dt\) is the same as \(\int_a^bf(t)\ dt\).

Rather than the values of \(f(x)\) representing the probability of \(X\) taking value \(x\), you should instead think of the probability of \(X \sim f(x)\) taking a value in some range, and this is handled with an integral:

\({\bf P}(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2}f(t)\ dt\).

Uniform distributions. Let \([a, b]\) be a closed interval of \(\mathbb R\), with \(a < b\). Define a constant function \(f:[a, b] \to \mathbb R\) by setting \(f(x) = \frac{1}{b-a}\). A random variable \(X:\Omega\to [a, b]\) is called uniformly distributed (over \([a,b]\)) if \(X\) is distributed according to this constant function \(f\). Equivalently, for any positive length \(\ell\), and for any \(x\in[a, b]\) that has \(x + \ell \leq b\), we get that

\({\bf P}(x \leq X \leq x+\ell) = \int_x^{x+\ell}\frac1{b-a}dt\).

Note that the value of this integral does not depend on \(x\). This means that for every subinterval of length \(\ell\), regardless of where it appears in \([a, b]\), the probability of \(X\) being in it is \(\frac{\ell}{b - a}\). In this sense, numbers in different parts of the interval have equal likelihood of being a value of \(X\).

The target space of this random variable is \([a, b]\), making \(a\) the smallest possible value, and so

\({\bf P}(X \leq x) = \int_a^x f(t)\ dt = \frac{x - a}{b-a}\).

Thus, the cdf of a variable that is uniformly distributed over \([a,b]\) is \(F(x) = \frac{x-a}{b-a}\).
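As an informal check of these formulas, one can draw many uniformly distributed values with NumPy and compare observed fractions with \(\frac{x-a}{b-a}\) and \(\frac{\ell}{b-a}\). This is a simulation sketch rather than a proof; the endpoints, the seed, and the sample size below are arbitrary choices for the illustration.

```python
import numpy as np

a, b = 2.0, 5.0                      # arbitrary interval for the illustration
rng = np.random.default_rng(0)       # seeded so the run is repeatable
samples = rng.uniform(a, b, size=200_000)


def uniform_cdf(x):
    """F(x) = (x - a) / (b - a) for x in [a, b]."""
    return (x - a) / (b - a)


# Fraction of samples that are <= x, compared with the cdf formula.
x = 3.5
empirical_cdf = np.mean(samples <= x)
print(empirical_cdf, uniform_cdf(x))

# Subintervals of the same length ell, at different positions in [a, b].
ell = 0.6
interval_fractions = []
for start in (2.0, 3.1, 4.4):
    inside = (samples >= start) & (samples <= start + ell)
    interval_fractions.append(np.mean(inside))
print(interval_fractions, ell / (b - a))
```

The observed fractions should all be close to \(\frac{\ell}{b-a}\), whichever starting point is used.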

Exercise 8. For a uniformly distributed \(X\), use the definitions of \({\bf P}(X\leq x)\) and \({\bf P}(x_1\leq X\leq x_2)\) found above, and properties of integrals, to verify the following.

- For any \(x\in [a, b]\), \({\bf P}(X = x) = 0\).
- For any \(x\in [a,b]\), \({\bf P}(X \geq x) = 1 - {\bf P}(X \leq x)\).

I’ll call your attention to the fact that, for this uniformly distributed variable, \(\frac{d}{dx}F(x) = \frac{1}{b-a} = f(x)\). The Fundamental Theorem of Calculus tells us that this holds generally, for any continuous random variable on \([a, b]\):

\(\frac{dF}{dx} = \frac{d}{dx}\left(\int_a^x f(t)\ dt\right) = f(x)\).

So the probability density function is the derivative of the cdf.¹⁷

Normal distributions. Given two constants \(\mu\) and \(\sigma > 0\), define a function \(f:\mathbb R \to \mathbb R\) by setting

\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\).

A random variable \(X\) which has this \(f(x)\) for its pdf is called normally distributed (with mean \(\mu\) and standard deviation \(\sigma\)). It is common to see people write \(X \sim \mathcal N(\mu, \sigma^2)\).

Figure. The pdf of a normally distributed variable with mean -1 and standard deviation 1. The area under the curve to the left of the mean is shaded in blue.

To be sure, it is more of a challenge to see that the integral over \(\mathbb R\) of this function is equal to 1,¹⁸ but it is. The most probable ranges of values of \(X\) are centered around \(\mu\).

As the general construction prescribes, the cumulative distribution function is \(F(x) = \int_{-\infty}^x f(t)\ dt\). However, unlike with uniformly distributed variables, this cdf does not have a “closed form.” To make computations for it, it is necessary to use approximate (numerical) computations of the integral. It can be worthwhile to be familiar with its value at \(x = \mu + k\sigma\) for some different multiples of \(\sigma\).¹⁹
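In practice, the numerical work is usually delegated to a library. As a minimal sketch, Python’s standard math.erf (the error function) can be used, via the identity \(F(x) = \frac12\left(1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt 2}\right)\right)\). The function name normal_cdf and the choice \(\mu = -1\), \(\sigma = 1\) are mine, for illustration only.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of a normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


mu, sigma = -1.0, 1.0
within = {}
for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma) = F(mu + k*sigma) - F(mu - k*sigma)
    within[k] = normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)
    print(k, round(within[k], 4))
```

The three printed values are approximately 0.6827, 0.9545 and 0.9973, and they do not depend on the particular \(\mu\) and \(\sigma\) chosen.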

Exercise 9. If \(X \sim \mathcal N(\mu, \sigma^2)\), and \(F(x)\) is the cdf of \(X\), explain why \(F(\mu) = 0.5\). [The explanation can be made without actually carrying out the calculation of an integral (assuming it is known that the integral over \(\mathbb R\) of \(f(x)\) is equal to 1). To do this, do a substitution \(u = x - \mu\), and think about the function \(f(u)\). Does it have a symmetry you can exploit?]

There are a number of random variables that occur “in nature,” which seem to be modeled well by assuming they have a normal distribution. Some examples might include: the strength of an observed hydrogen bond between two water molecules; the temperature reading from a thermometer, on a given individual; the weight of a 30-year-old female from a given region.

Despite what was said in the previous paragraph, many naturally occurring random variables are not modeled well with an underlying normal distribution, e.g. incomes, time to the next earthquake, the population density of a locale, and more. One should be careful about making an assumption that a random variable is normally distributed.

Sampling Values of a Random Variable

For the random variables that I have discussed to this point (both discrete and continuous), I had complete knowledge of the probability distribution (the pmf for discrete variables, and the pdf for continuous variables). For example, when \(X\) was the Blackjack score of a two-card hand, knowing the drawing came from a (well-shuffled) 52-card playing deck, one can determine the probability of any value of \(X\) by counting outcomes. When discussing a normally, or uniformly, distributed continuous variable, I essentially defined the random variable by the probability density function.

For real-world examples of random variables, like those that I have brought up, you almost never have that complete knowledge. At the end of the last section, I mentioned a random variable with values being the weights of 30-year-old females from a given region. Let’s consider it.

To be concrete, say that you give “30 years old” its calendar meaning: anyone whose 30th birthday has arrived, but whose 31st birthday hasn’t yet (even if it is the next day!). Also, say the sample space is 30-year-old individuals, identifying as female, who live in Spain. For an \(\omega\) in this set of people, \(X(\omega)\) is the exact weight of that person.

Now, there is an exact weight of each of the people mentioned, at a given time at least. So, \(X\) is truly distributed according to some function at that time. That distribution function is called the population distribution. But…think about the difficulties of determining this function. How do you get exact weights, for one? Even if you use some approximate weight, it changes a bit during the course of a day. Since there are millions of people in the set, can you realistically ever find the population distribution?

The distributions of most random variables similarly suffer from being impossible to know completely (think of Example 4, where some of the outcomes you care about will occur in the future). So, we instead work with an empirical distribution for \(X\), which is the distribution of some observations of values of that variable. In other words, after someone makes observations of values of those variables and records them, that data is the empirical distribution.

Example 9. The set of all petal lengths of Iris flowers has a population distribution. Using the Iris data set, we have an empirical distribution. For example, we can find its mean, getting: average_petal_length = 3.758.

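import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data once and wrap the measurements in a DataFrame.
iris = load_iris()
iris_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# Empirical mean of the petal lengths, in cm.
average_petal_length = np.mean(iris_data['petal length (cm)'].values)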

When the data collected is “representative,” then the empirical distribution is a good approximation to the population distribution. In that case, quantities like the empirical mean, computed above for Iris petal lengths, will be good approximations of the population mean.

There are situations, when you have a large data set, where you want to simulate the process of making observations and getting data. NumPy is very useful for this, through its random module. For example, if your data set is in a list called data that has length L, one data point in each index, you can pick a random index by calling np.random.randint(L). Using this, you can pull up a random data point by calling data[np.random.randint(L)]. Repeating this some number of times, you can simulate collecting data. This can be a very useful technique when determining model parameters, to make the resulting model less sensitive to noise.
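Here is a small sketch of that procedure. The list data holds made-up numbers standing in for a real data set, and draw_sample is a hypothetical helper; the draws are made with replacement, so the same data point can be picked more than once.

```python
import numpy as np

# Made-up measurements standing in for a real data set.
data = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
L = len(data)

np.random.seed(0)  # only so that the run is repeatable


def draw_sample(n):
    """Simulate n observations by picking random indices into data."""
    return [data[np.random.randint(L)] for _ in range(n)]


sample = draw_sample(1000)
print(np.mean(data), np.mean(sample))
```

With enough draws, the mean of the simulated sample should land close to the mean of the data it was drawn from.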

¹ In fact, you don’t need to restrict yourself to numbers only. Doing so just simplifies the discussion, so it is done this way most of the time. For an example of a random variable that does not have numerical values: the value being the manufacturer of the next car to be seen to travel down a highway.
² One could, instead, choose to have \(\mathcal T\) be the codomain, not requiring the function to be onto. There is no meaningful difference in the choice, but it is important to remember which choice was made.
³ In full generality, there are some technicalities on the target space that must be set. There is a collection of its subsets that have a measure, turning \(\mathcal T\) into something called a measurable space, and probabilities all come from subsets that have a measure. Such concerns essentially can be avoided when dealing only with discrete random variables. When I get to continuous random variables, I will try to keep it as simple as possible.
⁴ Or countably infinite, but we will completely ignore that setting.
⁵ While \(p\) does not make reference to the variable \(X\), I am trying to walk a fine line between consistency with the MML book’s notation, and choosing helpful notation that doesn’t collide with other notation in use. Whenever it seems needed, I will point out that I am discussing the pmf of a random variable.
⁶ The figure is not really a histogram, just bars to show the value of the pmf.
⁷ We have made the assumption that the searches are independent, pairwise (see the section on independent random variables).
⁸ This is pretty bad notation, since the underlying random variable is different (its distribution uses conditional probabilities, given \(Y=y\) for that particular \(y\); so it does not have the same sample space). The authors of the MML book admit as much. However, it is commonly used, so we will use it.
⁹ In some sense, one could argue that this definition of the cdf is the “natural” generalization to higher dimensional random variables. It should be noted, though, that using it means that you are really thinking of \(X\) and \(Y\) individually, as separate random variables. To some degree, you are treating the coordinates (or features) as more essential than the points themselves. This will have a heavy hand on the results of doing analysis on sample points from that random variable.
¹⁰ In this probability table, \(X\) is the minimum of rolling two 4-sided dice (equal chance of each number 1 through 4), and \(Y\) is the result of subtracting the smallest from the largest (or getting 0 if the rolled numbers are equal).
¹¹ When you start with \((X, Y)\) and its joint distribution, the distribution \(p_X\) is sometimes called the marginal distribution on \(X\).
¹² When the events involved have positive probability, then this is the same as saying that, for every \(x, y\), the probability of the event \(X\le x\) is the same as its conditional probability given that \(Y\le y\).
¹³ Beginning with the smallest \(x\) (and smallest \(y\)) in the respective target spaces, a double induction argument shows that these are the same.
¹⁴ Especially, uncountably infinite, like an interval of \(\mathbb R\).
¹⁵ To fully treat the notion of probabilities on infinite sets requires a discussion of a mathematical construction called a measure. I will avoid this and try to keep it simple, defining probabilities on \(\mathbb R\), or certain “nice” subsets of \(\mathbb R\).
¹⁶ Recall that an integral, by definition, is the limit of a sum of (signed) areas of rectangles, the rectangles thinning as you have more of them. If you take care with the heights of the rectangles, then this will allow you to generalize a finite sum of probabilities to the infinite setting.
¹⁷ If you happen to have a physics background, a density is always some type of derivative.
¹⁸ For \(\mu = 0\) and \(\sigma = 1/\sqrt{2}\), there is a nice way to use polar coordinates to show the integral equals 1. Some easy alterations to the calculation then show how it generalizes to arbitrary \(\mu\) and \(\sigma\).
¹⁹ See the 68-95-99.7 rule.