Probability refresher

Nmeonbut

In this post, I review some of the basics in probability. If you are in one of my courses, you have likely seen this before. If you haven’t though, my goal is to be thorough. I will define important terms and, at times, provide links to further reading. The need to understand some probability is essentially inescapable, even for a modest understanding of machine learning.


The Basics

Probability textbooks often use terms like “experiments” and “trials.”  The authors of these texts likely choose such words for lack of a better choice. An “experiment” in probability does not necessarily mean what you would think of as a scientific experiment. For example, it could be drawing a card from a deck, or asking a person a question. It is simply the process of observing and, perhaps, recording (as data) the outcome of something that happens — usually, the thing happening is uncertain to some degree.

The thing that happens is the observation (at times, a random draw or observed value). As an initial example, suppose you were to go around asking people what US state they live in (giving them the option to say ‘Not in a US state’). In this terminology, an answer from a person is an observation. The sample space \(\Omega\) is the set of all possibilities for the outcome. So in this example, \(\Omega\) is the set containing the 51 possible answers: the 50 states and ‘Not a US state.’

An event is a subset of \(\Omega\). When \(\Omega\) is a finite set, and \(A\) is a subset of \(\Omega\) (written as \(A\subseteq \Omega\) ), then the probability of the event \(A\) is the size of \(A\) divided by the size of \(\Omega\). 1 In the example above, the subset \(A = \lbrace\)Washington, Oregon, California\(\rbrace\) is an event (which you maybe describe as the respondent “living on the west coast”). The size of \(A\) is \(3\) and the size of \(\Omega\) is \(51\). So, with the given definition of probability of an event, \(A\) has probability \(3/51 \approx 0.059\) (in other words, about \(5.9\%\)). Does this probability seem real to life? Maybe not — there are a lot of people living in California, almost 1/8 of the entire country’s population. What if, instead, the question was being posed to US senators, and the question asked what state they represent? (Throw out the ‘Not a US state’ option, so now \(\Omega\) has size \(50\).) Then the probability is \(3/50\), and this is the proportion of senators from those states.

Write \({\bf P}(A)\) for the probability of an event \(A\subseteq\Omega\). The size of a set \(A\) (the number of elements in it) is often written as \(|A|\), so we have \({\bf P}(A) = |A|/|\Omega|\).

An Ace of spades and Jack of spades. Image obtained from website run by Bicycle Casino.

Example 1. Suppose someone is dealt one card from the top of a shuffled deck of playing cards (to be clear, the deck has 52 cards with only one of each card). The observation is the card that is dealt. The subset representing the event that “an Ace is dealt” has four elements: \(\lbrace A\clubsuit, A\diamondsuit, A\spadesuit, A\heartsuit\rbrace\). Its probability is \(4/52 = 1/13\).

Example 2. Someone is dealt two cards, in order, from the top of a shuffled 52-card deck. Now, the observation is the ordered pair of cards dealt. Find the probability that the first card is a Queen (\(Q\)).

The sample space \(\Omega\) is all ordered pairs of cards coming from the one deck. There are just 51 cards left when the second card is drawn, so \(|\Omega| = 52\cdot 51 = 2652\). 2 This matches our event (call the event \(A\)) as long as the first card in the pair is a Queen. The second card is unimportant for the event. So we should look at those pairs of cards which look like:

\((Q\clubsuit, anything),\)
\((Q\diamondsuit, anything),\)
\((Q\spadesuit, anything),\)
\((Q\heartsuit, anything).\)

There are \(51\) possibilities for each. Also, whenever the first cards are different, then the two observations (ordered pairs) cannot be the same observation, so \(4\cdot 51\) does not double count anything and \(|A| = 204\). Thus \({\bf P}(A) = 204/2652 = 1/13 \approx 0.077\).

Exercise. Four cards were already drawn from from a 52-card deck and none of them were face cards (a card that is either a King, a Queen, or a Jack). What is the probability that the next card is a face card?


Counting.

As already seen in the previous example, some facts about determining the size of a set are needed. You likely encountered these in an early math course.

Suppose an ordered pair has \(m\) possibilities for its first object and, regardless of what the first object is, there are \(n\) possibilities for the second object. Then there are \(mn\) different ordered pairs. 3 In this general setting, the number of possibilities for the first and second object might have nothing to do with each other. For example, you roll a six-sided die (6 possibilities) and then you draw a playing card (52 possibilities).

If it is the case that the second object comes from what remains after the first object was observed (like in the example of drawing two cards in succession), then \(n = m – 1\). This thinking can be applied multiple times. For example, if you were to draw \(r=4\) cards in succession from a 52-card deck (with each card distinct, \(m=52\)), then there are \(m(m-1)(m-2)(m-3) = (52)(51)(50)(49)\) possible outcomes. The series of four draws is an ordered tuple: \((card1, card2, card3, card4)\). In terms of factorials, the number of possibilities is \(\dfrac{52!}{48!}\); generally, drawing \(r\) items in order from \(m\) distinct items, gives \(\dfrac{m!}{(m-r)!}\) possibilities.

Finally, there are experiments where you do not care about the order. To count the possibilities in this case, you divide the count in the previous paragraph by the number of different orders. For example, you might want to know the probability of being dealt a certain set of 5 cards from a deck of playing cards, and only care about which cards are in your hand. For each set of 5 cards, there are \(5!\) different ways to order it. 4 Since there are that many ways to order, the number of different sets of 5 cards is (num. of ordered 5-tuples) divided by \(5!\).

In general, the number of possibilities for choosing \(r\) items (ignoring order), from a set of \(m\) items, is

\(\dfrac{m!}{(m-r)!r!} = \binom{m}{r}\).

Monty Burns playing poker. GIF obtained from gifer.com.

In that formula, it is essential that the items are distinct from each other. If you had a deck of cards with two Aces of hearts, and no Ace of diamonds, then the number of possible 5-card hands would not be \(\binom{52}{5}\). Using a smaller size deck, this is quite easy to see. Say a deck has three cards, two of which have a 1 printed on them (and are indistinguishable) and the third card has a 2 printed on it. The only two-card hands that can be drawn are: \(\lbrace 1,1\rbrace\) and \(\lbrace 1,2\rbrace\). So, in this non-unique item case, there are not \(\binom{3}{2} = 3\) possibilities.

Here is an example of these counting techniques at work when computing a probability.

Example 3. Suppose two cards are dealt randomly from a single 52-card playing deck. Compute the probability that at least one of the cards is a Queen.

It doesn’t matter whether a Queen is dealt as the first or the second card (or even as both). The sample space \(\Omega\) is the collection of “hands” of two playing cards (which cannot be identical); the event \(A\) contains those hands with at least one Queen.
A Queen in the hand means one of \(Q\clubsuit, Q\diamondsuit, Q\spadesuit, Q\heartsuit\). If there is exactly one Queen, then there are only 48 possibilities for the other card. There are \(4\cdot 48\) such outcomes (the suit of the Queen changes, so this doesn’t double count). If both cards in the hand are Queens, they must come from the set of 4 Queens. There are \(\binom{4}{2}=6\) such possibilities.
Since, \(|\Omega| = \binom{52}{2} = 26\cdot 51 = 1326\),

\({\bf P}(A) = \dfrac{4\cdot 48 + 6}{1326} \approx 0.149.\)


Probability functions.

Given any sample space \(\Omega\) (possibly infinite), a probability function \({\bf P}\) always assigns a number to any event so that 3 conditions are met.

    • \({\bf P}(A) \geq 0\) for every \(A\subseteq\Omega\)
    • \({\bf P}(\Omega) = 1\).
    • If \(A_1,A_2,\ldots,A_n\) are subsets of \(\Omega\) so that \(A_i\cap A_j = \emptyset\) whenever \(i\neq j\), then
      \({\bf P}(A_1\cup A_2\cup\ldots\cup A_n) = \sum_{i=1}^n {\bf P}(A_i). \)

When a collection of sets is like \(A_1,A_2,\ldots,A_n\) in the third bullet, so that \(A_i\cap A_j = \emptyset\) when \(i\neq j\), it is common to call them either mutually exclusive or pairwise disjoint.

The finite case we have been discussing, where we define \({\bf P}(A) = |A|/|\Omega|\), satisfies the conditions. Checking this claim is straightforward, at least for the first two bullet points. For the third bullet point, if \(n=2\) (so you only have \(A_1, A_2\)) then the fact that \(A_1\) and \(A_2\) have no common points means that \(|A_1\cup A_2| = |A_1| + |A_2|\). The identity in the third bullet follows from this and an induction argument. Note, the equation in the third bullet will not hold if any pair \(A_i, A_j\) have a common point. 5

For a subset \(A\subseteq \Omega \), write \(\bar A\) for the complement – the set of points in \(\Omega\) but not in \(A\). Then, since \(A \cap \bar A = \emptyset\) and \(A\cup\bar A = \Omega\), you get that

\(1 = {\bf P}(\Omega) = {\bf P}(A\cup\bar A) = {\bf P}(A) + {\bf P}(\bar A).\)

And so it must be that \({\bf P}(\bar A) = 1 – {\bf P}(A)\), regardless of the probability function. This observation can be really helpful for simplifying computations.

For example, recall the question of Example 3. We could consider \(\bar A\) instead — not getting any Queens —  and get \({\bf P}(A)\) by finding \(1-{\bf P}(\bar A)\). If the hand does not have any Queens, then both cards come from a set of 48 cards, meaning that \(|\bar A| = \binom{48}{2}\). And so

\({\bf P}(A) = 1 – {\bf P}(\bar A) = 1-\dfrac{\binom{48}{2}}{\binom{52}{2}} = 1 – \dfrac{(48)(47)}{(52)(51)} \approx 1 – 0.851 = 0.149. \)

Sometimes you want to know the probability of one event OR another event occurring (that is, the probability of the union). If the events are mutually exclusive (as defined above), then the probability of the union is the sum of the individual probabilities. However, in general, if \(A_1\) and \(A_2\) are two events then

\({\bf P}(A_1\cup A_2) = {\bf P}(A_1) + {\bf P}(A_2) – {\bf P}(A_1\cap A_2)\).

For the given definition of probability, when \(\Omega\) is finite, deriving the equation above is the same as checking that \(|A_1\cup A_2| = |A_1| + |A_2| – |A_1\cap A_2|\). The intuitive reason this is true is that each point that is in both subsets will only get counted once in \(|A_1\cup A_2|\), but will get counted twice when finding \(|A_1|+|A_2|\). A more careful proof could be given by induction on the size of \(A_1\cap A_2\).

Example 4. Suppose two cards are dealt randomly from a single 52-card playing deck. Compute the probability that at least one of the cards is either a Queen or a King.

Let \(A\) be the event that at least one card is a Queen, and let \(B\) be the event that at least one is a King. In Example 3 above, we computed that \({\bf P}(A) \approx 0.149\). Since the number of Kings in the deck is the same as the number of Queens, it is obvious that \({\bf P}(B) = {\bf P}(A)\).
Consider the set \(A\cap B\). This is the set of two-card hands where there is at least one Queen AND at least one King. So the hand has the Queen of some suit and the King of some suit (4 choices for each one). The number of such hands is \(4\cdot 4 = 16\). This means that

\({\bf P}(A\cup B) \approx 0.149 + 0.149\ – \dfrac{16}{\binom{52}{2}} \approx 0.286\).


Conditional Probability.

It is so, so, so important to remember that the probability of an event represents its likelihood before the observation. Said in a bit different of a way, if it is an event that has already been observed then its probability is always the same: it has probability 1. Period. Because, well, it happened. From the point of view of the sample space, if you have the outcome already, it is the only possible outcome, so it is the only thing in \(\Omega\).

This makes it natural to think about the following situation: what if you already know that some other event occurred? Does it change the probability of the event you were originally interested in? Yes, sometimes!

If you know event \(B\) happened, your observation must come from within \(B\) because, with that knowledge, the only possible outcomes are those within \(B\).

You are still interested in \(A\) though. Since your observation must come from \(B\), any outcomes in \(\Omega\) that are not in \(B\) are not possible. Therefore, you should just divide by \(|B|\). Moreover, you should only consider the possibilities in \(A\) that are also in \(B\), i.e. the outcomes in \(A\cap B\). So, the probability of \(A\) given \(B\) is:

\({\bf  P}(A | B) = \dfrac{|A\cap B|}{|B|} = \dfrac{|A\cap B|}{|\Omega|}\dfrac{|\Omega|}{|B|} = \dfrac{{\bf P}(A\cap B)}{{\bf P}(B)}.\)

Note how this makes it so that if every outcome in \(B\) is also in \(A\), then \({\bf P}(A | B) = 1\) (since, in this case, \(A\cap B = B\)).

Exercise. From a shuffled 52-card deck, you deal someone a card, so that you cannot see what it is. That person tells you the card is a face card (remember, this means a King, Queen, or Jack). Given that the card is a face card, find the probability that

    • the card is the Queen of hearts
    • the card is a red King (red meaning it is the heart suit or diamond suit)
    • the card is of the heart suit.

Two events \(A\) and \(B\) are called independent if \({\bf P}(A | B) = {\bf P}(A)\). In words, this says that whether \(B\) occurs or not, the probability of \(A\) would be unchanged. By the way, this implies that the probability of \(B\) is also unchanged by knowing whether \(A\) occurs, since

\({\bf P}(A) = {\bf P}(A | B) = \dfrac{{\bf P}(A\cap B)}{{\bf P}(B)}\)

means that \({\bf P}(B) = \dfrac{{\bf P}(A\cap B)}{{\bf P}(A)} = {\bf P}(B | A)\). Note, that events \(A\) and \(B\) are independent if and only if 6 \({\bf P}(A\cap B) = {\bf P}(A){\bf P}(B)\) also.

Exercise. One card is dealt from a (shuffled) 52-card playing card deck. Which of the following events are independent?

    1. \(A = \) the card is a heart; \(B = \) the card is a face card.
    2. \(A = \) the card is a 9; \(B = \) the card is an Ace.

Exercise. Two cards are dealt from a (shuffled) 52-card playing card deck. Which of the following events are independent? (Read “one card is…” as meaning exactly one card.)

    1. \(A = \) one card is a 2; \(B = \) one card is a Queen.
    2. \(A = \) one card is a heart; \(B = \) one card is an Ace.

Can you explain some of the calculations (in the 5-card poker table) on the Poker Probability Wikipedia page?


Bayes’ Theorem

There is an identity that goes by the name of Bayes’ Theorem. While it is a rather simple rearrangement of the definition of conditional probabilities, it is a very useful identity to keep in mind when using data to inform your belief (more precisely, to update your belief) of how likely something is. This is called using Bayesian inference, and I will write more on this later.

Bayes’ Theorem: \(\qquad\qquad P(B|A) = \dfrac{P(A|B)P(B)}{P(A)}\)

To see that the identity is true, you can simply write the definition of \({\bf P}(B | A) = {\bf P}(A\cap B)/{\bf P}(A)\) and the definition of \({\bf P}(A | B) = {\bf P}(A\cap B)/{\bf P}(B)\).

As an example of using Bayes’ theorem, let’s see how it works in a game.

Someone tells you they have three coins. There is a “heads” coin, which has heads on both sides. There is a “tails” coin with tails on both sides. Finally, the third coin is normal, one side heads and one side tails. You check and see they are telling the truth, and that there is no way to distinguish any particular heads side from another (and similarly with tails).
After letting you examine the coins, they mix up the three coins in a bag. Then they pick one of the coins and slam it down on a table. It is showing a heads side.
“Let’s make this interesting,” they say. “If this is the “heads” coin, with heads on the other side, then you pay me $20. If it is the normal coin, though, then I’ll pay you $30. Deal?”
Do you take the deal?

The answer comes down to the conditional probability \({\bf P}(\textrm{“heads” coin} | \textrm{heads shows})\). That is, let \(HH\) be the event that the coin is the heads coin and let \(H\) be the (given) event, that there is a head side showing on the table. Then we want to know \({\bf P}(HH | H)\). If the probability is 1/2, then you should take the deal, since, in that case, on average you would win half the time and the reward is larger than the loss. However, …

Note the total probabilities \({\bf P}(HH) = 1/3\) and \({\bf P}(H) = 1/2\), the latter since the total number of sides of coins is 6, and 3 of them are heads. Also, we know that \({\bf P}(H | HH) = 1\) since, if the coin has heads on both sides, it has to show heads. By Bayes’ Theorem,

\({\bf P}(HH | H) = \dfrac{{\bf P}(H | HH){\bf P}(HH)}{{\bf P}(H)} = \dfrac{(1/3)}{(1/2)} = \dfrac23\).

In other words, given that a heads side is showing, it is a 2/3 chance that the coin is the “heads” coin. But then, on average, your reward is \((2/3)(-20)+(1/3)(30) = -10/3\) dollars. You’ll lose money on average, so you don’t take the deal.


  1. At least, this is very often how probability is defined when the sample space is finite. It is, however, not necessary (and, in cases, not the best choice) to define the probability this way.
  2. You don’t put the first card back before the second draw!
  3. In the notation of sets, if \(U, V\) are finite sets with \(|U|=m\) and \(|V|=n\), then \(|U\times V|=mn\).
  4. Each way is an ordered tuple of 5 cards, from just 5 possible cards. So there are \(5!/(5-5)! = 5!\) possibilities.
  5. For the probability function given on finite sets.
  6. Implicit to our discussion of independent events is the requirement that both events have non-zero probability.