Probability Models and Axioms

Probability
Mathematics
Building the foundation of probability theory from scratch: sample spaces, events, and the three axioms that govern all of probability.
Author

Sushrut

Published

March 25, 2026

This post is the first in a series covering the core ideas from MIT’s Probabilistic Systems Analysis and Applied Probability course (MIT OpenCourseWare, 2014) and its newer counterpart, Introduction to Probability (MIT OpenCourseWare, 2018). The primary textbook for this series is Bertsekas & Tsitsiklis (2008), and we will occasionally reference Blitzstein & Hwang (2019) for complementary perspectives. This first post covers the material from Sections 1.1 and 1.2 of Bertsekas & Tsitsiklis (2008).

Why Probability?

Imagine you are about to leave the house. You glance at the sky: it is overcast, but not dark. Should you carry an umbrella?

You do not know whether it will rain. Nobody does, not with certainty. But you do have beliefs about how likely rain is. Maybe you checked the forecast and saw a 30% chance. Maybe you just noticed the clouds and thought “probably not, but maybe.” Either way, you are doing something interesting: you are reasoning about uncertainty. You are assigning some notion of “likelihood” to different possible futures and using that to make a decision.

This kind of reasoning is not limited to weather. An engineer designing a communication system must account for random noise. A data scientist building a model must account for variability in the data. A doctor interpreting a test result must weigh the probability that the patient actually has the disease. A poker player deciding whether to fold must estimate the likelihood of various hands.

The situations differ wildly, but the underlying challenge is the same: how do we think about uncertainty in a precise, systematic way? Probability theory gives us exactly this. It provides a mathematical framework for describing what could happen, how likely each possibility is, and how to reason logically from those likelihoods to useful conclusions.

In this post, we build that framework from the ground up. By the end, you will know the three ingredients of any probabilistic model and the three axioms that every probability assignment must satisfy. These axioms are simple, almost obvious, but they are the foundation on which the entire edifice of probability and statistics rests.

The Road Ahead

Before we dive in, here is a map of the journey we are about to take.

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style A fill:#10a37f,color:#fff

We start with the most basic question: if we are going to describe an uncertain situation, what is the first thing we need to write down?

The Sample Space

Starting with a Simple Experiment

Let us begin with something concrete. You have a coin. You flip it once. What are all the possible things that could happen?

Only two: Heads or Tails. That is it. There is no third possibility (we are ignoring the coin landing on its edge, which is a modeling choice we will discuss shortly). We can write this as a set:

\[ \Omega = \left\{\text{Heads}, \text{Tails}\right\}. \]

This set, \(\Omega\) (the Greek capital letter “omega”), is called the sample space of the experiment. It is the collection of all possible outcomes. Every probabilistic model starts here: before we can talk about how likely anything is, we first need to be completely clear about what can happen.

Two Requirements

The sample space must satisfy two properties:

  1. Mutually exclusive: at the end of the experiment, exactly one outcome occurs. If the coin lands Heads, it did not land Tails. The outcomes cannot overlap.
  2. Collectively exhaustive: the list covers every possibility. No matter what happens when you flip the coin, the result is somewhere in \(\Omega\). You have not forgotten anything.

These two requirements together mean that when the experiment is done, you can point to exactly one element of \(\Omega\) and say: “this is what happened.”

The Art of Choosing the Right Sample Space

Here is a subtlety that is easy to overlook. Consider the same coin flip, but now imagine someone proposes the following sample space:

\[ \Omega' = \left\{\text{Heads}, \text{Tails and raining outside}, \text{Tails and not raining outside}\right\}. \]

Is this a valid sample space? Technically, yes. The three outcomes are mutually exclusive and collectively exhaustive. But is it useful? Almost certainly not. Unless you have some superstitious belief that the weather affects your coin, including “raining outside” in your model adds complexity without adding insight.

This illustrates a broader point that applies to all of science and engineering: when you build a model of a real situation, you choose which details to include and which to leave out. A good sample space captures the features that matter for your problem and ignores the rest. There is no single “correct” sample space for a given experiment; there is only a useful one. As Bertsekas & Tsitsiklis (2008) put it, choosing the right level of granularity for the sample space is partly an art.

A Richer Example: Two Rolls of a Tetrahedral Die

Coin flips are a fine starting point, but let us work with something that gives us more to explore. Imagine a four-sided die (a tetrahedron). When you roll it, the result is 1, 2, 3, or 4. Now consider the experiment of rolling this die twice.

A crucial point: we treat both rolls together as a single experiment, not as two separate experiments. The outcome of this single experiment is an ordered pair \(\left(x, y\right)\), where \(x\) is the result of the first roll and \(y\) is the result of the second roll.

For example, \(\left(2, 3\right)\) means “the first roll was 2 and the second roll was 3.” This is a different outcome from \(\left(3, 2\right)\), where the first roll was 3 and the second roll was 2. Even though the same two numbers appeared, the order matters.

The full sample space is the set of all such pairs:

\[ \Omega = \left\{\left(i, j\right) : i \in \left\{1, 2, 3, 4\right\},\; j \in \left\{1, 2, 3, 4\right\}\right\}. \]

This gives us \(4 \times 4 = 16\) possible outcomes. We can visualize them as a grid.

Y \ X    1      2      3      4
  4    (1,4)  (2,4)  (3,4)  (4,4)
  3    (1,3)  (2,3)  (3,3)  (4,3)
  2    (1,2)  (2,2)  (3,2)  (4,2)
  1    (1,1)  (2,1)  (3,1)  (4,1)
Figure 1: The sample space for two rolls of a tetrahedral die. Each cell represents one outcome \((x, y)\).

Let \(X\) denote the result of the first roll (the column) and \(Y\) denote the result of the second roll (the row). Each cell in Figure 1 is one outcome, and there are 16 cells in total.
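If you like to keep a computational companion alongside the math, this sample space is easy to enumerate. A minimal sketch (the variable name `omega` is ours, not from the textbook):

```python
from itertools import product

# All ordered pairs (x, y) with x and y in {1, 2, 3, 4}: each pair is
# one outcome of the two-roll experiment.
omega = set(product(range(1, 5), repeat=2))

print(len(omega))        # 16, matching the 4 x 4 grid
print((2, 3) in omega)   # True: "first roll 2, then 3" is an outcome
print((2, 3) != (3, 2))  # True: order matters, so these are distinct outcomes
```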

The Sequential (Tree) Description

There is another way to describe this same sample space that is especially useful when the experiment unfolds in stages. Since our experiment has two stages (first roll, then second roll), we can draw a tree that traces out all the possible ways the experiment can evolve.

flowchart LR
    R["Start"] --> A1["1"]
    R --> A2["2"]
    R --> A3["3"]
    R --> A4["4"]

    A1 --> B11["(1,1)"]
    A1 --> B12["(1,2)"]
    A1 --> B13["(1,3)"]
    A1 --> B14["(1,4)"]

    A2 --> B21["(2,1)"]
    A2 --> B22["(2,2)"]
    A2 --> B23["(2,3)"]
    A2 --> B24["(2,4)"]

    A3 --> B31["(3,1)"]
    A3 --> B32["(3,2)"]
    A3 --> B33["(3,3)"]
    A3 --> B34["(3,4)"]

    A4 --> B41["(4,1)"]
    A4 --> B42["(4,2)"]
    A4 --> B43["(4,3)"]
    A4 --> B44["(4,4)"]

The first level of the tree represents the four possible results of the first roll. From each of those, four branches lead to the possible results of the second roll. Each leaf (endpoint) of the tree corresponds to exactly one outcome in our sample space. There are \(4 \times 4 = 16\) leaves, matching the 16 cells in the grid.

A quick note on language: when we talk about the result of a single roll, we call it a result or a stage result. We reserve the word outcome for the complete description of what happened in the entire experiment. So \(\left(2, 3\right)\) is the outcome; 2 is the result of the first stage, and 3 is the result of the second stage.

Finite vs. Infinite Sample Spaces

The tetrahedral die example has a finite sample space: there are exactly 16 outcomes. But not every experiment is this tidy. Consider the following scenario.

You are throwing a dart at a square dartboard whose sides go from 0 to 1. You are guaranteed to hit the board (you are very good at darts), but exactly where the dart lands is uncertain. A typical outcome is a pair of coordinates \(\left(x, y\right)\) where \(x\) and \(y\) are real numbers between 0 and 1. The sample space is:

\[ \Omega = \left\{\left(x, y\right) : 0 \leq x \leq 1,\; 0 \leq y \leq 1\right\}. \]

This is the unit square, and it contains infinitely many points. In fact, it contains uncountably many points. This is an example of a continuous sample space, and it will require us to think about probability slightly differently than in the discrete case. We will come back to this example after we have established the rules.

Let us take stock. We now know the first ingredient of any probabilistic model: the sample space \(\Omega\). It is a set that lists all the possible outcomes of an experiment, and it must be mutually exclusive and collectively exhaustive. We have seen that sample spaces can be finite (like the die) or infinite (like the dartboard). But knowing what can happen is only half the story. The next question is: how likely is each possibility?

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style A fill:#555555,color:#aaaaaa
    style B fill:#555555,color:#aaaaaa
    style C fill:#10a37f,color:#fff

Events: The Objects That Get Probabilities

Why Not Assign Probabilities to Individual Outcomes?

Your first instinct might be: just assign a probability to every single outcome. For the tetrahedral die, that works fine. We have 16 outcomes, and we can assign a number to each one.

But remember the dartboard. Consider a specific point, say \(\left(0.5, 0.3\right)\). What should the probability of hitting exactly that point be? Intuitively, it should be zero. A single point has no area, and with infinitely many points on the board, the chance of hitting any particular one with infinite precision is zero.

If every individual outcome has zero probability, then telling someone “the probability of each outcome is zero” is not very informative. It does not help us distinguish between, say, the upper half of the board and a tiny corner. We need a richer language.

Events to the Rescue

The solution is to assign probabilities not to individual outcomes, but to subsets of the sample space. In probability, these subsets are called events.

An event is simply a collection of outcomes. For example, in the tetrahedral die experiment (whose sample space is given in Figure 1):

  • “The first roll is 1” is the event \(\left\{\left(1,1\right), \left(1,2\right), \left(1,3\right), \left(1,4\right)\right\}\).
  • “The sum of the two rolls is odd” is the event containing all outcomes \(\left(x, y\right)\) where \(x + y\) is an odd number.
  • “The minimum of the two rolls is 2” is the event containing all outcomes where \(\min\left(x, y\right) = 2\).

In the dartboard experiment:

  • “The dart lands in the upper half” is the event \(\left\{\left(x, y\right) : y > 0.5\right\}\).
  • “The sum of the coordinates is at most \(1/2\)” is the event \(\left\{\left(x, y\right) : x + y \leq 1/2\right\}\), which is a triangular region.

We say that an event \(A\) occurred if the actual outcome of the experiment is an element of \(A\). If the outcome falls outside \(A\), then \(A\) did not occur. This language turns out to be very natural: “Did the event ‘the first roll is 1’ occur? Yes, because the outcome was \(\left(1, 3\right)\), and that is in the set.”

Now we are ready for the main question: if we are going to assign probabilities to events, what rules should those probabilities follow?

The Probability Axioms

The Ground Rules

A probability law is a function that takes an event (a subset of \(\Omega\)) and returns a number, which we interpret as the “likelihood” that the event occurs. But not just any function will do. The function must obey three rules, called the axioms of probability. These axioms, first formalized by Andrey Kolmogorov in 1933 (Kolmogoroff, 1933), are the ground rules that any legitimate probability assignment must satisfy.

Note: The Three Axioms of Probability

Let \(\Omega\) be a sample space and let \(\mathrm{P}\) be a probability law defined on events (subsets of \(\Omega\)). Then:

  1. (Nonnegativity) For every event \(A\),

\[ \mathrm{P}\left(A\right) \geq 0. \]

  2. (Normalization) The probability of the entire sample space is

\[ \mathrm{P}\left(\Omega\right) = 1. \]

  3. (Additivity) If two events \(A\) and \(B\) are disjoint (meaning \(A \cap B = \varnothing\); they share no outcomes), then

\[ \mathrm{P}\left(A \cup B\right) = \mathrm{P}\left(A\right) + \mathrm{P}\left(B\right). \]

Let us unpack each one.

  • Axiom 1 (Nonnegativity) says that probabilities cannot be negative. A probability of zero means “this event is essentially impossible.” A probability of 0.5 means “this event is moderately likely.” But \(-0.3\) as a probability? That has no meaning. Probabilities are always at least zero.
  • Axiom 2 (Normalization) says that something must happen. Since the sample space \(\Omega\) contains all possible outcomes, the event “the outcome is somewhere in \(\Omega\)” is certain. We represent certainty with the number 1.
  • Axiom 3 (Additivity) is the most interesting one. It says: if two events cannot both happen at the same time (they are disjoint), then the probability of either one happening is the sum of their individual probabilities.

The Cream Cheese Analogy

Here is an intuitive way to think about what the axioms are saying, adapted from the lecture by Prof. Tsitsiklis (MIT OpenCourseWare, 2018). Imagine you have exactly one pound of cream cheese, and you spread it over the sample space \(\Omega\). The total amount of cream cheese is 1 (that is Axiom 2). No region has a negative amount of cream cheese (Axiom 1). And if you have two non-overlapping regions \(A\) and \(B\), the total cream cheese on \(A \cup B\) is obviously the cream cheese on \(A\) plus the cream cheese on \(B\) (Axiom 3).

Probabilities behave just like cream cheese, or like mass, or like area. They are spread across the sample space, they add up to 1, and when you combine non-overlapping pieces, the total is the sum of the parts.

A First Consequence: Probabilities Are at Most 1

You might have noticed that the axioms say probabilities are at least 0, but they do not explicitly say that probabilities are at most 1. Do we need a fourth axiom for that? No. It follows from the three axioms we already have. Here is the argument.

Take any event \(A\). Its complement, \(A^{\mathsf{c}}\), is the set of all outcomes not in \(A\). Together, \(A\) and \(A^{\mathsf{c}}\) cover the entire sample space, and they share no outcomes. So:

\[ \begin{align*} \mathrm{P}\left(\Omega\right) &= \mathrm{P}\left(A \cup A^{\mathsf{c}}\right) && \text{(because } A \cup A^{\mathsf{c}} = \Omega\text{)}\\ 1 &= \mathrm{P}\left(A\right) + \mathrm{P}\left(A^{\mathsf{c}}\right) && \text{(Axioms 2 and 3)}\\ \mathrm{P}\left(A\right) &= 1 - \mathrm{P}\left(A^{\mathsf{c}}\right). \end{align*} \]

Since \(\mathrm{P}\left(A^{\mathsf{c}}\right) \geq 0\) by Axiom 1, we conclude that \(\mathrm{P}\left(A\right) \leq 1\).

This is a small but satisfying result. It shows that the axioms are not just a list of rules; they interact with each other in useful ways. We used all three axioms in just four lines to derive something we wanted to be true. This is the power of working from axioms: you state a few clean assumptions, and then you prove everything else.

Extending Additivity to More Than Two Events

Axiom 3 talks about two disjoint events. What if we have three? Suppose \(A\), \(B\), and \(C\) are pairwise disjoint (no two of them share an outcome). Then:

\[ \begin{align*} \mathrm{P}\left(A \cup B \cup C\right) &= \mathrm{P}\left(\left(A \cup B\right) \cup C\right) && \text{(rewrite the union)}\\ &= \mathrm{P}\left(A \cup B\right) + \mathrm{P}\left(C\right) && \text{(Axiom 3, since } (A \cup B) \cap C = \varnothing\text{)}\\ &= \mathrm{P}\left(A\right) + \mathrm{P}\left(B\right) + \mathrm{P}\left(C\right). && \text{(Axiom 3 again)} \end{align*} \]

By repeating this argument, we can extend additivity to any finite number of disjoint events. If \(A_1, A_2, \dots, A_n\) are pairwise disjoint, then:

\[ \mathrm{P}\left(A_1 \cup A_2 \cup \cdots \cup A_n\right) = \mathrm{P}\left(A_1\right) + \mathrm{P}\left(A_2\right) + \cdots + \mathrm{P}\left(A_n\right). \tag{1}\]

An important special case: suppose we have a finite set of outcomes \(\left\{s_1, s_2, \dots, s_k\right\}\). Each single outcome \(\left\{s_i\right\}\) is an event by itself, and these single-outcome events are pairwise disjoint. So by Equation 1:

\[ \mathrm{P}\left(\left\{s_1, s_2, \dots, s_k\right\}\right) = \mathrm{P}\left(s_1\right) + \mathrm{P}\left(s_2\right) + \cdots + \mathrm{P}\left(s_k\right). \tag{2}\]

In other words, the probability of a finite collection of outcomes is the sum of their individual probabilities. This is extremely useful, as we will see next.
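Equation 2 translates directly into a few lines of code. A minimal sketch, with a made-up probability assignment on four outcomes (the values are arbitrary, chosen only to be nonnegative and sum to 1, as the axioms require):

```python
from fractions import Fraction

# A made-up probability law on a four-outcome sample space.
p = {1: Fraction(1, 10), 2: Fraction(2, 10),
     3: Fraction(3, 10), 4: Fraction(4, 10)}

def P(event):
    """Equation 2: the probability of a finite set of outcomes is the
    sum of the individual outcome probabilities."""
    return sum(p[s] for s in event)

print(P({1, 3}))        # 2/5, i.e. 1/10 + 3/10
print(P({1, 2, 3, 4}))  # 1: the whole sample space, matching Axiom 2
```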

We now have the rules of the game. Let us put them to work.

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style A fill:#555555,color:#aaaaaa
    style B fill:#555555,color:#aaaaaa
    style C fill:#555555,color:#aaaaaa
    style D fill:#555555,color:#aaaaaa
    style E fill:#10a37f,color:#fff

The Discrete Uniform Law

Setting Up the Probability Law

Let us return to our tetrahedral die experiment. We have the sample space: 16 outcomes arranged in the grid of Figure 1. Now we need to assign a probability law.

We will make the simplest possible assumption: all outcomes are equally likely. This is a modeling choice, not a mathematical necessity. It reflects a belief that the die is fair and that neither roll influences the other. Under this assumption, each outcome has the same probability \(p\). Since there are 16 outcomes and they must sum to 1 (by Axiom 2 and Equation 2):

\[ \begin{align*} \underbrace{p + p + \cdots + p}_{16 \text{ times}} &= 1,\\ 16p &= 1,\\ p &= \frac{1}{16}. \end{align*} \]

So every cell in the grid has probability \(1/16\). This particular probability law is called the discrete uniform law. More generally, if a sample space has \(N\) equally likely outcomes, then each outcome has probability \(1/N\), and the probability of any event \(A\) is:

\[ \mathrm{P}\left(A\right) = \frac{\text{number of outcomes in } A}{N}. \tag{3}\]

Under the discrete uniform law, computing probabilities reduces to counting: count the outcomes that belong to the event, then divide by the total number of outcomes. This sounds simple, and for small examples it is. But counting can become surprisingly tricky for larger problems. In fact, we will devote an entire future post to systematic counting techniques (combinatorics).
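Equation 3 is short enough to mirror as a one-line helper. A sketch for the two-roll experiment (the function name `uniform_prob` is ours):

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls of the tetrahedral die (16 outcomes).
omega = set(product(range(1, 5), repeat=2))

def uniform_prob(event):
    """Equation 3: under the discrete uniform law, P(A) is the number of
    outcomes in A divided by the total number of outcomes N."""
    return Fraction(len(event & omega), len(omega))

print(uniform_prob({(1, 1)}))  # 1/16: each single outcome has probability 1/16
print(uniform_prob(omega))     # 1: Axiom 2 is satisfied
```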

Four Probability Questions

Let us use Equation 3 to answer four questions about the two-roll experiment, using \(X\) for the first roll and \(Y\) for the second roll.

Question 1: \(\mathrm{P}\left(\left(X, Y\right) \text{ is } \left(1,1\right) \text{ or } \left(1,2\right)\right)\)

The event consists of two outcomes: \(\left(1,1\right)\) and \(\left(1,2\right)\). By Equation 2:

\[ \mathrm{P}\left(\left\{\left(1,1\right), \left(1,2\right)\right\}\right) = \mathrm{P}\left(1,1\right) + \mathrm{P}\left(1,2\right) = \frac{1}{16} + \frac{1}{16} = \frac{2}{16} = \frac{1}{8}. \]

Question 2: \(\mathrm{P}\left(X = 1\right)\)

The event “the first roll is 1” corresponds to the set \(\left\{\left(1,1\right), \left(1,2\right), \left(1,3\right), \left(1,4\right)\right\}\). Visually, this is the entire first column of the grid. There are 4 outcomes, so:

\[ \mathrm{P}\left(X = 1\right) = \frac{4}{16} = \frac{1}{4}. \]

Question 3: \(\mathrm{P}\left(X + Y \text{ is odd}\right)\)

The sum \(X + Y\) is odd when one roll is even and the other is odd. Let us find these outcomes in the grid.

Y \ X     1       2       3       4
  4    (1,4)*  (2,4)   (3,4)*  (4,4)
  3    (1,3)   (2,3)*  (3,3)   (4,3)*
  2    (1,2)*  (2,2)   (3,2)*  (4,2)
  1    (1,1)   (2,1)*  (3,1)   (4,1)*
Figure 2: Outcomes where \(X + Y\) is odd are highlighted (marked with an asterisk).

As we can see in Figure 2, the highlighted outcomes form a checkerboard pattern. There are 8 such outcomes, so:

\[ \mathrm{P}\left(X + Y \text{ is odd}\right) = \frac{8}{16} = \frac{1}{2}. \]

This makes intuitive sense: exactly half the outcomes give an odd sum.

Question 4: \(\mathrm{P}\left(\min\left(X, Y\right) = 2\right)\)

This one requires a bit more care. The event “\(\min\left(X, Y\right) = 2\)” means that the smaller of the two rolls is exactly 2. This can happen in several ways: both rolls are 2, or one roll is 2 and the other is larger than 2. Crucially, neither roll can be 1 (otherwise the minimum would be 1, not 2).

Y \ X     1       2       3       4
  4    (1,4)   (2,4)*  (3,4)   (4,4)
  3    (1,3)   (2,3)*  (3,3)   (4,3)
  2    (1,2)   (2,2)*  (3,2)*  (4,2)*
  1    (1,1)   (2,1)   (3,1)   (4,1)
Figure 3: Outcomes where \(\min(X, Y) = 2\) are highlighted (marked with an asterisk).

Counting the highlighted cells in Figure 3, we find 5 outcomes: \(\left(2,2\right)\), \(\left(2,3\right)\), \(\left(2,4\right)\), \(\left(3,2\right)\), and \(\left(4,2\right)\). So:

\[ \mathrm{P}\left(\min\left(X, Y\right) = 2\right) = \frac{5}{16}. \]
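All four answers can be cross-checked by brute-force counting over the 16-cell grid. A quick sketch (variable names are ours):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 5), repeat=2))

def P(event):
    # Discrete uniform law: count outcomes in the event, divide by 16.
    return Fraction(len(event), len(omega))

q1 = {(1, 1), (1, 2)}                                  # two named outcomes
q2 = {(x, y) for (x, y) in omega if x == 1}            # first roll is 1
q3 = {(x, y) for (x, y) in omega if (x + y) % 2 == 1}  # sum is odd
q4 = {(x, y) for (x, y) in omega if min(x, y) == 2}    # minimum is 2

print(P(q1), P(q2), P(q3), P(q4))  # 1/8 1/4 1/2 5/16
```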

The General Procedure

Notice the procedure we followed in every question above:

  1. Start with the sample space (the grid).
  2. Identify the event of interest by locating the outcomes that belong to it.
  3. Count them and divide by the total number of outcomes.

This three-step process is the same procedure you would follow in any probability problem, discrete or continuous. The only thing that changes is step 3: instead of counting and dividing, you might need to compute an area, an integral, or something else. But the logic is always the same: describe the sample space, pin down the event, then use the probability law to compute.

Now let us see what happens when the sample space is continuous.

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style A fill:#555555,color:#aaaaaa
    style B fill:#555555,color:#aaaaaa
    style C fill:#555555,color:#aaaaaa
    style D fill:#555555,color:#aaaaaa
    style E fill:#555555,color:#aaaaaa
    style F fill:#10a37f,color:#fff

The Continuous Uniform Law

Probability as Area

Let us return to the dartboard. The sample space is the unit square \(\Omega = \left\{\left(x, y\right) : 0 \leq x \leq 1,\; 0 \leq y \leq 1\right\}\). We need a probability law.

The simplest choice, analogous to “all outcomes are equally likely” in the discrete case, is the continuous uniform law: the probability of any region equals its area. If two subsets of the square have the same area, they have the same probability. A region with twice the area has twice the probability. And the area of the entire square is 1, which matches Axiom 2.

You can verify that this law satisfies all three axioms: areas are nonnegative (Axiom 1), the area of the whole square is 1 (Axiom 2), and the area of the union of two non-overlapping regions is the sum of their individual areas (Axiom 3).

Two Questions

Question 1: What is \(\mathrm{P}\left(\left(X, Y\right) = \left(0.5, 0.3\right)\right)\)?

A single point has zero area. So this probability is 0. This is perfectly consistent with the axioms (nothing says probabilities cannot be zero), and it matches our earlier intuition: with infinitely many possible points, hitting any one of them exactly has probability zero.

Question 2: What is \(\mathrm{P}\left(X + Y \leq 1/2\right)\)?

The event \(X + Y \leq 1/2\) is the set of all points \(\left(x, y\right)\) in the unit square that lie below (and on) the line \(x + y = 1/2\). This line crosses the \(x\)-axis at \(x = 1/2\) and the \(y\)-axis at \(y = 1/2\), forming a right triangle in the corner of the square.

Code
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

plt.style.use('dark_background')

fig, ax = plt.subplots(1, 1, figsize=(5, 5))

# Draw the unit square
square = patches.Rectangle((0, 0), 1, 1, linewidth=1.5, edgecolor='#e6e6e6', facecolor='#333333')
ax.add_patch(square)

# Shade the triangle X + Y <= 1/2
triangle = patches.Polygon(
    [[0, 0], [0.5, 0], [0, 0.5]],
    closed=True,
    facecolor='#10a37f',
    edgecolor='#10a37f',
    alpha=0.7,
    linewidth=2
)
ax.add_patch(triangle)

# Draw the line X + Y = 1/2
ax.plot([0.5, 0], [0, 0.5], color='#10a37f', linewidth=2)

# Labels
ax.set_xlabel('$x$', fontsize=14, color='#e6e6e6')
ax.set_ylabel('$y$', fontsize=14, color='#e6e6e6')
ax.set_xlim(-0.1, 1.15)
ax.set_ylim(-0.1, 1.15)
ax.set_aspect('equal')

# Annotate
ax.text(0.12, 0.12, '$X + Y \\leq \\frac{1}{2}$', fontsize=13, color='white', fontweight='bold')
ax.text(0.55, 0.02, '$\\frac{1}{2}$', fontsize=12, color='#10a37f')
ax.text(-0.07, 0.5, '$\\frac{1}{2}$', fontsize=12, color='#10a37f')

# Tick styling
ax.tick_params(colors='#e6e6e6')
ax.set_xticks([0, 0.5, 1])
ax.set_yticks([0, 0.5, 1])

for spine in ax.spines.values():
    spine.set_color('#555555')

fig.patch.set_facecolor('#222222')
ax.set_facecolor('#222222')

plt.tight_layout()
plt.show()
Figure 4: The shaded triangle represents the event \(X + Y \leq 1/2\) inside the unit square.

The probability equals the area of this triangle (see Figure 4):

\[ \mathrm{P}\left(X + Y \leq \frac{1}{2}\right) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}. \]

The procedure is identical to the discrete case in spirit: describe the sample space (the unit square), identify the event (the triangle), and compute the probability (the area). The only difference is that “counting” has been replaced by “calculating area.” For more complex regions, this area calculation might require integration, but the underlying logic does not change.
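The area answer can also be sanity-checked by simulation: sample many points uniformly in the unit square and measure the fraction that lands in the triangle. A minimal Monte Carlo sketch (ours, not from the course):

```python
import random

# Throw many simulated darts uniformly at the unit square and count
# how often the outcome satisfies X + Y <= 1/2.
random.seed(0)  # fixed seed so the run is reproducible
n = 200_000
hits = sum(1 for _ in range(n)
           if random.random() + random.random() <= 0.5)

print(hits / n)  # close to the exact answer 1/8 = 0.125
```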

A Quick Recap

Let us pause and see how far we have come. We started by asking: how do we reason about uncertainty? The answer is a probabilistic model, which consists of three ingredients:

  1. A sample space \(\Omega\): the set of all possible outcomes.
  2. A collection of events: subsets of \(\Omega\) that we assign probabilities to.
  3. A probability law \(\mathrm{P}\): a function from events to numbers in \(\left[0, 1\right]\), satisfying the three axioms (nonnegativity, normalization, additivity).

We have seen this recipe in action for both discrete settings (the tetrahedral die, where probability = counting) and continuous settings (the dartboard, where probability = area). In both cases, the procedure is the same: set up the model, identify the event, and compute.

But there is one more subtlety we have not addressed. The additivity axiom, as stated, handles two disjoint events. We showed it extends to any finite number of disjoint events. But what happens when we need to handle infinitely many?

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style A fill:#555555,color:#aaaaaa
    style B fill:#555555,color:#aaaaaa
    style C fill:#555555,color:#aaaaaa
    style D fill:#555555,color:#aaaaaa
    style E fill:#555555,color:#aaaaaa
    style F fill:#555555,color:#aaaaaa
    style G fill:#10a37f,color:#fff

Countable Additivity

A Problem That Finite Additivity Cannot Solve

Consider the following experiment: you keep flipping a fair coin repeatedly until you get Heads for the first time. The outcome of this experiment is the number of the flip on which Heads first appears. It could be 1 (Heads on the first flip), or 2 (Tails then Heads), or 37, or a million. There is no upper bound.

The sample space is the set of all positive integers:

\[ \Omega = \left\{1, 2, 3, \dots\right\}. \]

Suppose someone tells us that the probability of needing exactly \(n\) flips is:

\[ \mathrm{P}\left(n\right) = 2^{-n}, \quad n = 1, 2, 3, \dots \]

(We will derive this formula later in the series when we study independence. For now, just accept it as given.)
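Before using this assignment, it is worth checking that it is a legitimate probability law at all: the masses \(2^{-n}\) are nonnegative, and they must sum to 1 over the whole sample space, as Axiom 2 requires. A quick numerical check of the partial sums:

```python
# Partial sum of P(1) + P(2) + ... + P(60) with P(n) = 2**(-n).
# The exact partial sum is 1 - 2**(-60), which is already
# indistinguishable from 1 in double precision.
partial = sum(2.0 ** (-n) for n in range(1, 61))
print(partial)  # the partial sums approach 1 as more terms are added
```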

Code
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('dark_background')

fig, ax = plt.subplots(1, 1, figsize=(8, 4))

n_values = np.arange(1, 9)
probs = 2.0 ** (-n_values)

colors = ['#10a37f' if n % 2 == 0 else '#555555' for n in n_values]

ax.bar(n_values, probs, color=colors, edgecolor='#777777', linewidth=0.8, width=0.6)

for i, (n, p) in enumerate(zip(n_values, probs)):
    ax.text(n, p + 0.01, f'$2^{{-{n}}}$', ha='center', va='bottom', fontsize=10, color='#e6e6e6')

ax.set_xlabel('$n$ (flip number)', fontsize=13, color='#e6e6e6')
ax.set_ylabel('$\\mathrm{P}(n)$', fontsize=13, color='#e6e6e6')
ax.set_xticks(n_values)
ax.set_ylim(0, 0.6)

# Add "..." to indicate continuation
ax.text(9.2, 0.01, '...', fontsize=16, color='#e6e6e6', ha='center')

ax.tick_params(colors='#e6e6e6')
for spine in ax.spines.values():
    spine.set_color('#555555')

fig.patch.set_facecolor('#222222')
ax.set_facecolor('#222222')

plt.tight_layout()
plt.show()
Figure 5: The probability \(\mathrm{P}(n) = 2^{-n}\) decreases geometrically. Even outcomes are highlighted in green.

Now suppose we are asked: what is the probability that the outcome is an even number?

Any reasonable person would write:

\[ \mathrm{P}\left(\left\{2, 4, 6, \dots\right\}\right) = \mathrm{P}\left(2\right) + \mathrm{P}\left(4\right) + \mathrm{P}\left(6\right) + \cdots = \frac{1}{2^2} + \frac{1}{2^4} + \frac{1}{2^6} + \cdots \]

This is a geometric series with first term \(a = 1/4\) and common ratio \(r = 1/4\). Its sum is:

\[ \frac{a}{1 - r} = \frac{1/4}{1 - 1/4} = \frac{1/4}{3/4} = \frac{1}{3}. \tag{4}\]

The answer is \(1/3\), as shown in Equation 4. This makes intuitive sense: the probabilities drop off geometrically, so most of the probability mass sits on small values of \(n\). The experiment is twice as likely to end on flip 1 as on flip 2, twice as likely on flip 3 as on flip 4, and so on, which makes the odd outcomes collectively more likely (probability \(2/3\)) than the even ones (probability \(1/3\)).
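We can also watch the partial sums of this series approach \(1/3\) numerically. A quick sketch (my own illustration; the number of terms is chosen arbitrarily):

```python
# Partial sums of P(2) + P(4) + P(6) + ... = 1/4 + 1/16 + 1/64 + ...
partial = 0.0
for k in range(1, 11):
    partial += 2.0 ** (-2 * k)  # P(2k) = 2^(-2k)
    print(f"first {k:2d} terms: {partial:.10f}")

# The geometric-series limit a / (1 - r) with a = r = 1/4
print(f"limit:          {1 / 3:.10f}")
```

After only ten terms the partial sum agrees with \(1/3\) to better than six decimal places, since the tail of the series shrinks by a factor of 4 with each term.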

But wait. Look at the step where we wrote the probability of an infinite set as the sum of the probabilities of its individual elements. We used Equation 2, which we derived for finite collections. Can we really extend it to an infinite sum?

The Countable Additivity Axiom

The answer is: not from the three axioms we have stated so far. Finite additivity tells us about the union of 2, or 10, or 10,000 disjoint events. It says nothing about an infinite union. To justify the step we just took, we need to strengthen Axiom 3.

ImportantThe Countable Additivity Axiom (Strengthened Axiom 3)

If \(A_1, A_2, A_3, \dots\) is a sequence of disjoint events (i.e., they can be listed as first, second, third, …, and no two share an outcome), then:

\[ \mathrm{P}\left(A_1 \cup A_2 \cup A_3 \cup \cdots\right) = \mathrm{P}\left(A_1\right) + \mathrm{P}\left(A_2\right) + \mathrm{P}\left(A_3\right) + \cdots \tag{5}\]

This is the countable additivity axiom. The word “countable” means that the events can be arranged in a sequence (first, second, third, …), i.e., they can be put in correspondence with the natural numbers. Note that this axiom is strictly stronger than finite additivity: it implies finite additivity (just set \(A_3, A_4, \dots\) to the empty set \(\varnothing\)), but not the other way around.
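To spell out why the parenthetical trick works, note first that countable additivity forces \(\mathrm{P}\left(\varnothing\right) = 0\): applying Equation 5 to the sequence \(A_1 = \Omega\), \(A_2 = A_3 = \cdots = \varnothing\) gives

\[ \mathrm{P}\left(\Omega\right) = \mathrm{P}\left(\Omega\right) + \mathrm{P}\left(\varnothing\right) + \mathrm{P}\left(\varnothing\right) + \cdots, \]

which, since \(\mathrm{P}\left(\Omega\right) = 1\) is finite and every term is nonnegative, can hold only if \(\mathrm{P}\left(\varnothing\right) = 0\). Finite additivity then follows: given finitely many disjoint events \(A_1, \dots, A_n\), pad the list with \(A_{n+1} = A_{n+2} = \cdots = \varnothing\), and the infinite sum in Equation 5 collapses to the finite one.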

With this strengthened axiom, our calculation of \(\mathrm{P}\left(\text{outcome is even}\right) = 1/3\) is fully justified. We took the event \(\left\{2, 4, 6, \dots\right\}\), decomposed it into a countable sequence of disjoint single-element events \(\left\{2\right\}, \left\{4\right\}, \left\{6\right\}, \dots\), and applied Equation 5 to write the probability as an infinite sum.

TipA Note on “Weird Sets”

You might wonder: can we assign probabilities to every subset of \(\Omega\)? For finite or countably infinite sample spaces, the answer is yes. But for uncountable sample spaces like the unit square, it turns out that there exist some pathological subsets (constructed using the Axiom of Choice in set theory) that cannot be assigned probabilities in a way that is consistent with the axioms. These sets never arise in practice, and you will only encounter them if you study the measure-theoretic foundations of probability at the graduate level. For our purposes, we can safely ignore them. Every set we will ever work with in this series will have a well-defined probability.

Wrapping Up: The Complete Recipe

Let us return to where we started. You are standing at the door, wondering about the umbrella. We can now frame this as a probabilistic model.

  1. Sample space. The possible outcomes might be \(\Omega = \left\{\text{Rain}, \text{No Rain}\right\}\) for a simple model, or something more granular like \(\Omega = \left\{\text{Heavy Rain}, \text{Light Rain}, \text{Drizzle}, \text{No Rain}\right\}\) if you care about the intensity.
  2. Events. The event “I get wet” might correspond to \(\left\{\text{Heavy Rain}, \text{Light Rain}, \text{Drizzle}\right\}\).
  3. Probability law. Based on the weather forecast, past experience, or a meteorological model, you assign probabilities: maybe \(\mathrm{P}\left(\text{Heavy Rain}\right) = 0.05\), \(\mathrm{P}\left(\text{Light Rain}\right) = 0.15\), \(\mathrm{P}\left(\text{Drizzle}\right) = 0.10\), \(\mathrm{P}\left(\text{No Rain}\right) = 0.70\). These probabilities are nonnegative, they sum to 1, and they respect additivity. Then \(\mathrm{P}\left(\text{I get wet}\right) = 0.05 + 0.15 + 0.10 = 0.30\).

That is the whole recipe. Every probabilistic model, from the simplest coin flip to the most complex machine learning system, is built on these three ingredients and these three axioms (with countable additivity replacing finite additivity).
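As a minimal sketch, the three ingredients fit in a few lines of Python (using the illustrative numbers from step 3; this is my own toy model, not a real forecast):

```python
# Sample space and probability law: each outcome mapped to its probability
law = {
    "Heavy Rain": 0.05,
    "Light Rain": 0.15,
    "Drizzle":    0.10,
    "No Rain":    0.70,
}

# Axiom checks: nonnegativity and normalization
assert all(p >= 0 for p in law.values())
assert abs(sum(law.values()) - 1.0) < 1e-12

# The event "I get wet" and its probability, computed by additivity
wet = {"Heavy Rain", "Light Rain", "Drizzle"}
p_wet = sum(law[outcome] for outcome in wet)
print(f"P(I get wet) = {p_wet:.2f}")
```

Representing the probability law as a dictionary over outcomes and events as sets of outcomes mirrors the mathematical definitions exactly, which is why this pattern scales to any finite probabilistic model.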

Here is a summary of the axioms in their final form:

NoteThe Axioms of Probability (Complete Version)

Given a sample space \(\Omega\) and a probability law \(\mathrm{P}\):

  1. (Nonnegativity) \(\mathrm{P}\left(A\right) \geq 0\) for every event \(A\).
  2. (Normalization) \(\mathrm{P}\left(\Omega\right) = 1\).
  3. (Countable Additivity) If \(A_1, A_2, \dots\) are pairwise disjoint events, then \(\mathrm{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathrm{P}\left(A_i\right)\).

In the next post, we will use these axioms to derive several useful properties of probabilities (such as the inclusion-exclusion formula), and then move on to one of the most powerful ideas in probability: conditioning. Conditioning is the tool that lets us update our beliefs when we learn new information, and it will transform the way we think about probability problems.

flowchart TD
    A["<b>Motivation</b><br/>Why do we need a framework<br/>for uncertainty?"]
    B["<b>Sample Space</b><br/>What can possibly happen?"]
    C["<b>Events</b><br/>What do we care about?"]
    D["<b>Probability Axioms</b><br/>The three rules of the game"]
    E["<b>Discrete Uniform Law</b><br/>Tetrahedral die example"]
    F["<b>Continuous Uniform Law</b><br/>Darts on a square"]
    G["<b>Countable Additivity</b><br/>When three axioms<br/>are not quite enough"]
    H["<b>Wrap-Up</b><br/>The complete recipe"]

    A --> B --> C --> D --> E --> F --> G --> H

    style H fill:#10a37f,color:#fff

References

Bertsekas, D., & Tsitsiklis, J. N. (2008). Introduction to probability (Vol. 1). Athena Scientific.
Blitzstein, J. K., & Hwang, J. (2019). Introduction to probability. Chapman & Hall/CRC.
Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung.
MIT OpenCourseWare. (2014). 6.041 Probabilistic Systems Analysis and Applied Probability. https://youtube.com/playlist?list=PLUl4u3cNGP61MdtwGTqZA0MreSaDybji8&si=dAhEiX4O7IzqiN0j
MIT OpenCourseWare. (2018). MIT RES.6-012 Introduction to Probability, Spring 2018. https://youtube.com/playlist?list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&si=SMI9zMClfJ1Iuj7I