61  Probability theory

Quantifying uncertainty

A component comes off a production line. You don’t know whether it’s defective. You test 1000 and 23 fail. What can you say about the next one? About the next batch of 500?

These questions have precise answers. Getting them requires a framework for describing uncertainty — assigning likelihoods to outcomes, combining them consistently, and extracting the quantities that are actually useful.

61.1 What this chapter helps you do

Symbols to keep handy

These are the bits of notation you'll see a lot. If a line of symbols feels like a fence, read it out loud once, then keep going.

  • Ω: omega — the sample space, the set of all possible outcomes

  • Cov(X,Y): the covariance of X and Y — how much they move together

  • Var(X): the variance of X — the average squared deviation from the mean

  • Φ(z): capital phi — the CDF of the standard normal distribution N(0,1)

  • F(x): the cumulative distribution function — P(X ≤ x)

  • E[X]: the expected value of X — the long-run average

  • P(A | B): the probability of A given that B has occurred

  • f(x): the probability density function — how likely values near x are

Definitions to keep handy

These are the words we keep coming back to. If one feels slippery, come back here and steady it before you push on.

  • sample space: The set of all possible outcomes for the process you’re modelling.

  • event: A set of outcomes you care about (for example: ‘rain tomorrow’ or ‘more than 6 requests’).

  • random variable: A rule that assigns a number to each outcome, so we can do arithmetic on uncertainty.

  • probability distribution: A complete description of how likely different outcomes/values are.

  • expected value: A long-run average: what you should get on average over many repeats.

  • variance / standard deviation: A measure of spread: how far values typically wander from the mean.

  • Central Limit Theorem (CLT): A result that explains why sums/averages of many small random effects often look approximately normal.

Here is the main move this chapter is making, in plain terms. You do not need to be fast. You just need to keep the thread.

  • Coming in: You have a process whose outcome you cannot predict exactly. You want to say something precise about the long run.

  • Leaving with: A probability distribution is a complete description of all possible outcomes and their likelihoods. Expectation, variance, and the CLT let you reason about sums, means, and rare events.

61.2 Probability axioms

61.2.1 Sample space and events

The sample space \Omega (read: “omega”) is the complete set of possible outcomes of an experiment. For a coin flip, \Omega = \{\text{heads}, \text{tails}\}. For rolling a six-sided die, \Omega = \{1, 2, 3, 4, 5, 6\}. For the lifetime of a component in hours, \Omega = [0, \infty).

An event A is any subset of \Omega — a collection of outcomes you care about. The event “roll an even number” is the subset \{2, 4, 6\} \subset \Omega.

61.2.2 Kolmogorov axioms

A probability measure P assigns a number to each event. Three axioms define what that assignment must satisfy:

  1. Non-negativity: P(A) \geq 0 for every event A.
  2. Normalisation: P(\Omega) = 1 — something must happen.
  3. Additivity: For mutually exclusive events A and B (events that cannot both occur), P(A \cup B) = P(A) + P(B).

These axioms are the whole foundation. Everything else is a consequence.

From them you can show:

  • P(\emptyset) = 0 (the impossible event has probability zero)
  • P(A^c) = 1 - P(A), where A^c is the complement of A (everything not in A)
  • If A \subset B then P(A) \leq P(B)
  • The inclusion-exclusion rule: P(A \cup B) = P(A) + P(B) - P(A \cap B)

The third axiom extends to any countable collection of mutually exclusive events (finitely many, or an infinite but listable sequence): P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots
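These consequences can be checked mechanically on a small finite sample space. The fair six-sided die below is our illustrative choice, not something the axioms require; `prob` simply counts outcomes, and exact fractions avoid rounding.

```python
from fractions import Fraction

# Illustrative example: a fair six-sided die, each outcome with probability 1/6
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) under the uniform distribution on omega: count and divide."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}   # the event "roll an even number"
low = {1, 2, 3}    # the event "roll at most 3"

# Complement rule: P(A^c) = 1 - P(A)
complement_ok = prob(omega - even) == 1 - prob(even)

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
incl_excl_ok = prob(even | low) == prob(even) + prob(low) - prob(even & low)
```

Both checks reduce to counting: the union {1, 2, 3, 4, 6} has five of the six outcomes, and adding P(A) + P(B) alone would double-count the overlap {2}.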

61.2.3 Conditional probability

The conditional probability of A given B — written P(A \mid B), read “probability of A given B” — is defined as:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

This is the fraction of B’s probability that overlaps with A. If you know B has occurred, you’re restricting attention to the outcomes in B and asking what fraction of those are also in A. This means conditioning changes the universe you are measuring inside: once B is known to have happened, all probabilities are rescaled relative to B.

Two events A and B are independent if knowing B occurred tells you nothing about A:

P(A \mid B) = P(A) \iff P(A \cap B) = P(A) \cdot P(B)

The equivalence is just one substitution. Starting from P(A \mid B) = P(A) and using the definition of conditional probability:

\frac{P(A \cap B)}{P(B)} = P(A) \;\Longrightarrow\; P(A \cap B) = P(A)\,P(B)

Conversely, dividing P(A \cap B) = P(A)\,P(B) by P(B) (when P(B) > 0) returns P(A \mid B) = P(A).
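Both the definition and the independence criterion can be tested on a concrete case. The die and the two events below are our choices for illustration: A = "roll an even number", B = "roll more than 3".

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # a fair die, as an illustrative sample space

def prob(event):
    return Fraction(len(event & omega), len(omega))

def cond(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined when P(B) > 0."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}   # A
high = {4, 5, 6}   # B

p_given = cond(even, high)   # P(A | B) = (2/6) / (3/6) = 2/3
# Independence test: does the joint probability factor?
independent = prob(even & high) == prob(even) * prob(high)
```

Here P(A) = 1/2 but P(A | B) = 2/3: learning that the roll is high makes "even" more likely, so the two events are not independent, and the product test fails (1/3 versus 1/4).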

61.2.4 Bayes’ theorem

Rearranging the definition of conditional probability:

P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)

Solving for P(A \mid B):

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

This is Bayes’ theorem. It lets you reverse the conditioning: if you know P(B \mid A), you can compute P(A \mid B).

The denominator P(B) is expanded using the total probability rule. If A_1, A_2, \ldots, A_n partition \Omega (exhaustive, mutually exclusive), then:

P(B) = \sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)

Why this works

Bayes’ theorem doesn’t involve any new mathematics — it’s just two ways of writing the same joint probability P(A \cap B), set equal. What makes it powerful is the direction: you have the likelihood P(B \mid A) (how probable is the evidence given the hypothesis?) and you want P(A \mid B) (how probable is the hypothesis given the evidence?). Bayes is the bridge between them.

Worked example: Medical test. A disease affects 1% of the population. A test for it has sensitivity 99% (correctly identifies 99% of sick patients) and specificity 95% (correctly clears 95% of healthy patients). A randomly selected person tests positive. What is the probability they actually have the disease?

Define:

  • D: the person has the disease
  • T^+: the test is positive

Given: P(D) = 0.01, P(T^+ \mid D) = 0.99 (sensitivity), P(T^+ \mid D^c) = 0.05

Note also that P(D^c) = 1 - P(D) = 1 - 0.01 = 0.99 — the complement of the prevalence, which happens to equal the sensitivity by coincidence.

First compute the total probability of a positive test, using the two ways a positive result can occur:

P(T^+) = P(T^+ \mid D) \cdot P(D) + P(T^+ \mid D^c) \cdot P(D^c) = \underbrace{0.99}_{\text{sensitivity}} \times 0.01 + 0.05 \times \underbrace{0.99}_{P(D^c) = 1 - 0.01} = 0.0099 + 0.0495 = 0.0594

Now apply Bayes:

P(D \mid T^+) = \frac{P(T^+ \mid D) \cdot P(D)}{P(T^+)} = \frac{0.99 \times 0.01}{0.0594} \approx 0.167

Even with a positive result from a 99%-sensitive test, the probability of actually having the disease is only about 17%. This is not a failure of the test — it’s a consequence of the disease being rare. Most of the positive tests come from the large healthy population, not the small sick one. The prior probability P(D) matters enormously.
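The whole calculation fits in a few lines of code. The numbers below are exactly those from the worked example above.

```python
# Bayes' theorem for the medical-test example (values from the text)
prevalence = 0.01       # P(D)
sensitivity = 0.99      # P(T+ | D)
false_positive = 0.05   # P(T+ | D^c) = 1 - specificity

# Total probability of a positive test: sick route plus healthy route
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes: P(D | T+) = P(T+ | D) P(D) / P(T+)
posterior = sensitivity * prevalence / p_positive
```

Running this reproduces P(T^+) = 0.0594 and a posterior of about 0.167. Varying `prevalence` is instructive: at 10% prevalence the same test yields a posterior near 69%, which makes the role of the prior vivid.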

61.3 Random variables

A random variable X is a function from the sample space \Omega to the real numbers — it assigns a numerical value to each outcome. The randomness is in the outcome; the function is deterministic.

Discrete random variables take values in a countable set \{x_1, x_2, \ldots\}. The distribution is completely described by the probability mass function (PMF):

p(x_i) = P(X = x_i), \quad \sum_i p(x_i) = 1

Continuous random variables take values in an interval (or union of intervals). The distribution is described by the probability density function (PDF) f(x), where:

P(a \leq X \leq b) = \int_a^b f(x)\, dx, \quad \int_{-\infty}^{\infty} f(x)\, dx = 1

Note that f(x) is not a probability — it can exceed 1. It’s a density: the probability of X falling in a small interval [x, x + dx] is approximately f(x)\, dx.

A concrete example: if X \sim \text{Uniform}(0,\, 0.5), then f(x) = 2 everywhere on [0, 0.5]. The density is 2, yet P(0 \leq X \leq 0.5) = \int_0^{0.5} 2\, dx = 1. The density can exceed 1; the integral over any interval is still between 0 and 1. This means the density is a probability-per-unit-length, not a probability by itself. Probabilities come from area under the curve, not from the height alone.
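This point can be checked numerically. The sketch below uses a crude midpoint Riemann sum in place of the integral; the step count is an arbitrary choice, not a recommendation for serious quadrature.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of Uniform(a, b): constant 1/(b - a) on [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def prob_interval(lo, hi, n=100_000):
    """Midpoint Riemann sum approximating the integral of the density over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(uniform_pdf(lo + (i + 0.5) * dx) * dx for i in range(n))

height = uniform_pdf(0.25)        # 2.0: the density exceeds 1
total = prob_interval(0.0, 0.5)   # close to 1.0: the probability does not
```

The height of the curve is 2 everywhere on the support, yet every probability computed as an area stays in [0, 1].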

The cumulative distribution function (CDF) is defined for both types:

F(x) = P(X \leq x)

For continuous X: F(x) = \int_{-\infty}^{x} f(t)\, dt, and f(x) = F'(x).

61.3.1 Expectation

The expected value E[X] — also written \mu (mu) — is the long-run average of X over many repetitions:

Discrete: \displaystyle E[X] = \sum_i x_i \, p(x_i)

Continuous: \displaystyle E[X] = \int_{-\infty}^{\infty} x \, f(x)\, dx

Expectation is linear: E[aX + b] = a\,E[X] + b for constants a, b.

For a function g(X):

E[g(X)] = \sum_i g(x_i)\, p(x_i) \quad \text{(discrete)}

E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \quad \text{(continuous)}

61.3.2 Variance

The variance \text{Var}(X) — also written \sigma^2 (sigma squared) — measures the average squared deviation from the mean:

\text{Var}(X) = E\!\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2

The second form is usually easier to compute. The derivation:

E\!\left[(X - \mu)^2\right] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu\,E[X] + \mu^2

Since E[X] = \mu, the last two terms combine: -2\mu \cdot \mu + \mu^2 = -2\mu^2 + \mu^2 = -\mu^2. So:

= E[X^2] - \mu^2

Both forms measure the same spread, the average squared distance from the mean; the second simply avoids computing each deviation separately.

The standard deviation \sigma = \sqrt{\text{Var}(X)} is in the same units as X, which makes it interpretable.

Variance scales with the square: \text{Var}(aX + b) = a^2\,\text{Var}(X).

For independent random variables X and Y: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).
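The definitions and the scaling rule can be verified exactly on a small discrete distribution. The fair die and the constants a, b below are our illustrative choices.

```python
# A fair die: values 1..6, each with probability 1/6 (illustrative distribution)
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6

mean = sum(x * p for x in values)                  # E[X] = 3.5
var = sum((x - mean) ** 2 * p for x in values)     # Var(X) = 35/12

a, b = 3, 10                                       # arbitrary constants
shifted = [a * x + b for x in values]              # the variable aX + b
mean2 = sum(y * p for y in shifted)                # should equal a*E[X] + b
var2 = sum((y - mean2) ** 2 * p for y in shifted)  # should equal a^2 * Var(X)
```

The shift b moves the mean but leaves the variance untouched, while the scale a enters the variance squared: var2 comes out as 9 times var.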

61.4 Key discrete distributions

61.4.1 Bernoulli

A single trial with probability p of success, 1-p of failure.

P(X = 1) = p, \quad P(X = 0) = 1 - p

E[X] = p, \quad \text{Var}(X) = p(1-p)

The Bernoulli is the building block for everything that follows.

61.4.2 Binomial

n independent Bernoulli trials, each with success probability p. X counts the total number of successes.

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n

The binomial coefficient \binom{n}{k} — read “n choose k” — counts the number of ways to arrange k successes among n trials:

\binom{n}{k} = \frac{n!}{k!(n-k)!}

Mean and variance. Since X is the sum of n independent Bernoulli trials, linearity of expectation and variance additivity give:

E[X] = np, \quad \text{Var}(X) = np(1-p)

Normal approximation. When np \geq 5 and n(1-p) \geq 5, the binomial is well approximated by a normal distribution with the same mean and variance. This is one preview of the CLT, which appears later in this chapter and explains why sums and averages often become approximately normal.
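A short numerical check of the approximation, with n and p chosen here (our choice) so that both conditions hold comfortably.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.3                 # np = 30 and n(1-p) = 70, both well above 5
mean, var = n * p, n * p * (1 - p)

exact = binom_pmf(30, n, p)
# Density of N(mean, var) evaluated at k = 30, standing in for the pmf
approx = math.exp(-(30 - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

At the mean the two values agree to about three decimal places; for more accurate tail probabilities one would normally add a continuity correction, which this sketch omits.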

61.4.3 Poisson

Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots

The parameter \lambda > 0 — read “lambda” — is both the mean and the variance:

E[X] = \lambda, \quad \text{Var}(X) = \lambda

As a limit of Binomial. Fix \lambda = np (so p = \lambda/n) and let n \to \infty, which forces p \to 0. The binomial PMF converges to the Poisson. This is why the Poisson appears when events are rare but trials are many: the number of typing errors per page, radioactive decays per second, server requests per minute.

Derivation of the mean. Using the Poisson PMF:

E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \cdot \lambda \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = e^{-\lambda} \cdot \lambda \cdot e^{\lambda} = \lambda
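The limit can be watched numerically: hold λ = np fixed and compare the two PMFs as n grows. The value λ = 3 and the two sample sizes are arbitrary choices.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 3.0   # hold lambda = n*p fixed while n grows

def max_gap(n):
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) for small k."""
    return max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(10))

gap_small = max_gap(10)       # crude: n = 10, p = 0.3
gap_large = max_gap(10_000)   # much closer: n = 10000, p = 0.0003
```

With n = 10 the gap is a few percent; with n = 10000 it drops below 10⁻³, which is the "rare events, many trials" regime in action.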

61.5 Key continuous distributions

61.5.1 Uniform

X \sim \text{Uniform}(a, b) — every value in [a,b] is equally likely.

f(x) = \frac{1}{b-a}, \quad a \leq x \leq b

E[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}

The CDF is: F(x) = (x-a)/(b-a) for a \leq x \leq b.

61.5.2 Exponential

Models the time until the first event in a Poisson process — waiting times, component lifetimes, time between arrivals.

f(x) = \lambda e^{-\lambda x}, \quad x \geq 0

F(x) = 1 - e^{-\lambda x}

Derivation of mean. Integrate by parts:

E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\, dx

Let u = x, dv = \lambda e^{-\lambda x}\, dx. Then du = dx, v = -e^{-\lambda x}.

E[X] = \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = 0 + \left[-\frac{1}{\lambda} e^{-\lambda x}\right]_0^{\infty} = \frac{1}{\lambda}

Derivation of variance. First compute E[X^2]:

E[X^2] = \int_0^{\infty} x^2 \cdot \lambda e^{-\lambda x}\, dx = \frac{2}{\lambda^2}

One quick route is integration by parts twice. Let u=x^2 and dv=\lambda e^{-\lambda x}\,dx, so du=2x\,dx and v=-e^{-\lambda x}:

E[X^2] = \left[-x^2 e^{-\lambda x}\right]_0^\infty + 2\int_0^\infty x e^{-\lambda x}\,dx

The boundary term is 0, and the remaining integral is the same type used in the mean calculation. Evaluating it gives 2/\lambda^2.

\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}

Memoryless property. The exponential is the only continuous distribution with no memory:

P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t \geq 0

If the component has already survived s hours, the probability it survives another t hours is the same as if it were brand new. The past waiting time gives no information about the future.

Proof. Using the survival function P(X > x) = e^{-\lambda x}:

P(X > s+t \mid X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t) \quad \checkmark
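The proof can be corroborated by simulation. The rate, the horizons s and t, the seed, and the sample size below are all arbitrary choices.

```python
import math
import random

random.seed(0)                       # fixed seed, purely for reproducibility
lam, s, t = 0.5, 1.0, 2.0            # rate and horizons: arbitrary choices
samples = [random.expovariate(lam) for _ in range(200_000)]

survived_s = [x for x in samples if x > s]
# Conditional survival: of the lifetimes that passed s, what fraction pass s + t?
cond_est = sum(x > s + t for x in survived_s) / len(survived_s)
exact = math.exp(-lam * t)           # P(X > t) = e^{-lambda*t}, about 0.368
```

The conditional estimate lands within Monte Carlo noise of e^{-λt}, exactly as the memoryless property predicts: having survived s hours changes nothing.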

61.5.3 Normal

X \sim N(\mu, \sigma^2) — the most important distribution in applied probability, for reasons the CLT will make clear.

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Parameters: \mu is the mean, \sigma^2 is the variance.

E[X] = \mu, \quad \text{Var}(X) = \sigma^2

There is no closed form for the CDF — it’s evaluated numerically and tabulated as the standard normal CDF \Phi(z), where \Phi (capital phi) is the CDF of Z \sim N(0,1).

Standardisation. Any normal X \sim N(\mu, \sigma^2) can be converted to the standard normal by:

Z = \frac{X - \mu}{\sigma}

Then P(X \leq x) = P\!\left(Z \leq \frac{x-\mu}{\sigma}\right) = \Phi\!\left(\frac{x-\mu}{\sigma}\right).

The transformation Z = (X - \mu)/\sigma — “subtract the mean, divide by the standard deviation” — centres the distribution at zero and scales it to unit variance. Every probability question about X \sim N(\mu, \sigma^2) reduces to looking up a value in the standard normal table.
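In code, the table lookup becomes a call to the error function, since Φ(z) = (1 + erf(z/√2))/2. The parameters in the last line are illustrative choices, not values from the text.

```python
import math

def phi(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2): standardise, then use Phi."""
    return phi((x - mu) / sigma)

# Illustrative parameters: X ~ N(100, 15^2)
p = normal_cdf(115, mu=100, sigma=15)   # equals Phi(1), about 0.841
```

Every question about any normal distribution routes through the same one-argument function `phi`, which is the whole point of standardisation.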

61.6 Joint distributions

For two random variables X and Y, the joint distribution describes their behaviour together. The key concept for most applications is independence.

X and Y are independent if knowing the value of one gives no information about the other:

f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)

That is, the joint density factors into the product of the marginals.

Covariance. A measure of how X and Y move together:

\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]\,E[Y]

If X and Y are independent, E[XY] = E[X]\,E[Y], so \text{Cov}(X,Y) = 0. (The converse is not generally true: zero covariance does not imply independence.)

Correlation coefficient. Covariance is scale-dependent — multiplying X by 2 doubles the covariance. The correlation normalises this:

\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}, \quad -1 \leq \rho \leq 1

\rho = 1 means perfect positive linear relationship; \rho = -1 means perfect negative; \rho = 0 means no linear relationship.

Variance of a sum. For any X and Y (not necessarily independent):

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)

If they are independent, the covariance term vanishes.
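These sample-moment estimators are easy to write from scratch. The construction of `ys` below is our own, chosen so that the true covariance is 0.8 and Var(Y) = 0.64 + 0.36 = 1; the seed and sample size are arbitrary.

```python
import random

random.seed(1)                      # reproducibility only
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
# Built so that Cov(X, Y) = 0.8 and Var(Y) = 0.8^2 + 0.6^2 = 1
ys = [0.8 * x + 0.6 * random.gauss(0, 1) for x in xs]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance; cov(v, v) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

c = cov(xs, ys)                                      # estimates 0.8
rho = c / (cov(xs, xs) ** 0.5 * cov(ys, ys) ** 0.5)  # estimates 0.8
zs = [x + y for x, y in zip(xs, ys)]
var_sum = cov(zs, zs)   # matches Var(X) + Var(Y) + 2 Cov(X, Y)
```

The variance-of-a-sum identity is algebraic, so it holds for the sample moments to floating-point precision, not merely approximately.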

61.7 Central Limit Theorem

Let X_1, X_2, \ldots, X_n be independent, identically distributed random variables with mean \mu and variance \sigma^2 < \infty. Define the sample mean:

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i

Central Limit Theorem (CLT): As n \to \infty,

\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)

Read \xrightarrow{d} as “converges in distribution to” — as n grows, the distribution of the standardised mean gets closer and closer to the standard normal.

In practice: for large n, the distribution of \bar{X}_n is approximately N(\mu,\, \sigma^2/n).

Three things the CLT is saying:

  1. The sample mean converges to the true mean \mu — not just as a hope, but with a rate: the spread is \sigma/\sqrt{n}, shrinking like 1/\sqrt{n}.

  2. The limiting distribution is always normal, regardless of the distribution of the individual X_i. You do not need X_i to be normal — exponential, uniform, Bernoulli, anything — the average becomes normal.

  3. The only requirements are: independent, identically distributed (abbreviated i.i.d.), finite variance. No further conditions.

Conditions for practical use. The approximation is usually adequate when n \geq 30, and excellent for n \geq 50. For distributions that are already close to normal, smaller n suffices. For very skewed distributions (e.g. heavy-tailed), larger n may be needed.

Worked example: Sample mean from an exponential population.

Components have lifetimes X_i \sim \text{Exp}(\lambda = 0.5), so \mu = 1/\lambda = 2 hours and \sigma^2 = 1/\lambda^2 = 4.

Take a sample of n = 50 components. By the CLT, the sample mean is approximately:

\bar{X}_{50} \approx N\!\left(2,\, \frac{4}{50}\right) = N(2,\, 0.08)

Standard deviation of the mean: \sigma/\sqrt{n} = 2/\sqrt{50} \approx 0.283.

Find P(\bar{X}_{50} > 2.4):

P(\bar{X}_{50} > 2.4) = P\!\left(Z > \frac{2.4 - 2}{0.283}\right) = P(Z > 1.414) = 1 - \Phi(1.414) \approx 0.079

There is about an 8% chance the sample mean exceeds 2.4 hours. The individual lifetimes are exponential and highly right-skewed — but the average over 50 of them behaves almost exactly like a normal random variable.
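A simulation of this exact setup shows the approximation at work; the seed and the number of trials are arbitrary choices on our part.

```python
import math
import random

random.seed(2)                       # arbitrary seed
lam, n, trials = 0.5, 50, 20_000     # rate and n from the example; trials is our choice

# Each trial: average the lifetimes of 50 simulated components
means = [sum(random.expovariate(lam) for _ in range(n)) / n for _ in range(trials)]
frac_above = sum(m > 2.4 for m in means) / trials   # simulated P(mean > 2.4)

z = (2.4 - 2.0) / (2.0 / math.sqrt(n))                 # about 1.414
clt_tail = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # 1 - Phi(z), about 0.079
```

The simulated tail frequency lands close to the CLT's 0.079; the small residual gap is the skew of the exponential still visible at n = 50, and it shrinks further as n grows.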

61.8 Where this goes

This chapter provides the foundation for Chapter 2: Mathematical statistics (this volume). Everything in that chapter — maximum likelihood estimation, confidence intervals, hypothesis tests — takes the distributions developed here and uses them to make inferences about unknown parameters from data. The CLT is the engine: it justifies using normal-distribution machinery on sample means regardless of the underlying population distribution, which is why z-tests and t-tests work in practice.

The connections extend further into other sections of Vol 7. The Poisson distribution arises in queuing theory (a special case of stochastic processes). The normal distribution underlies error analysis in numerical methods: when you propagate measurement uncertainties through a computation, the CLT explains why the output errors are approximately normal. In engineering statistics, the same distributions appear in reliability theory, quality control (Six Sigma thresholds are expressed in \sigma units), and signal detection.

Where this shows up

  • A reliability engineer models component lifetimes as exponential and uses the memoryless property to compute replacement schedules.
  • An actuary prices insurance using the Poisson distribution to model the number of claims per period.
  • A machine learning engineer interprets model output probabilities using the Bayes framework — the model’s output is P(Y \mid X), not P(X \mid Y).
  • A signal processing engineer applies the CLT to justify that thermal noise in electronic circuits is modelled as Gaussian.
  • A quality control engineer uses the binomial distribution to decide whether a batch rejection threshold is appropriate for a given defect rate.

61.9 Exercises

These are puzzles. Each has a clean numerical answer. The interesting part is identifying which distribution applies and setting up the probability correctly.


Exercise 1. A factory has two machines. Machine A produces 60% of output and has a 4% defect rate. Machine B produces the remaining 40% and has a 1% defect rate. An inspector picks a component at random and finds it defective. What is the probability it came from Machine A?


Exercise 2. A quality control test checks batches of 8 components. Each component, independently, has a failure probability of 0.1. Find the probability that at least 3 components in a batch fail.


Exercise 3. A call centre receives calls at an average rate of 4 per minute. In a 30-second window, what is the probability of receiving 3 or more calls?


Exercise 4. The time to failure of a device follows an exponential distribution with rate \lambda = 0.02 per hour. Find: (a) P(T > 100), (b) E[T], (c) \text{Var}(T).


Exercise 5. A manufacturing process produces items whose length X \sim N(75, 100) mm (mean 75 mm, variance 100 mm²). The specification requires length greater than 85 mm. Find P(X > 85).


Exercise 6. Let X_1, X_2, \ldots, X_{50} be i.i.d. \text{Exp}(1) random variables. Their mean is \mu = 1 and variance is \sigma^2 = 1. Use the CLT to approximate P(\bar{X}_{50} > 1.2).