62  Mathematical statistics

Inference from data

A production line fills bottles to a stated volume. A sample of 25 bottles shows a mean fill of 497 mL against a target of 500 mL. Is that a real problem, or just expected random variation? You need a principled way to decide.

A new catalyst cuts reaction time in a lab test. But the test used only eight runs. How confident are you that the effect is real and not noise?

You have data. The question is what the data can actually tell you — and, just as importantly, what it cannot.

62.1 What this chapter helps you do

Symbols to keep handy

These are the bits of notation you'll see a lot. If a line of symbols feels like a fence, read it out loud once, then keep going.

  • \ell(\theta): ell of theta — the log-likelihood function

  • H_0,\; H_1: H-naught and H-one — the null and alternative hypotheses

  • \hat{\beta}_0,\; \hat{\beta}_1: beta-hat-zero and beta-hat-one — the OLS intercept and slope estimates

  • t_{\alpha/2,\,n-1}: t at alpha over two, n minus one degrees of freedom

  • R^2: R-squared — the coefficient of determination

Definitions to keep handy

These are the words we keep coming back to. If one feels slippery, come back here and steady it before you push on.

  • estimator: A rule that turns data into an estimate of an unknown quantity.

  • likelihood: A function that scores parameter values by how well they explain the observed data.

  • confidence interval: A range of plausible parameter values with a stated long-run coverage guarantee.

  • hypothesis test: A formal procedure for deciding whether data is compatible with a baseline claim.

  • p-value: If the null hypothesis were true, the probability of seeing a result at least this extreme.

  • regression: A model for how an output variable changes with one or more inputs, plus an error term.

Here is the main move this chapter is making, in plain terms. You do not need to be fast. You just need to keep the thread.

  • Coming in: You have data sampled from an unknown distribution. You want to draw reliable conclusions about the population.

  • Leaving with: Point estimates, confidence intervals, and hypothesis tests are all formal ways to quantify what the data can and cannot tell you, with explicit control over the probability of being wrong.

62.2 Point estimation

You observe n independent realisations x_1, x_2, \ldots, x_n from a distribution that has an unknown parameter \theta — a proportion, a rate, a mean. A point estimate \hat{\theta} — read “theta-hat” — is a single number computed from the sample to serve as your best guess at \theta.

The most systematic method for finding point estimates is maximum likelihood estimation (MLE): choose the value of \theta that makes the observed data most probable.

62.2.1 Setting up the likelihood

If the x_i are i.i.d. (independent, identically distributed) from a distribution with density (or mass function) f(x; \theta), the likelihood is the joint probability of the observed data, viewed as a function of \theta:

L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)

Read this as: the product, over all n observations, of the density evaluated at each x_i. The MLE is the \theta that maximises this.

Products are hard to differentiate. Take the natural log — it’s monotone, so the maximum is preserved. The log-likelihood is:

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta)

Read “\ell(\theta)” as “ell of theta”. Now differentiate and set equal to zero.

62.2.2 MLE for the Bernoulli distribution

Let X_i \sim \text{Bernoulli}(p): each observation is 1 (success) or 0 (failure), with unknown success probability p. The mass function is f(x; p) = p^x (1-p)^{1-x}: for x=1 this gives f(1;p)=p, and for x=0 it gives f(0;p)=1-p, so the compact formula is just the Bernoulli mass function written in one line. The log-likelihood is:

\ell(p) = \sum_{i=1}^n \left[ x_i \log p + (1-x_i)\log(1-p) \right] = \left(\sum x_i\right)\log p + \left(n - \sum x_i\right)\log(1-p)

Let S = \sum x_i (the count of successes). Differentiate with respect to p and set to zero:

\frac{d\ell}{dp} = \frac{S}{p} - \frac{n - S}{1-p} = 0

Multiply through by p(1-p):

S(1-p) - (n-S)p = 0 \implies S - Sp - np + Sp = 0 \implies S = np

\boxed{\hat{p} = \frac{S}{n} = \bar{X}}

The MLE is the sample proportion, which confirms the common-sense intuition: the best estimate of a success probability is the observed fraction of successes.
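The calculus can be sanity-checked numerically. A sketch (the 0/1 sample below is made up): scan a grid of p values and confirm the log-likelihood peaks at the sample proportion.

```python
# Numeric check of the Bernoulli MLE on a hypothetical 0/1 sample.
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]          # made-up sample, S = 7, n = 10
S, n = sum(data), len(data)

def loglik(p):
    # ell(p) = S log p + (n - S) log(1 - p)
    return S * math.log(p) + (n - S) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]       # p in (0, 1)
p_numeric = max(grid, key=loglik)

print(f"closed form  p-hat = {S / n:.3f}")      # 0.700
print(f"grid maximum p-hat = {p_numeric:.3f}")  # 0.700
```

The grid maximum lands exactly on S/n, as the derivation predicts.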

62.2.3 MLE for the Exponential distribution

Let X_i \sim \text{Exp}(\lambda): density f(x;\lambda) = \lambda e^{-\lambda x} for x \geq 0. The log-likelihood is:

\ell(\lambda) = \sum_{i=1}^n \left[\log\lambda - \lambda x_i\right] = n\log\lambda - \lambda \sum_{i=1}^n x_i

Differentiate and set to zero:

\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0 \implies \lambda = \frac{n}{\sum x_i}

\boxed{\hat{\lambda} = \frac{1}{\bar{X}}}

The MLE of the rate is the reciprocal of the sample mean. If the mean lifetime in the sample is 5 hours, you estimate the failure rate as 0.2 failures per hour.
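The same check works here. A sketch with five hypothetical lifetimes chosen so the sample mean is 5 hours, matching the example in the text:

```python
# Numeric check of the Exponential MLE: lambda-hat = 1 / x-bar.
import math

data = [3.1, 7.4, 4.9, 6.2, 3.4]           # hypothetical lifetimes, mean 5.0
n = len(data)
xbar = sum(data) / n
lam_closed = 1 / xbar                      # reciprocal of the sample mean

def loglik(lam):
    # ell(lambda) = n log(lambda) - lambda * sum(x_i)
    return n * math.log(lam) - lam * sum(data)

grid = [i / 1000 for i in range(1, 2000)]
lam_numeric = max(grid, key=loglik)

print(f"1 / x-bar    = {lam_closed:.4f}")   # 0.2000
print(f"grid maximum = {lam_numeric:.4f}")  # 0.2000
```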

62.2.4 Unbiasedness and consistency

An estimator is unbiased if E[\hat{\theta}] = \theta — correct on average. The sample mean \bar{X} is unbiased for \mu; the sample variance s^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2 is unbiased for \sigma^2, and the n-1 denominator matters: the version with denominator n is biased. The reason is that \bar{X} is computed from the same data. Once the mean is fixed, only n-1 of the deviations (X_i - \bar{X}) are free to vary, so the sum of squared deviations has n-1 degrees of freedom, not n.

An estimator is consistent if \hat{\theta} \to \theta in probability as n \to \infty — more data drives the estimate to the truth. The MLE is consistent under mild regularity conditions.

Why MLE works

The likelihood is the probability of your data given \theta. Maximising it over \theta answers: “which parameter value would make this data least surprising?” That question has a principled answer even when no other method is obvious — just write down the distribution, form the product, and differentiate.

62.3 Confidence intervals

A point estimate is a single number. It conceals the uncertainty. A confidence interval (CI) pairs the estimate with a margin that quantifies how much the estimate might be off.

62.3.1 The problem with small samples

If \sigma (the population standard deviation) were known and the data normally distributed, the sampling distribution of \bar{X} would be exactly N(\mu, \sigma^2/n), and you could build an interval from standard normal z critical values. But \sigma is almost never known — you estimate it from the sample as s. When n is small, this extra source of uncertainty matters.

Student’s t-distribution handles this. The statistic

T = \frac{\bar{X} - \mu}{s/\sqrt{n}}

follows a t-distribution with \nu = n-1 degrees of freedom.

The name “Student” comes from a pseudonym used by William Gosset, a statistician working for Guinness who published under that pen name.

Read “s/\sqrt{n}” as the standard error of the mean — the typical spread of sample means around the true mean. The t-distribution looks like the standard normal but with heavier tails; as \nu \to \infty it converges to the normal.

62.3.2 Constructing a 95% CI for \mu

Let t_{\alpha/2,\,n-1} be the upper \alpha/2 critical value of the t-distribution with n-1 degrees of freedom — the value such that P(T > t_{\alpha/2,\,n-1}) = \alpha/2. For a 95% CI, \alpha = 0.05.

Inverting |T| \leq t_{\alpha/2,\,n-1} gives the interval. Substituting T = (\bar{X} - \mu)/(s/\sqrt{n}) and expanding the absolute value:

-t_{\alpha/2,\,n-1} \leq \frac{\bar{X} - \mu}{s/\sqrt{n}} \leq t_{\alpha/2,\,n-1}

Multiply through by s/\sqrt{n} and rearrange to isolate \mu:

\bar{X} - t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}} \;\leq\; \mu \;\leq\; \bar{X} + t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}}

Written compactly:

\bar{X} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}

This is the 95% confidence interval for \mu. Every confidence interval does the same structural job: estimate at the centre, uncertainty as the margin.

Worked example. A materials lab measures Young’s modulus for 10 specimens of a polymer. The sample mean is \bar{x} = 3.42 GPa and the sample standard deviation is s = 0.18 GPa. The t-table gives t_{0.025,\,9} = 2.262.

\text{CI} = 3.42 \pm 2.262 \times \frac{0.18}{\sqrt{10}} = 3.42 \pm 2.262 \times 0.0569 = 3.42 \pm 0.129

The 95% CI is (3.29,\; 3.55) GPa.
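The worked example can be reproduced with scipy supplying the t critical value instead of a printed table:

```python
# Young's modulus CI from the worked example: n = 10, x-bar = 3.42, s = 0.18.
from math import sqrt
from scipy import stats

n, xbar, s = 10, 3.42, 0.18
t_crit = stats.t.ppf(0.975, df=n - 1)      # upper 2.5% point, 9 d.o.f.
margin = t_crit * s / sqrt(n)

print(f"t critical value = {t_crit:.3f}")                        # 2.262
print(f"95% CI = ({xbar - margin:.2f}, {xbar + margin:.2f}) GPa")  # (3.29, 3.55)
```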

62.3.3 What a confidence interval means — and what it does not

The 95% CI does not mean “there is a 95% probability that \mu lies in this interval.” After the data are observed, \mu is either in the interval or it isn’t — there is no probability involved.

The correct interpretation: if you repeated this experiment many times and built a 95% CI each time, about 95% of those intervals would contain the true \mu. The procedure has 95% coverage. Any particular interval is just one instance of that procedure.

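The long-run coverage claim can be simulated directly. A sketch under assumed conditions (normal population with \mu = 50, \sigma = 5; sample size and repetition count are arbitrary choices):

```python
# Draw many samples, build a 95% CI from each, count how often it covers mu.
import random
from math import sqrt
from scipy import stats

random.seed(1)
mu, sigma, n, reps = 50.0, 5.0, 15, 2000
t_crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # n-1 denominator
    margin = t_crit * s / sqrt(n)
    covered += (xbar - margin <= mu <= xbar + margin)

print(f"empirical coverage = {covered / reps:.3f}")   # close to 0.95
```

Any single interval either covers \mu or it doesn't; the 95% is a property of the procedure across repetitions.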

62.4 Hypothesis testing

Point estimates and CIs describe what the data show. Hypothesis testing asks a sharper question: is there enough evidence to reject a specific claim?

62.4.1 The framework

The null hypothesis H_0 is the claim under scrutiny — the default, conservative position. The alternative hypothesis H_1 is what you’d conclude if H_0 is implausible given the data.

A test statistic is a function of the data whose distribution under H_0 is known. You compute the observed value, then ask: if H_0 were true, how often would you see a value this extreme or more extreme? That probability is the p-value.

If the p-value is below the significance level \alpha (typically 0.05), you reject H_0.

Two error types:

                      H_0 true                                       H_0 false
Reject H_0            Type I error (false positive), probability \alpha   Correct
Fail to reject H_0    Correct                                        Type II error (false negative), probability \beta

Choosing \alpha = 0.05 means you’re willing to falsely reject a true H_0 in 5% of experiments. Reducing \alpha reduces Type I errors but increases Type II errors. The power of a test is 1 - \beta — the probability of correctly detecting a real effect.

62.4.2 One-sample t-test

Hypotheses. You want to test whether a population mean \mu equals a specified value \mu_0: H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \quad \text{(two-sided)}

Test statistic.

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

Under H_0 this follows t_{n-1}.

Decision rule. Reject H_0 if |t| > t_{\alpha/2,\,n-1}.

Worked example. A valve is designed to open at 200 kPa. Quality control samples n = 20 valves. The sample mean opening pressure is \bar{x} = 203.4 kPa with s = 5.2 kPa. Test H_0: \mu = 200 at \alpha = 0.05.

t = \frac{203.4 - 200}{5.2/\sqrt{20}} = \frac{3.4}{1.163} = 2.923

The critical value is t_{0.025,\,19} = 2.093. Since 2.923 > 2.093, reject H_0. There is statistically significant evidence that the mean opening pressure differs from 200 kPa.
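The valve example in code, working from the summary statistics; the two-sided p-value comes from the t survival function:

```python
# One-sample t-test from summary stats: n = 20, x-bar = 203.4, s = 5.2, mu0 = 200.
from math import sqrt
from scipy import stats

n, xbar, s, mu0 = 20, 203.4, 5.2, 200.0
t_obs = (xbar - mu0) / (s / sqrt(n))
t_crit = stats.t.ppf(0.975, df=n - 1)
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 1)   # two-sided p-value

print(f"t = {t_obs:.3f}, critical value = {t_crit:.3f}")   # 2.924 vs 2.093
print(f"p-value = {p_value:.4f}, reject H0: {abs(t_obs) > t_crit}")
```

The p-value below 0.05 and the |t| above the critical value are the same decision stated two ways.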

62.4.3 Two-sample t-test

To compare means \mu_1 and \mu_2 from two independent populations (assuming equal variances), form the pooled variance:

s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}

The test statistic is:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}

Under H_0: \mu_1 = \mu_2, this follows t_{n_1+n_2-2}. Reject when |t| > t_{\alpha/2,\,n_1+n_2-2}.

Welch’s t-test removes the equal-variance assumption by adjusting the degrees of freedom; it’s the default in most software. The pooled test is a little more powerful when equal variances really are plausible, while Welch is safer when they may not be.
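The two variants can be compared side by side on the same summary statistics (the numbers below are hypothetical) using scipy's from-stats interface; only the equal_var flag changes:

```python
# Pooled vs Welch two-sample t-test from summary statistics.
from scipy import stats

kwargs = dict(mean1=105.0, std1=8.0, nobs1=12,     # hypothetical group 1
              mean2=100.0, std2=12.0, nobs2=15)    # hypothetical group 2

pooled = stats.ttest_ind_from_stats(**kwargs, equal_var=True)
welch = stats.ttest_ind_from_stats(**kwargs, equal_var=False)

print(f"pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

With similar variances the two answers are close; they diverge as the variances and sample sizes become unbalanced.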

62.4.4 \chi^2 goodness-of-fit

When the data are counts in categories, the \chi^2 statistic tests whether the observed frequencies are consistent with a specified distribution. For k categories with observed counts O_i and expected counts E_i = n p_i:

\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}

Under H_0, this follows approximately \chi^2_{k-1}. Reject if \chi^2 > \chi^2_{\alpha,\,k-1}. The rule of thumb E_i \geq 5 in every cell ensures the approximation is reliable.
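A sketch of the statistic on a classic case — testing a die for fairness. The observed counts are made up; with 60 rolls, every expected count is 10, comfortably above the rule of thumb:

```python
# Chi-square goodness of fit: is this die fair?
from scipy import stats

observed = [8, 12, 9, 11, 6, 14]          # hypothetical counts, 60 rolls total
expected = [10] * 6                        # fair die: E_i = 60 * (1/6)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
result = stats.chisquare(observed, f_exp=expected)

print(f"chi2 by hand = {chi2:.2f}")        # 4.20
print(f"scipy: chi2 = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

With 5 degrees of freedom the p-value is well above 0.05, so these counts give no evidence against fairness.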

62.5 Linear regression

Data rarely have a single variable of interest in isolation. Regression asks how one variable changes with another — and how reliably.

62.5.1 The model

The simple linear regression model is:

Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n

Y_i is the response (outcome), x_i is the predictor (input), \beta_0 is the intercept (read “beta-zero” — the expected response when x = 0), \beta_1 is the slope (read “beta-one” — the change in expected response per unit increase in x), and \varepsilon_i is the error term, assumed i.i.d. N(0, \sigma^2).

62.5.2 OLS estimators

Ordinary least squares minimises the sum of squared residuals:

\text{SS}_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2

Read “\text{SS}_{res}” as “residual sum of squares” — the total vertical distance squared between the data points and the fitted line.

Differentiate \text{SS}_{res} with respect to \hat{\beta}_0 and \hat{\beta}_1 and set both partial derivatives to zero. This gives the normal equations. Here “normal” means perpendicular or orthogonal, not “normal distribution”:

\frac{\partial\,\text{SS}_{res}}{\partial\hat{\beta}_0} = -2\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0

\frac{\partial\,\text{SS}_{res}}{\partial\hat{\beta}_1} = -2\sum x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0

From the first equation: distribute the sum and use \sum_{i=1}^n 1 = n:

\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \implies n\bar{y} - n\hat{\beta}_0 - \hat{\beta}_1 \cdot n\bar{x} = 0

Divide by n: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.

Substituting into the second and solving:

\boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}}

\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}}

where S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y}) and S_{xx} = \sum(x_i - \bar{x})^2. This means the fitted line is pinned to the centre of the data cloud at (\bar{x},\bar{y}) and tilts according to how strongly x and y move together.
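The closed-form estimates are a few lines of arithmetic. A sketch on a small hypothetical data set:

```python
# OLS by hand: beta1-hat = S_xy / S_xx, beta0-hat = y-bar - beta1-hat * x-bar.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]             # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]            # hypothetical responses

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
Sxx = sum((x - xbar) ** 2 for x in xs)

b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
print(f"beta1-hat = {b1:.3f}, beta0-hat = {b0:.3f}")   # 1.990, 0.050
```

Note that the fitted line passes through (\bar{x}, \bar{y}) by construction.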

62.5.3 R^2 — coefficient of determination

Define the total sum of squares \text{SS}_{tot} = \sum(y_i - \bar{y})^2 and the residual sum of squares \text{SS}_{res} = \sum(y_i - \hat{y}_i)^2. The coefficient of determination is:

R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}

R^2 = 1 means the line passes through every data point. R^2 = 0 means the line explains none of the variation — it is no better than simply predicting \bar{y} for every observation. In simple linear regression, R^2 = r^2 where r is the Pearson correlation coefficient, the sample analogue of the population correlation \rho from the previous chapter.

62.5.4 t-test for \beta_1

Under the model assumptions, the standardised slope estimate

t = \frac{\hat{\beta}_1 - 0}{s_e / \sqrt{S_{xx}}}

follows t_{n-2} under H_0: \beta_1 = 0, where s_e^2 = \text{SS}_{res}/(n-2) is the estimated error variance. The denominator is n-2 rather than n-1 because two parameters (\beta_0 and \beta_1) are estimated from the data, consuming two degrees of freedom. Reject H_0 if |t| > t_{\alpha/2,\,n-2}. Failing to reject means the data show no statistically significant linear relationship between x and Y; it does not prove that no relationship exists.
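The whole pipeline — fit, R^2, slope test — follows the formulas above. A sketch on a small hypothetical data set:

```python
# Fit, R-squared, and the t-test for the slope, all from first principles.
from math import sqrt
from scipy import stats

xs = [1.0, 2.0, 3.0, 4.0, 5.0]             # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in xs]
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ss_tot = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

se2 = ss_res / (n - 2)                     # estimated error variance, n-2 d.o.f.
t_slope = b1 / sqrt(se2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"R^2 = {r2:.4f}")
print(f"t = {t_slope:.1f} vs critical {t_crit:.3f}, reject H0: {abs(t_slope) > t_crit}")
```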

62.5.5 Residual analysis

The regression assumptions — linearity, constant variance, normality of errors, independence — are checked through the residuals \hat{\varepsilon}_i = y_i - \hat{y}_i.

Plot residuals against fitted values \hat{y}_i:

  • Random scatter around zero — assumptions satisfied.
  • Curved pattern — linearity is violated; consider a transformation or a polynomial term.
  • Fan shape (variance increasing with \hat{y}_i) — heteroscedasticity; consider a log or square-root transformation of Y.
  • Outliers — single points far from zero; investigate individually.

A normal Q-Q plot of residuals checks the normality assumption. Points falling close to the diagonal line support the assumption.


62.6 One-way ANOVA

When you want to compare means across three or more groups, running multiple t-tests inflates the Type I error rate. One-way ANOVA (analysis of variance) tests all groups simultaneously.

62.6.1 The F-test

Suppose you have k groups, with n_j observations in group j. Let \bar{y}_{j\cdot} denote the mean of group j and \bar{y} the overall mean. Define:

\text{SS}_{between} = \sum_{j=1}^k n_j (\bar{y}_{j\cdot} - \bar{y})^2 \qquad \text{(variation explained by group membership)}

\text{SS}_{within} = \sum_{j=1}^k \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{j\cdot})^2 \qquad \text{(variation within groups)}

The mean squares are \text{MS}_{between} = \text{SS}_{between}/(k-1) and \text{MS}_{within} = \text{SS}_{within}/(N-k), where N = \sum n_j.

The F-statistic is:

F = \frac{\text{MS}_{between}}{\text{MS}_{within}}

Under H_0: \mu_1 = \mu_2 = \cdots = \mu_k, F follows F_{k-1,\,N-k}. Reject H_0 if F > F_{\alpha,\,k-1,\,N-k}. A large F means between-group variation is large relative to within-group variation — the groups genuinely differ.

Why F instead of multiple t-tests?

With k = 3 groups, running all pairwise t-tests at \alpha = 0.05 gives a familywise error rate of approximately 1 - (0.95)^3 \approx 0.14. You’d falsely reject a true null about 14% of the time rather than 5%. ANOVA controls this by testing all groups in a single F-test. If ANOVA rejects H_0, post-hoc pairwise tests (Tukey HSD, Bonferroni) then identify which groups differ.
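The familywise arithmetic is worth writing out once:

```python
# With m independent tests at level alpha, the chance of at least one
# false rejection under a true null is 1 - (1 - alpha)^m.
alpha, m = 0.05, 3
familywise = 1 - (1 - alpha) ** m
print(f"familywise error rate = {familywise:.4f}")   # 0.1426
```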

Engineering application. Three machine settings — A, B, C — produce polymer film. Quality engineers measure tensile strength (MPa) from five runs at each setting:

Setting A Setting B Setting C
41.2 38.5 44.1
40.8 39.2 43.8
42.1 37.9 45.0
41.5 38.8 44.3
40.4 39.6 43.4

Group means: \bar{y}_{A} = 41.2, \bar{y}_{B} = 38.8, \bar{y}_{C} = 44.12. Overall mean \bar{y} = 41.37.

\text{SS}_{between} = 5(41.2-41.37)^2 + 5(38.8-41.37)^2 + 5(44.12-41.37)^2 = 5(0.0289) + 5(6.6049) + 5(7.5625) = 0.1445 + 33.02 + 37.81 = 70.98

\text{MS}_{between} = 70.98/2 = 35.49

Computing \text{SS}_{within} from the deviations within each group gives 1.70 + 1.70 + 1.43 = 4.83, so \text{MS}_{within} = 4.83/12 = 0.402.

F = \frac{35.49}{0.402} \approx 88.2

The critical value is F_{0.05,\,2,\,12} = 3.89. Since 88.2 \gg 3.89, reject H_0. The machine settings produce significantly different tensile strengths — the effect is not noise.
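As a cross-check, scipy's f_oneway recomputes the F-statistic directly from the raw table:

```python
# One-way ANOVA on the tensile-strength data from the worked example.
from scipy import stats

A = [41.2, 40.8, 42.1, 41.5, 40.4]
B = [38.5, 39.2, 37.9, 38.8, 39.6]
C = [44.1, 43.8, 45.0, 44.3, 43.4]

result = stats.f_oneway(A, B, C)
print(f"F = {result.statistic:.1f}, p = {result.pvalue:.2e}")
print(f"reject H0 at alpha = 0.05: {result.pvalue < 0.05}")
```

The tiny p-value matches the hand calculation's verdict: the group means are not plausibly equal.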

62.7 Where this goes

The methods here assume a parametric form for the distributions — Bernoulli, exponential, normal. Non-parametric tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) make weaker assumptions at the cost of some power; they become relevant when distributions are heavily skewed or sample sizes are very small. Bayesian inference reframes the problem entirely: instead of asking “what does the data tell us about \theta?” it asks “given a prior belief about \theta and the data, what should our updated belief be?” Both extensions build on exactly the likelihood functions you derived here.

Multiple linear regression extends the single-predictor model to Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon. The OLS estimator in matrix form is \hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}, which requires the linear algebra from Volume 7’s linear algebra section. Generalised linear models (GLMs) then extend regression to non-normal responses — logistic regression for binary outcomes, Poisson regression for counts — connecting directly to the MLE framework you already have.

Where this shows up

  • Reliability engineering: MLE estimates failure rates from field data; confidence intervals bound the estimate; hypothesis tests determine whether a design change made a difference.
  • Clinical trials: two-sample t-tests compare treatment and control groups; sample size is set to achieve target power.
  • Process control: t-tests on subgroup means detect shifts in a manufacturing process mean; ANOVA compares multiple line configurations simultaneously.
  • Geophysics: linear regression of seismic travel time against distance estimates subsurface velocity; the t-test for \beta_1 determines whether the relationship is real.
  • Machine learning: the OLS solution is closed-form linear regression; R^2 and residual analysis evaluate model fit before any regularisation is added.

62.8 Exercises

These are puzzles. Each has a clean numerical answer. The work is in setting up the likelihood, test statistic, or normal equations correctly before computing.


Exercise 1. An engineer records the time to failure (hours) for five identical electronic components under accelerated testing:

x_1 = 820, \quad x_2 = 1\,040, \quad x_3 = 650, \quad x_4 = 930, \quad x_5 = 760

Assuming lifetimes are Exponential(\lambda), find \hat{\lambda} (the MLE) and its standard error. Use the fact that for the Exponential MLE, \text{SE}(\hat{\lambda}) = \hat{\lambda}/\sqrt{n}.


Exercise 2. A tensile test on 16 specimens of an aluminium alloy yields a sample mean yield strength of \bar{x} = 48.2 MPa and sample standard deviation s = 6.4 MPa. Construct a 95% confidence interval for the population mean yield strength. Use t_{0.025,\,15} = 2.131.


Exercise 3. A CNC milling process is specified to produce shafts with a mean diameter of 50.00 mm. A quality technician measures a random sample of 25 shafts and finds \bar{x} = 47.8 mm and s = 5.1 mm. Test the null hypothesis H_0: \mu = 50 against H_1: \mu \neq 50 at significance level \alpha = 0.05. Use t_{0.025,\,24} = 2.064. State your conclusion.


Exercise 4. Two aluminium alloys are compared for tensile strength. Alloy 1 (n_1 = 10, \bar{x}_1 = 312 MPa, s_1 = 14 MPa) and Alloy 2 (n_2 = 12, \bar{x}_2 = 298 MPa, s_2 = 18 MPa). Assuming equal population variances, test H_0: \mu_1 = \mu_2 against H_1: \mu_1 \neq \mu_2 at \alpha = 0.05. Use t_{0.025,\,20} = 2.086.


Exercise 5. A process engineer suspects that curing temperature x (°C) affects bond strength Y (N/mm²). Five runs are recorded:

x 120 130 140 150 160
Y 8.1 9.4 10.8 11.5 12.9
  1. Compute the OLS estimates \hat{\beta}_0 and \hat{\beta}_1.
  2. Compute R^2.
  3. Test H_0: \beta_1 = 0 at \alpha = 0.05, given t_{0.025,\,3} = 3.182.

Exercise 6. Three CNC machine settings produce the following daily output yields (units per shift):

Setting 1 Setting 2 Setting 3
102 98 115
105 101 112
99 97 118
103 100 114

Perform a one-way ANOVA F-test for equality of means at \alpha = 0.05. Use F_{0.05,\,2,\,9} = 4.26. State whether setting affects yield.