70  Nonlinear optimisation for design and operations

Tradeoffs, constraints, and the shape of a best solution

Make the structure lighter without making it fail. Increase throughput without violating a pressure limit. Improve prediction quality without blowing the compute budget. Once the world stops being linear, “the best answer” is no longer a corner point on a clean polygon. It is a shape in a curved space, and getting there becomes part geometry, part computation, part judgment.

This chapter gathers the themes that the earlier chapters developed separately. Simulation creates a model of system behaviour. Estimation tells you what is known and uncertain. Reliability tells you what risk costs you. Optimisation turns all of that into a decision rule.

When the objective and constraints are nonlinear in the design variables, the geometry changes, the algorithms change, and the meaning of “best” becomes more local and more conditional.


70.1 What this chapter helps you do

Symbols to keep handy

These are the bits of notation you'll see a lot. If a line of symbols feels like a fence, read it out loud once, then keep going.

  • \mathcal{L}(\mathbf{x}, \lambda): the Lagrangian

  • \nabla f: grad f — the direction of steepest local increase

  • \lambda_i: lambda sub i — a Lagrange multiplier

  • h(\mathbf{x}) = 0: h of x equals zero — an equality constraint

  • g_i(\mathbf{x}) \leq 0: g sub i of x less than or equal to zero — an inequality constraint

  • f(\mathbf{x}): f of x bold — the objective function

Definitions to keep handy

These are the words we keep coming back to. If one feels slippery, come back here and steady it before you push on.

  • nonlinear optimisation: Finding the best design choice when the objective or constraints bend the space (they are not linear).

  • local optimum: A best answer in a neighbourhood, not necessarily the best anywhere.

  • gradient: The direction of steepest local increase; stepping against it decreases the objective fastest locally.

  • constraint: A hard limit the solution must satisfy.

  • Lagrange multiplier: A bookkeeping number that tells you how much the objective would improve if you relaxed a constraint slightly.

Here is the main move this chapter is making, in plain terms. You do not need to be fast. You just need to keep the thread.

  • Coming in: Real design problems are rarely linear, and the best answer depends on what counts as acceptable tradeoff.

  • Leaving with: Gradient-based and constrained nonlinear optimisation generalise the linear programming mindset to realistic engineering design spaces.

70.2 From linear to nonlinear thinking

In linear programming, the feasible set is a polytope and the optimum sits at a vertex. In nonlinear optimisation, neither statement is generally true.

You now have:

  • a nonlinear objective f(\mathbf{x})
  • nonlinear constraints g_i(\mathbf{x}) \leq 0
  • possibly many local optima instead of one clean global answer

The design variable vector \mathbf{x} may represent dimensions, control settings, schedule variables, material choices, or hyperparameters. The mathematics does not care what professional name the variables carry. It cares how objective and constraints bend the space.

The simplest unconstrained update rule is gradient descent:

Gradient descent, in words

\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)

says: “look at the local slope, then take a step downhill.”

  • \nabla f points uphill (steepest increase).
  • -\nabla f points downhill (steepest local decrease).
  • \alpha controls how big a step you dare take.

The negative gradient points in the direction of steepest local decrease. Choose a step size \alpha, move, and repeat.

In practice the difficulties are exactly the ones you would expect in a curved space: the descent direction is only locally good, the step size can be too large or too small, and constraints can make the locally best direction infeasible.
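The update loop can be sketched in a few lines of Python. This is an illustration only, with a made-up objective f(x, y) = x^2 + 2y^2 and a fixed step size, not the chapter's method:

```python
# A minimal gradient-descent sketch (illustrative objective and step size).
# f(x, y) = x^2 + 2*y^2 has its unique minimum at (0, 0).

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + 2*y^2: points uphill.
    return 2 * x, 4 * y

def gradient_descent(x, y, alpha=0.1, iters=200):
    for _ in range(iters):
        gx, gy = grad_f(x, y)
        # Step against the gradient: steepest local decrease.
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

x_star, y_star = gradient_descent(5.0, 3.0)
print(x_star, y_star)  # both coordinates approach the minimiser (0, 0)
```

Notice that a fixed `alpha` works here only because the objective is a well-scaled quadratic; the next paragraphs explain what goes wrong more generally.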

Convergence and local optima. Gradient descent converges to a local minimum, not necessarily a global one. For convex problems — convex feasible set, convex objective — every local minimum is also global, so any stationary point found is a global optimum and convergence guarantees are strong. For non-convex problems, gradient descent may stop at a local minimum or saddle point. The algorithm has no way to know whether a better basin exists elsewhere in the space. Global optimisation (basin-hopping, simulated annealing, branch-and-bound for mixed-integer problems) is a separate, harder task. Most real engineering design problems, and essentially all deep-learning objectives, are non-convex.

Step size. A fixed \alpha that is too large causes divergence; too small makes convergence impractically slow. In practice, the step size is set by a line search — a procedure that finds an \alpha satisfying a formal sufficient-decrease condition. The Armijo condition requires that the objective drop by at least a fixed fraction of the decrease predicted by the local linear model; the stronger Wolfe conditions additionally bound how much the gradient is allowed to change. Either condition ensures progress without committing to a fixed schedule. Alternatively, an adaptive schedule decreases \alpha over iterations. Automatic differentiation and adaptive-rate optimisers have made fixed-rate gradient descent rare in production systems.
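A backtracking line search with the Armijo condition can be sketched as follows. The constant c and the shrink factor are conventional illustrative choices, and the test function is the quadratic from the worked example later in the chapter:

```python
# Backtracking line search with the Armijo sufficient-decrease condition
# (a one-dimensional sketch; c and shrink are conventional values).

def armijo_step(f, grad, x, alpha0=1.0, c=1e-4, shrink=0.5):
    g = grad(x)
    alpha = alpha0
    # Accept alpha only once f drops by at least c * alpha * g^2,
    # the fraction c of the decrease the local linear model predicts.
    while f(x - alpha * g) > f(x) - c * alpha * g * g:
        alpha *= shrink
    return alpha

f = lambda x: x ** 2 - 4 * x + 7      # the worked quadratic from Section 70.6
grad = lambda x: 2 * x - 4
alpha = armijo_step(f, grad, x=5.0)
assert f(5.0 - alpha * grad(5.0)) < f(5.0)  # guaranteed decrease
```

Starting from x = 5, the full step alpha = 1 overshoots the minimiser and is rejected, and the halved step is accepted.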

Computing note: batch, stochastic, and mini-batch gradient descent

The rule above evaluates the full gradient over all data at each step — this is batch gradient descent. In machine learning, where f is a loss summed over millions of examples, evaluating the full gradient per step is computationally prohibitive.

Stochastic gradient descent (SGD) in its strict form uses a single randomly chosen example per step. The gradient estimate is very noisy but each step is cheap. In practice, training uses mini-batch SGD: the gradient is estimated from a randomly sampled subset (a mini-batch, typically 32–512 examples), balancing noise against computational cost.

Modern optimisers — Adam, AdaGrad, RMSProp — extend mini-batch SGD by maintaining per-parameter step sizes. AdaGrad accumulates the squared gradients and divides by their square root, shrinking the effective step for parameters that receive large or frequent updates. RMSProp uses an exponential moving average of squared gradients rather than accumulating them, preventing the step size from decaying to zero. Adam combines an RMSProp-style second moment estimate with a first moment estimate (momentum), making it an adaptive gradient method with memory of both direction and curvature. All three are still gradient-based and inherit the local-minimum limitation of gradient descent.

The convergence theory for stochastic methods differs from the deterministic case (requires diminishing step-size schedules or noise conditions), but the geometric intuition is the same: move opposite to an estimate of the gradient, iterate, and check for convergence.
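A mini-batch SGD loop for a one-parameter least-squares loss might look like this. The data, batch size, and step size are illustrative assumptions, not from the text:

```python
import random

# Mini-batch SGD sketch for least squares:
# loss(w) = mean over i of (w * x_i - y_i)^2, with true slope w = 3.
# Data, batch size, and step size are illustrative choices.

random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 101)]

w, alpha, batch_size = 0.0, 1e-4, 32
for _ in range(2000):
    batch = random.sample(data, batch_size)  # random mini-batch, no replacement
    # Gradient of the mini-batch loss with respect to w.
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= alpha * g                            # step against the gradient estimate

print(w)  # close to the true slope 3.0
```

Each step sees only 32 of the 100 points, so the gradient estimate is noisy, yet the iterates still settle near the true slope — the geometric intuition described above.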


70.3 Constrained optimisation and the Lagrangian idea

Suppose you want to minimise f(\mathbf{x}) subject to an equality constraint

h(\mathbf{x}) = 0

At an optimum, you usually cannot move in an arbitrary direction because many directions leave the feasible surface. The Lagrangian is

\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda h(\mathbf{x})

The Lagrangian, in words

Think of \mathcal{L} as “objective + a constraint-enforcement term”.

  • f(\mathbf{x}) is the thing you are trying to make small (cost, error, mass, risk).
  • h(\mathbf{x}) = 0 is the rule you must obey.
  • \lambda is a balancing knob the maths introduces so that “best” and “allowed” can be true at the same time.

At the solution, \lambda is not a design choice you pick. It is the number that makes the optimality conditions work out, and it often has a real interpretation: how sensitive the objective is to relaxing the constraint.

The stationary conditions are

\nabla_{\mathbf{x}} \mathcal{L} = 0, \qquad h(\mathbf{x}) = 0

This means constrained optimality is a balance condition: at the solution, the objective cannot improve without pushing against the geometry of the constraint.

Geometrically, the gradient of the objective lines up with the gradient of the constraint: you cannot improve the objective without leaving feasibility.

For inequality constraints the full Karush-Kuhn-Tucker (KKT) conditions are needed. A constraint g_i(\mathbf{x}) \leq 0 is active at a point if g_i(\mathbf{x}) = 0 (the point sits on the constraint boundary) and inactive if the strict inequality holds (the constraint has slack). The KKT conditions extend the Lagrangian stationarity conditions by requiring that each active constraint contributes a non-negative multiplier and that inactive constraints contribute nothing (\lambda_i = 0). The intuition is the same as the equality case: at an optimum you cannot improve the objective without violating at least one binding constraint, so the gradients of objective and active constraints must be in balance.
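The KKT requirements can be checked numerically on a toy problem (not from the text): minimise f(x) = (x-2)^2 subject to g(x) = x - 1 \leq 0. The unconstrained minimiser x = 2 is infeasible, so the constraint is active at the optimum x = 1:

```python
# KKT check for a one-variable inequality-constrained problem (illustrative):
# minimise f(x) = (x - 2)^2  subject to  g(x) = x - 1 <= 0.

x_star = 1.0                       # candidate optimum, on the boundary g = 0
grad_f = 2 * (x_star - 2)          # = -2: the objective still wants to increase x
grad_g = 1.0
lam = -grad_f / grad_g             # stationarity: grad_f + lam * grad_g = 0

assert x_star - 1.0 <= 0                        # primal feasibility
assert lam >= 0                                 # non-negative multiplier (active)
assert abs(grad_f + lam * grad_g) < 1e-12       # stationarity
assert abs(lam * (x_star - 1.0)) < 1e-12        # complementary slackness
print(lam)  # 2.0
```

The multiplier lam = 2 also carries the sensitivity interpretation from the definitions above: relaxing the bound slightly would lower the objective at roughly that rate.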

70.4 Tradeoffs are the real subject

Nonlinear optimisation is a tradeoff chapter, not just an algorithm chapter.

If you minimise mass, stress may rise. If you maximise throughput, queueing risk may rise. If you improve prediction accuracy, compute cost may rise. The objective function is therefore not neutral. It encodes what you are willing to care about and what you are willing to pay.

That is why sensitivity analysis matters. A computed optimum is only useful if you can explain how it shifts when assumptions, weights, or constraints change.

Note: A self-check

Before running any solver, you should be able to answer three questions about your formulation: (1) What does the objective reward or penalise? (2) Are the constraints physically motivated, or are some just computational convenience? (3) If the optimum shifts when you tighten a constraint by 10%, does that shift make physical sense?

If you cannot answer all three, the formulation is not ready.

70.5 The core method

A first pass through a nonlinear optimisation problem usually goes like this:

  1. Choose the design variables.
  2. Write the objective in a way that matches the actual decision.
  3. Write the constraints in physically meaningful form.
  4. Inspect the geometry qualitatively before trusting an algorithm.
  5. Use a numerical method appropriate to smoothness and constraint structure.
  6. Check whether the result is feasible, local or global, and sensitive to modelling assumptions.

This is where all the earlier Volume 8 chapters return. If the model, estimate, or risk terms are poor, the optimisation result will be polished nonsense.

70.6 Worked example 1: unconstrained quadratic design objective

Minimise

f(x) = x^2 - 4x + 7

The derivative is

f'(x) = 2x - 4

Set it to zero:

2x - 4 = 0 \qquad \Rightarrow \qquad x = 2

The second derivative is

f''(x) = 2 > 0

so x=2 is a local minimum. The minimum value is

f(2) = 4 - 8 + 7 = 3

The core logic is visible here in its simplest form: stationary point, then curvature check.
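A coarse numerical scan confirms the hand calculation (a quick sketch, not a substitute for the curvature check):

```python
# Numerical cross-check of the worked example: grid-scan f around the answer.
f = lambda x: x ** 2 - 4 * x + 7

xs = [i / 100 for i in range(-200, 600)]  # grid from -2.00 to 5.99
x_best = min(xs, key=f)                   # grid point with smallest objective
print(x_best, f(x_best))                  # 2.0 3.0
```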

70.7 Worked example 2: equality-constrained optimisation

Minimise

f(x,y) = x^2 + y^2

subject to

x + y - 1 = 0

Form the Lagrangian:

\mathcal{L}(x,y,\lambda) = x^2 + y^2 + \lambda(x+y-1)

Take derivatives:

\frac{\partial \mathcal{L}}{\partial x} = 2x + \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial y} = 2y + \lambda = 0, \qquad x + y - 1 = 0

The first two equations give \lambda = -2x and \lambda = -2y, so

x = y

Then the constraint gives

2x = 1 \qquad \Rightarrow \qquad x = y = \frac{1}{2}

So the feasible point closest to the origin is

\left(\frac{1}{2}, \frac{1}{2}\right)

This is the simplest geometric picture of constrained optimisation: the objective’s level sets touch the feasible set at the best allowable point.
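The hand calculation can be cross-checked with a numerical solver. Assuming SciPy is available, its SLSQP method handles the equality constraint directly:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical cross-check of the Lagrangian result:
# minimise x^2 + y^2 subject to x + y - 1 = 0.
f = lambda v: v[0] ** 2 + v[1] ** 2
con = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(f, x0=np.array([5.0, -3.0]), method="SLSQP", constraints=[con])
print(res.x)  # approximately [0.5, 0.5], matching the hand calculation
```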

70.8 Worked example 3: design under a penalty tradeoff

Suppose an engineer uses the objective

J(x) = (x-3)^2 + 0.5x^4

One term rewards staying near a target value x=3; the quartic term penalises large magnitude. Real design objectives often mix performance and regularisation in exactly this way.

Different penalty weights produce different optima. That means the “best” design is partly a statement about priorities, not just about calculus.

In machine learning the same structure recurs. Ridge regression (L2 regularisation) adds a penalty \lambda\|\mathbf{w}\|^2 to the loss function, trading fit quality for smaller parameter magnitudes. The penalty weight \lambda plays the same role as the 0.5 here: it encodes how strongly the solver is pushed away from large solutions, and changing it shifts the optimum. Lasso (L1 regularisation) uses \lambda\|\mathbf{w}\|_1 and produces sparse solutions — many weights driven to exactly zero — for a geometric reason: the L1 ball has corners on the coordinate axes, and when the objective’s level sets first touch the constraint boundary they typically land at a corner. Note that the L1 norm is not differentiable at zero, so the standard Lagrangian gradient condition does not directly apply there; the analysis uses subgradients, which generalise gradients to non-smooth functions, but the physical intuition (level set touching constraint boundary at a corner) holds without that machinery. Portfolio risk penalties and operations scheduling with nonlinear costs follow the same two-term structure.
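The contrast between the two penalties is visible in one dimension, where both minimisers have closed forms. This is an illustrative sketch: the objective (w - a)^2 / 2 plus a penalty stands in for the regression loss:

```python
# One-dimensional sketch of the two penalties (closed-form minimisers).
# Minimising (w - a)^2 / 2 + lam * w^2   (ridge) gives w = a / (1 + 2*lam).
# Minimising (w - a)^2 / 2 + lam * |w|   (lasso) gives soft-thresholding.

def ridge(a, lam):
    return a / (1 + 2 * lam)

def soft_threshold(a, lam):
    if abs(a) <= lam:
        return 0.0                     # small coefficients driven exactly to zero
    return a - lam if a > 0 else a + lam

print(ridge(0.3, 0.5))           # shrunk toward zero, but never exactly zero
print(soft_threshold(0.3, 0.5))  # exactly 0.0: the sparsity effect
```

The non-smooth kink of |w| at zero is what lets the lasso solution sit exactly at zero, while the smooth ridge penalty only ever scales coefficients down.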

70.9 Where this goes

There is no single next topic from here. The continuation is the practice of using the whole Volume 8 stack together: models, estimates, risks, and objectives inside one decision loop.

The mathematics is no longer a supporting subject. It is the working medium of design judgment.

Tip: Applications
  • shape and parameter optimisation in engineering design
  • calibration and tuning of controllers
  • operations and scheduling under nonlinear costs
  • portfolio and risk optimisation
  • hyperparameter and training-objective tuning in ML systems
  • multi-objective design under uncertainty

70.10 Exercises

These are project-style exercises. Always explain what “better” means in the problem, not only where the derivative vanishes.

70.10.1 Exercise 1

Minimise

f(x) = x^2 - 6x + 10

Find the minimiser and the minimum value.

70.10.2 Exercise 2

Use Lagrange multipliers to minimise

f(x,y) = x^2 + 4y^2

subject to

x + y = 3

70.10.3 Exercise 3

An engineering team is choosing a design variable x to minimise

J(x) = (x-2)^2 + 0.2x^4

Write a short interpretation note answering:

  1. what each term in the objective is doing
  2. why this is not a linear optimisation problem
  3. why changing the penalty weight 0.2 would change the preferred design

70.10.4 Exercise 4

Choose one optimisation problem from your own field and prepare a one-page design brief naming:

  1. the design variables
  2. the objective
  3. the constraints
  4. one source of nonlinearity
  5. one uncertainty or reliability issue that should influence the formulation
  6. one reason a computed optimum might still be a bad decision
