69  Reliability, stochastic systems, and quality

Failure, variability, and risk over time

One bearing fails early, another lasts for years. A queue stays calm all morning, then suddenly grows faster than it clears. A production line meets tolerance most days but not all days. Randomness is not a side issue here. It changes maintenance, staffing, safety, and design.

Engineering decisions are made before the exact future is known. You do not know which unit will fail next, exactly when a line will back up, or how many defects will appear in the next batch. What you can do is model the variability honestly and let that model shape action.

Three linked ideas run through this chapter: lifetime, queueing, and quality. They look like separate professional topics. They are all ways of asking how randomness accumulates into operational consequence.


69.1 What this chapter helps you do

Symbols to keep handy

These are the bits of notation you'll see a lot. If a line of symbols feels like a fence, read it out loud once, then keep going.

  • \lambda(t): lambda of t — the hazard or failure rate; equals the constant \lambda for the exponential model

  • R(t) = P(T > t): R of t — the reliability function

  • T: capital T — a random lifetime

  • L_q: L sub q — expected queue length

  • \rho = \lambda_a / \mu: rho — traffic intensity; ratio of arrival rate to service rate

  • \lambda_a: lambda sub a — the arrival rate in a queueing system

  • \beta, \eta: beta and eta — Weibull shape and scale parameters

  • F(t) = P(T \leq t): F of t — the failure distribution function

Definitions to keep handy

These are the words we keep coming back to. If one feels slippery, come back here and steady it before you push on.

  • reliability: How likely something is to keep working over time.

  • hazard rate: The instantaneous failure tendency, given survival so far.

  • Weibull model: A flexible lifetime model that can capture early failures, random failures, or wear-out.

  • queue: A waiting line created when arrivals sometimes outpace service.

  • process capability: A way to compare typical variability to the allowed tolerance band.

Here is the main move this chapter is making, in plain terms. You do not need to be fast. You just need to keep the thread.

  • Coming in: Variability is not noise to be ignored. In engineering it changes safety, maintenance, and decision quality.

  • Leaving with: Stochastic models connect randomness to lifetime, throughput, defects, queues, and operational risk.

69.2 Reliability as survival over time

Let T be a random lifetime. The reliability function is

R(t) = P(T > t)

This is the probability that the component or system survives beyond time t.

Reliability, in words

R(t) = P(T > t) means: “what is the chance the thing is still working after time t?”

It is a survival curve. At t = 0 it is 1. As time increases, it falls toward 0.

The failure distribution is

F(t) = P(T \leq t) = 1 - R(t)

If the lifetime is exponentially distributed with constant hazard rate \lambda, then

R(t) = e^{-\lambda t}

and

F(t) = 1 - e^{-\lambda t}

This is the most common first model: simple, and with a clear interpretation — the system has a constant failure tendency per unit time.

The hazard rate \lambda(t) describes the instantaneous failure tendency conditional on survival up to time t. Let f(t) = F'(t) be the probability density of the lifetime. Since R(t) = 1 - F(t), differentiating gives R'(t) = -f(t). The hazard rate is then the density of failure conditional on having survived so far:

\lambda(t) = \frac{f(t)}{R(t)} = \frac{-R'(t)}{R(t)}

Hazard rate, in words

The hazard rate is not “probability of failure.” It is the instantaneous failure tendency right now, given that you have made it to time t.

It answers: “If it has survived so far, how risky is the next moment?”

For the exponential model \lambda(t) = \lambda (constant), which is why the scalar \lambda appears in both the survival function and the hazard. This means hazard is about conditional risk right now, given survival so far, not about total failure probability over the whole lifetime.
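As a quick sketch (the function names here are ours, not a standard library), the exponential reliability, failure, and hazard functions can be evaluated in a few lines of Python:

```python
import math

def exp_reliability(t, lam):
    """R(t) = exp(-lambda * t): probability of surviving beyond time t."""
    return math.exp(-lam * t)

def exp_failure(t, lam):
    """F(t) = 1 - R(t): probability of failure by time t."""
    return 1.0 - exp_reliability(t, lam)

def exp_hazard(t, lam):
    """lambda(t) = f(t) / R(t); constant for the exponential model."""
    f = lam * math.exp(-lam * t)        # density f(t)
    return f / exp_reliability(t, lam)  # algebraically reduces to lam

lam = 0.002  # failures per hour
print(exp_reliability(100, lam))  # about 0.819
print(exp_hazard(100, lam))       # equals lam = 0.002 at any t
```

Computing the hazard from f(t)/R(t) rather than returning lam directly makes the constant-hazard property something you can verify numerically rather than assume.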

69.2.1 Weibull model for non-constant hazard

The exponential model’s constant hazard is a special case that real components often do not satisfy. The Weibull distribution is the standard extension:

R(t) = e^{-(t/\eta)^\beta}

where \eta > 0 is the scale (characteristic life) and \beta > 0 is the shape parameter. The hazard rate is

\lambda(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}

When \beta = 1 the Weibull reduces to the exponential. When \beta > 1 the hazard increases with age — wear-out behaviour typical of fatigue, corrosion, and mechanical degradation. When \beta < 1 the hazard decreases — early-life failures (infant mortality). Reliability engineers fit \beta from field data and use the value to decide whether early screening, scheduled replacement, or condition monitoring is the appropriate response.

In some systems hazard is approximately constant over the operational window of interest and the exponential model is adequate. In others the shape parameter matters for every maintenance and safety decision.
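The three hazard regimes can be seen directly from the Weibull formulas. A minimal sketch (function names are ours; the parameter values are illustrative, not from any dataset):

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """lambda(t) = (beta/eta) * (t/eta)^(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 500.0  # characteristic life, hours
print(weibull_hazard(100, 1.0, eta))  # beta = 1: constant, equals 1/eta
print(weibull_hazard(100, 2.0, eta))  # beta > 1: hazard rises with age (wear-out)
print(weibull_hazard(100, 0.5, eta))  # beta < 1: hazard falls with age (infant mortality)
```

With beta = 1 the survival function reduces to exp(-t/eta), i.e. the exponential model with lambda = 1/eta, which is a useful sanity check when fitting.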


69.3 Queueing as accumulation under uncertainty

Instead of asking when a unit fails, ask what happens when random arrivals meet random service.

In a queueing system:

  • arrivals are uncertain in time
  • service completions are uncertain in time
  • if arrivals temporarily outpace service, a queue forms

Two core quantities are:

L_q = \text{expected number waiting in queue}

W_q = \text{expected waiting time in queue}

These are connected by Little’s Law:

L_q = \lambda_a W_q

The expected queue length equals the arrival rate times the expected waiting time. Little’s Law holds under mild conditions regardless of the arrival or service distribution — it is a conservation statement about flow rather than an assumption about randomness.

For the system to reach a steady state, arrivals must not permanently outpace service: the condition is \rho < 1. If \rho \geq 1 the queue grows without bound — the system is unstable and no finite long-run average exists. When \rho < 1 but is close to 1, the queue length grows sharply, even when mean demand still looks safely below capacity.

Traffic intensity \rho, in words

In the simplest queueing models,

\rho = \frac{\lambda_a}{\mu}

where:

  • \lambda_a is the average arrival rate (jobs per hour, customers per minute, packets per second)
  • \mu is the average service rate (how many jobs per hour one server can finish)

So \rho is a utilisation number. If \rho = 0.8, the server is busy about 80% of the time on average. The trouble is that “80% busy on average” can still mean long waits when variability and bursts are present.

Variability matters even when average values look harmless. A system with average arrival rate only slightly below service rate can still produce long queues if the variability is large enough.

For the simplest tractable case — Poisson arrivals, exponential service, one server (M/M/1) — the expected queue length has a closed form:

L_q = \frac{\rho^2}{1 - \rho}

This grows steeply as \rho \to 1. The worked example below shows the numbers.
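The steep growth near saturation is easy to see numerically. A minimal sketch of the M/M/1 queue-length formula and Little's Law (function names are ours):

```python
def mm1_lq(rho):
    """Expected number waiting in an M/M/1 queue; requires rho < 1."""
    if not 0 <= rho < 1:
        raise ValueError("queue is unstable for rho >= 1")
    return rho ** 2 / (1 - rho)

def mm1_wq(lam_a, mu):
    """Expected waiting time via Little's Law: Wq = Lq / lambda_a."""
    rho = lam_a / mu
    return mm1_lq(rho) / lam_a

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(rho, mm1_lq(rho))  # queue length blows up as rho -> 1
```

Going from 50% to 80% utilisation roughly sextuples the expected queue; going from 95% to 99% more than quintuples it again. Utilisation is cheap until it suddenly is not.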

This is why service reliability in computing and waiting-time models in operations research belong in the same chapter as bearing lifetime and defect rates. They are all stochastic consequences of uncertain events over time.

69.4 Quality as probability of staying in tolerance

A quality problem can often be phrased as:

  • what is the probability that a product characteristic falls outside the acceptable region?

If a machined diameter is modelled as approximately normal with mean \mu and standard deviation \sigma, then defect probability is the area in the tails beyond the specification limits.

In manufacturing engineering this is operationalised through the process capability index. The most common form is

C_p = \frac{\text{USL} - \text{LSL}}{6\sigma}

where USL (upper specification limit) and LSL (lower specification limit) are the boundaries of acceptable product and 6\sigma is the spread of the process distribution. A value of C_p \geq 1.33 is a common minimum standard in production; it means the process spread fits inside the tolerance window with margin. The one-sided version C_{pk} accounts for centring:

C_{pk} = \min\!\left(\frac{\text{USL} - \mu}{3\sigma},\; \frac{\mu - \text{LSL}}{3\sigma}\right)

A process can have adequate spread (C_p high) but still produce defects if the mean is off-centre (C_{pk} low). Capability indices are the bridge between the tail-area probability from the normal model and the tolerance decisions made in a production environment.
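The spread-versus-centring distinction is worth checking with numbers. A minimal sketch (function names are ours; the specification limits are made up for illustration):

```python
def cp(usl, lsl, sigma):
    """Process capability: tolerance width over 6-sigma process spread."""
    return (usl - lsl) / (6 * sigma)

def cpk(usl, lsl, mu, sigma):
    """Centring-aware capability: distance from mean to nearest limit, in 3-sigma units."""
    return min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))

usl, lsl, sigma = 10.2, 9.8, 0.05
print(cp(usl, lsl, sigma))          # about 1.33: spread fits with margin
print(cpk(usl, lsl, 10.0, sigma))   # centred: Cpk equals Cp
print(cpk(usl, lsl, 10.1, sigma))   # mean shifted: Cp unchanged, Cpk drops to about 0.67
```

The last line is the off-centre case described above: the spread is unchanged, but the mean sits closer to the upper limit, so defects concentrate in that tail.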

This is not only a manufacturing question. The same logic appears in service level agreements, anomaly thresholds, and risk tolerances in data systems.

Note: Why this structure works

Reliability, queueing, and quality are all asking the same underlying question: given a random process, what is the probability that a threshold is crossed? For reliability the threshold is failure; for queueing it is demand exceeding capacity; for quality it is a measurement leaving the specification band.

That shared structure means the same probabilistic tools — tail probabilities, expected values, densities — apply across all three, and a result in one domain often has a direct analogue in another.

69.5 The core method

A first pass through a stochastic reliability or quality problem usually goes like this:

  1. Identify the random quantity: lifetime, arrivals, service time, or process variation.
  2. Choose the summary function that matches the decision: reliability, hazard, queue length, waiting time, or defect probability.
  3. Compute the key probability or expected value.
  4. Interpret it in operational terms.
  5. Ask whether the assumed stochastic model is realistic enough for the decision you are about to make.

That final question is essential. A neat model with the wrong variability structure can be more dangerous than admitting uncertainty openly.

69.6 Worked example 1: exponential reliability

Suppose a component has constant failure rate

\lambda = 0.002 \text{ h}^{-1}

Then the reliability at 100 hours is

R(100) = e^{-0.002(100)} = e^{-0.2} \approx 0.819

So there is about an 81.9% chance the component survives beyond 100 hours. Equivalently, the probability of failure by 100 hours is

F(100) = 1 - 0.819 = 0.181

This is already enough to frame a maintenance question. If a system contains many such components, or if failure consequence is high, 18.1% may be too risky to tolerate over that interval.

69.7 Worked example 2: a queue near saturation

A service system receives jobs at average rate

\lambda_a = 8 \text{ jobs/h}

and can serve at average rate

\mu = 10 \text{ jobs/h}

The traffic intensity is

\rho = \frac{\lambda_a}{\mu} = 0.8

For an M/M/1 queue (Poisson arrivals, exponential service, single server), the expected number waiting in queue is

L_q = \frac{\rho^2}{1 - \rho} = \frac{0.64}{0.2} = 3.2 \text{ jobs}

By Little’s Law the expected waiting time in queue is

W_q = \frac{L_q}{\lambda_a} = \frac{3.2}{8} = 0.4 \text{ h}

So at 80% utilisation, a job waits on average 24 minutes before being served. If utilisation rises to \rho = 0.9, L_q = 8.1 jobs and W_q = 1.01 h — more than twice as long for a 12.5% increase in load. That nonlinear blowup is the engineering message.
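The arithmetic in this example is worth verifying directly; a short check of both utilisation levels:

```python
def lq(rho):
    """M/M/1 expected number waiting in queue (rho < 1)."""
    return rho ** 2 / (1 - rho)

lam_a, mu = 8.0, 10.0
rho = lam_a / mu                # 0.8
print(lq(rho))                  # 3.2 jobs
print(lq(rho) / lam_a)          # Wq = 0.4 h, i.e. 24 minutes

# Same service rate, demand up to 9 jobs/h: rho = 0.9
print(lq(0.9))                  # 8.1 jobs
print(lq(0.9) / 9.0)            # Wq = 0.9 h at the higher arrival rate
```

Note the bookkeeping detail: at rho = 0.9 the waiting time must be divided by the new arrival rate, whichever scenario produced that utilisation.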

Systems that look adequate on average can still feel overloaded because variability is experienced in real time, not as a long-run mean.

69.8 Worked example 3: defect probability from a normal model

Suppose a manufactured diameter is approximately normal with mean

\mu = 10.0 \text{ mm}

and standard deviation

\sigma = 0.1 \text{ mm}

The upper specification limit is 10.2 mm. Standardise:

z = \frac{10.2 - 10.0}{0.1} = 2

So the probability of exceeding the upper limit is about

P(Z > 2) \approx 0.0228

or 2.28%.
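The tail probability can be computed without tables, using the complementary error function from Python's standard library (the function name here is ours):

```python
import math

def normal_tail(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma), via the complementary error function."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

print(normal_tail(10.2, 10.0, 0.1))  # about 0.0228, matching the z = 2 table value
```

The identity used is P(Z > z) = erfc(z / sqrt(2)) / 2, which is exact for the standard normal, so the same helper serves any one-sided specification limit.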

That tail probability is the mathematical link between observed variation and quality cost. It is also the same logic used in anomaly detection and service thresholds outside manufacturing.

69.9 Where this goes

The next continuation is Nonlinear optimisation for design and operations. Once risk, waiting, and failure are quantified, design decisions become tradeoffs between cost, performance, safety, and uncertainty.

Tip: Applications
  • maintenance interval planning
  • reliability engineering and survival analysis (exponential and Weibull models)
  • queueing in production, logistics, and computing (M/M/1 and related models)
  • defect rates and process capability (C_p, C_{pk})
  • service availability and incident risk
  • uncertainty-aware operations decisions

69.10 Exercises

These are project-style exercises. State the operational meaning of the number you compute.

69.10.1 Exercise 1

A component has exponential lifetime with failure rate

\lambda = 0.005 \text{ h}^{-1}

Compute the probability it survives beyond 50 hours.

69.10.2 Exercise 2

A queueing system has average arrival rate 12 jobs/h and service rate 15 jobs/h.

  1. Compute the traffic intensity \rho.
  2. Explain whether the system has generous slack or operates near saturation.
  3. Suppose demand grows to 14 jobs/h without any change in service rate. What happens to \rho, and at what arrival rate does the system become dangerously loaded (say, \rho > 0.95)? What operational decision does this suggest?

69.10.3 Exercise 3

A process characteristic is approximately normal with mean 25 and standard deviation 2. The upper specification limit is 29.

Compute the corresponding z value and explain what tail probability it refers to.

69.10.4 Exercise 4

Choose one stochastic setting from your field and prepare a one-page model brief naming:

  1. the random quantity
  2. the decision you need to support
  3. the probability or expectation that matters
  4. one source of model mismatch
  5. one action you would change if the risk turned out to be higher than expected