SLOs and Error Budgets: A Practical Guide for Performance Engineers

How to turn vague reliability goals into measurable SLIs, SLOs, and error budgets — and how that math directly governs release velocity and on-call load.

By perf-test.com Editorial · AI-assisted
Tags: slo, sli, error-budget, reliability

“The site should be fast and reliable” is not a target you can act on. SRE turns that wish into arithmetic: SLIs measure behavior, SLOs set the bar, and the error budget is what’s left over to spend. Get this right and most reliability arguments resolve themselves.

The three terms

  • SLI (Service Level Indicator): a metric that reflects user experience. Good ones are ratios of good events to total events: successful requests / total requests, or requests faster than 300 ms / total requests.
  • SLO (Service Level Objective): the target for an SLI over a window. For example, 99.9% of requests succeed over 28 days.
  • Error budget: 1 − SLO. The amount of unreliability you’re allowed to spend over that window.
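The three definitions map onto very little code. The following is a minimal sketch, not a reference implementation: the `SliWindow` type, the function name, and the request counts are all hypothetical, chosen only to illustrate the ratio form of an SLI and the `1 − SLO` budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliWindow:
    """Good/total event counts observed over one SLO window (hypothetical type)."""
    good_events: int
    total_events: int

    def sli(self) -> float:
        # Assumption: with no traffic nothing failed, so report a perfect SLI.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction of events: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


# Illustrative numbers only: 500 failed requests out of one million.
window = SliWindow(good_events=999_500, total_events=1_000_000)
slo = 0.999
print(f"SLI = {window.sli():.4%}, SLO met: {window.sli() >= slo}")
print(f"Error budget = {error_budget_fraction(slo):.4%} of requests")
```

Keeping the SLI as a good/total ratio means the same two functions work for availability and for latency; only the definition of a "good" event changes.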

The error-budget math

A 99.9% availability SLO over 28 days gives you:

budget = (1 − 0.999) × 28 days
       = 0.001 × 40,320 minutes
       ≈ 40.3 minutes of allowed downtime

That 40 minutes is a budget, not a failure. If you’ve spent only 5 minutes so far in the window, you have headroom to ship risky changes. If you’ve burned 38, you freeze features and put the effort into hardening instead.
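The same arithmetic can be sketched in a few lines of Python. The function names here are illustrative assumptions; the 99.9% SLO, the 28-day window, and the 5 minutes already spent come from the example above.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Total downtime the SLO permits over the window: (1 - SLO) x window."""
    return window * (1.0 - slo)


def remaining_budget(slo: float, window: timedelta, spent: timedelta) -> timedelta:
    """Budget left after subtracting downtime already spent (can go negative)."""
    return allowed_downtime(slo, window) - spent


window = timedelta(days=28)
budget = allowed_downtime(0.999, window)
left = remaining_budget(0.999, window, spent=timedelta(minutes=5))

print(f"budget: {budget.total_seconds() / 60:.1f} min")  # budget: 40.3 min
print(f"left:   {left.total_seconds() / 60:.1f} min")    # left:   35.3 min
```

Passing `timedelta(days=30)` instead gives the 30-day figures for the same SLO.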

SLO       Allowed downtime / 28 days    Allowed downtime / 30 days
99%       ~6.7 hours                    ~7.2 hours
99.9%     ~40 min                       ~43 min
99.95%    ~20 min                       ~22 min
99.99%    ~4 min                        ~4.3 min

Every extra nine cuts the budget by a factor of ten, and cost and operational toil climb steeply with it. Pick the lowest SLO your users will tolerate, not the highest you can imagine.

Latency SLOs, not just availability

Availability alone hides slowness. A request that takes 12 seconds “succeeded” but the user left. Define a latency SLO as a ratio:

SLI = count(requests faster than threshold) / count(all requests)
SLO = 99% of requests complete within 300 ms over 28 days

Always reason in percentiles, never averages — the mean is dominated by the happy path and hides the tail where users actually suffer.
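To see how an average can hide the tail, here is a small sketch under made-up data: 98 requests at 100 ms and 2 at 8 seconds. The sample values, the helper names, and the nearest-rank percentile method are assumptions chosen for illustration; the 300 ms threshold and 99% target are the ones from the SLO above.

```python
import math
import statistics


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the SLO
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic sample: a fast happy path plus a small, very slow tail.
samples = [100.0] * 98 + [8000.0] * 2

mean_ms = statistics.mean(samples)
sli = latency_sli(samples, threshold_ms=300.0)

print(f"mean = {mean_ms:.0f} ms")                  # mean = 258 ms
print(f"p50  = {percentile(samples, 50):.0f} ms")  # p50  = 100 ms
print(f"p99  = {percentile(samples, 99):.0f} ms")  # p99  = 8000 ms
print(f"SLI  = {sli:.1%}, meets 99% SLO: {sli >= 0.99}")  # SLI  = 98.0%, meets 99% SLO: False
```

The mean sits under the 300 ms threshold and looks healthy, yet the ratio SLI misses the 99% target and the p99 shows the 8-second tail.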

Burn rate: the alert that matters

Don’t alert on raw error rate; alert on burn rate — how fast you’re consuming the budget relative to the window.

burn_rate = (errors_in_window / requests_in_window) / (1 − SLO)

A burn rate of 1 means you’ll exactly exhaust the budget by the window’s end. A burn rate of 14.4 means you’ll burn a 30-day budget in about 2 days (30 / 14.4 ≈ 2.1), so page someone now. Multi-window, multi-burn-rate alerts (a fast-burn alert plus a slow-burn alert) give you urgency without noise.
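The burn-rate formula and a fast-burn check can be sketched as follows. Treat this as an illustration under stated assumptions: the function names, the request and error counts, and the choice of a 1-hour lookback paired with a 5-minute lookback are hypothetical; only the formula and the 14.4 threshold come from the text.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate divided by the budgeted error rate (1 - SLO)."""
    if requests == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (errors / requests) / (1.0 - slo)


def days_to_exhaustion(rate: float, slo_window_days: float) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return slo_window_days / rate


def should_page(long_rate: float, short_rate: float, threshold: float = 14.4) -> bool:
    """Fast-burn check: page only if both lookbacks exceed the threshold.

    The long lookback shows the burn is significant; the short one shows it is
    still happening, so the alert clears soon after the problem stops.
    """
    return long_rate >= threshold and short_rate >= threshold


slo = 0.999
# Hypothetical counts for a 1-hour lookback and a 5-minute lookback.
hour_rate = burn_rate(errors=200, requests=10_000, slo=slo)
five_min_rate = burn_rate(errors=18, requests=800, slo=slo)

print(f"1h burn rate: {hour_rate:.1f}")
print(f"5m burn rate: {five_min_rate:.1f}")
print(f"page: {should_page(hour_rate, five_min_rate)}")
print(f"days to exhaust a 30-day budget at 14.4x: {days_to_exhaustion(14.4, 30):.1f}")
```

A slow-burn alert would reuse the same functions with a lower threshold and longer lookbacks.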

How budgets govern velocity

The policy is the point:

  1. Budget remaining → ship features, take risks, run experiments.
  2. Budget exhausted → feature freeze; only reliability work until the window resets.
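The two-state policy above is simple enough to encode as a release gate. This is a hedged sketch: `ReleasePolicy` and `release_policy` are hypothetical names, and a real team might freeze before the budget hits zero rather than exactly at exhaustion.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP = "ship features, take risks, run experiments"
    FREEZE = "feature freeze; reliability work only"


def release_policy(budget_minutes: float, spent_minutes: float) -> ReleasePolicy:
    """Return SHIP while any budget remains, FREEZE once it is exhausted."""
    remaining = budget_minutes - spent_minutes
    return ReleasePolicy.SHIP if remaining > 0 else ReleasePolicy.FREEZE


# Using the 28-day, 99.9% budget of roughly 40.3 minutes from earlier.
print(release_policy(budget_minutes=40.3, spent_minutes=5.0).name)   # SHIP
print(release_policy(budget_minutes=40.3, spent_minutes=41.0).name)  # FREEZE
```

Wiring a check like this into a deploy pipeline is one way to make the number, rather than a person, decide.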

This depersonalizes the eternal dev-vs-ops fight. The number decides, not the loudest voice in the incident channel.

Where to start

Instrument one user-facing SLI, set a deliberately loose SLO, watch it for a month, then tighten. An SLO you actually measure beats a perfect one you only aspire to. Try the math yourself with the upcoming error-budget calculator.
