SLOs and Error Budgets: A Practical Guide for Performance Engineers
How to turn vague reliability goals into measurable SLIs, SLOs, and error budgets — and how that math directly governs release velocity and on-call load.
“The site should be fast and reliable” is not a target you can act on. SRE turns that wish into arithmetic: SLIs measure behavior, SLOs set the bar, and the error budget is what’s left over to spend. Get this right and most reliability arguments resolve themselves.
The three terms
- SLI (Service Level Indicator): a metric that reflects user experience. Good ones are ratios of good events to total events, such as successful requests / total requests, or requests faster than 300 ms / total requests.
- SLO (Service Level Objective): the target for an SLI over a window, for example 99.9% of requests succeed over 28 days.
- Error budget: 1 − SLO, the amount of unreliability you’re allowed to spend. A 99.9% SLO leaves a budget of 0.1%.
The error-budget math
A 99.9% availability SLO over 28 days gives you:
budget = (1 − 0.999) × 28 days
= 0.001 × 40,320 minutes
≈ 40.3 minutes of allowed downtime
That 40 minutes is a budget, not a failure. If you’ve spent only 5 minutes so far in the window, you have about 35 minutes of headroom to ship risky changes. If you’ve burned 38, you have roughly 2 minutes left: freeze features and put the effort into hardening.
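The arithmetic above is easy to script. The following is a minimal sketch, not a definitive tool: the function names `error_budget_minutes` and `budget_remaining_minutes` are illustrative, and it assumes the SLO is a time-based availability target expressed as a fraction.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Assumes `slo` is a fraction (0.999 for 99.9%), not a percentage.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_minutes(slo: float, window_days: float,
                             downtime_minutes: float) -> float:
    """Budget left after the downtime already spent; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, 28)
    print(f"budget: {budget:.1f} min")
    print(f"after 5 min of downtime: {budget_remaining_minutes(0.999, 28, 5):.1f} min left")
    print(f"after 38 min of downtime: {budget_remaining_minutes(0.999, 28, 38):.1f} min left")
```

For a 99.9% SLO over 28 days this reports a budget of about 40.3 minutes, with about 35.3 minutes left after 5 minutes of downtime and about 2.3 minutes left after 38.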
| SLO | Allowed downtime / 28 days | Allowed / 30 days |
|---|---|---|
| 99% | ~6.7 hours | ~7.2 hours |
| 99.9% | ~40 min | ~43 min |
| 99.95% | ~20 min | ~22 min |
| 99.99% | ~4 min | ~4.3 min |
Each extra nine cuts the error budget by a factor of ten, and the cost and operational toil of staying inside it tend to climb steeply. Pick the lowest SLO your users will tolerate, not the highest you can imagine.
Latency SLOs, not just availability
Availability alone hides slowness. A request that takes 12 seconds “succeeded” but the user left. Define a latency SLO as a ratio:
SLI = count(requests faster than threshold) / count(all requests)
SLO = 99% of requests complete within 300 ms over 28 days
Always reason in percentiles, never averages — the mean is dominated by the happy path and hides the tail where users actually suffer.
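To make the ratio concrete, here is a small sketch of the latency SLI computed from raw request latencies. The function name `latency_sli` and the sample data are made up for illustration, and the sketch assumes that "within 300 ms" includes requests that take exactly 300 ms.

```python
import statistics
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    # Assumption: a request at exactly the threshold counts as good.
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


if __name__ == "__main__":
    # Hypothetical sample: 97 fast requests and 3 very slow ones.
    sample = [80.0] * 97 + [6000.0] * 3
    sli = latency_sli(sample, threshold_ms=300.0)
    print(f"mean latency: {statistics.mean(sample):.1f} ms")
    print(f"SLI: {sli:.2%}, meets 99% SLO: {sli >= 0.99}")
```

In this sample the mean is 257.6 ms, comfortably under the 300 ms threshold, yet only 97% of requests are fast enough, so the 99% SLO is missed. That gap is the tail the average hides.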
Burn rate: the alert that matters
Don’t alert on raw error rate; alert on burn rate — how fast you’re consuming the budget relative to the window.
burn_rate = (errors_in_window / requests_in_window) / (1 − SLO)
A burn rate of 1 means that, if sustained, you’ll exhaust the budget exactly at the window’s end. A burn rate of 14.4 would exhaust a 30-day budget in about 2 days (30 / 14.4 ≈ 2.1), so page someone now. Multi-window, multi-burn-rate alerts (fast burn plus slow burn) give you urgency without noise.
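The burn-rate formula and a simple two-window paging check can be sketched as follows. This is an illustration under assumptions, not a production alerting rule: the function names are hypothetical, and pairing a long window with a short one (for example 1 hour and 5 minutes) at a 14.4 threshold is one common configuration rather than something this guide prescribes.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - SLO)."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return (errors / requests) / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which cuts pages for incidents already over.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    rate = burn_rate(errors=144, requests=10_000, slo=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"30-day budget gone in {days_to_exhaustion(rate, 30):.1f} days")
    print(f"page: {should_page(long_window_rate=rate, short_window_rate=rate)}")
```

A slow-burn alert would follow the same pattern with a lower threshold and longer windows, typically routed to a ticket rather than a page.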
How budgets govern velocity
The policy is the point:
- Budget remaining → ship features, take risks, run experiments.
- Budget exhausted → feature freeze; only reliability work until the window resets.
This depersonalizes the eternal dev-vs-ops fight. The number decides, not the loudest voice in the incident channel.
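Written as code, the policy is a single comparison. The sketch below is only an illustration of the two-state rule above: the name `release_policy` and the string labels are hypothetical, and a real policy would likely add intermediate states and exceptions for security fixes.

```python
def release_policy(budget_total_min: float, budget_spent_min: float) -> str:
    """Return "ship" while budget remains, "freeze" once it is exhausted."""
    remaining = budget_total_min - budget_spent_min
    return "ship" if remaining > 0 else "freeze"


if __name__ == "__main__":
    # 40.32 minutes is the 99.9% / 28-day budget computed earlier in the guide.
    print(release_policy(40.32, 5.0))
    print(release_policy(40.32, 41.0))
```

The first call returns "ship" and the second "freeze". The value of encoding the rule is that the decision is made before the incident, not argued during it.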
Where to start
Instrument one user-facing SLI, set a deliberately loose SLO, watch it for a month, then tighten. An SLO you actually measure beats a perfect one you only aspire to. Try the math yourself with the upcoming error-budget calculator.