How to Calculate an Error Budget, Step by Step
A step-by-step walkthrough of calculating an error budget from an SLO, with worked examples at different reliability targets.
This site’s SLO article covers why error budgets matter for balancing reliability against release velocity; this one walks through the actual calculation mechanics step by step, since getting the arithmetic right (and knowing what it implies) is where the practical value lives.
Step 1: define the SLO and measurement window
An SLO is a target percentage over a defined window, for example “99.9% of requests succeed, measured over a rolling 28-day window.” The window length matters. With a shorter window, bad events age out sooner, so the budget recovers faster after a bad period, but the smaller total budget is less forgiving of any single bad day. A longer window smooths out short-term noise, but the consequences of a bad period (a feature freeze, for instance) linger longer.
Step 2: calculate total budget for the window
error_budget = (1 - SLO) × total_time_in_window
For a 99.9% SLO over 28 days: (1 - 0.999) × 28 days = 0.001 × 40,320 minutes ≈ 40.3 minutes of allowed downtime (or allowed “bad” time, for non-availability SLIs) for the entire window.
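The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not this site's calculator; the function name `error_budget` is an assumption made for the example, and the SLO is passed as a fraction (0.999 rather than 99.9).

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for the window: (1 - SLO) x window length."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


# The worked example from the text: 99.9% over a 28-day window.
budget = error_budget(0.999, timedelta(days=28))
budget_minutes = budget.total_seconds() / 60
print(f"{budget_minutes:.1f} minutes of budget")
```

Returning a `timedelta` rather than a bare number keeps the unit explicit, which helps when the same budget is later compared against summed incident durations.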
Step 3: calculate budget consumed so far
For time-based availability SLOs, this is straightforward: sum the downtime (more precisely, the time during which the SLI condition wasn’t met) that falls within the current window. For request-based SLIs (like “99% of requests succeed”), use the observed counts instead: the fraction of budget consumed is bad_requests / ((1 - SLO) × total_requests), with both counts taken over the window.
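Both ways of measuring consumption can be sketched as follows. The incident durations and request counts are hypothetical numbers chosen for illustration, and the function names are assumptions, not part of any particular monitoring tool. The request-based version assumes the counts cover the whole rolling window.

```python
from datetime import timedelta


def consumed_fraction_time(downtime: list[timedelta], total_budget: timedelta) -> float:
    """Time-based SLI: share of the budget used by the summed bad intervals."""
    return sum(downtime, timedelta()) / total_budget


def consumed_fraction_requests(bad: int, total: int, slo: float) -> float:
    """Request-based SLI: bad requests relative to the allowed bad requests."""
    if total == 0:
        return 0.0  # no traffic observed, so nothing consumed
    allowed_bad = (1.0 - slo) * total
    return bad / allowed_bad


# Hypothetical incidents inside one 28-day window with a 99.9% SLO.
budget = timedelta(days=28) * (1.0 - 0.999)  # about 40.3 minutes
incidents = [timedelta(minutes=12), timedelta(minutes=5)]
time_used = consumed_fraction_time(incidents, budget)

# Hypothetical traffic: 1,000,000 requests, 4,000 of them bad, 99% SLO.
requests_used = consumed_fraction_requests(4_000, 1_000_000, 0.99)

print(f"time-based: {time_used:.1%}, request-based: {requests_used:.1%}")
```

A value above 1.0 from either function means the budget for the window is already overspent.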
Step 4: calculate remaining budget and burn rate
remaining_budget = total_budget - consumed_budget
burn_rate = (current_period_bad_event_rate) / (1 - SLO)
Here the bad event rate is the fraction of events (or of time) that failed the SLI over the period being measured. A burn rate of 1.0 means you’re consuming budget at exactly the rate that would exhaust it at the window’s end if sustained. A burn rate of 10 means you’d exhaust a 28-day budget in about 2.8 days (28 / 10) if that rate continued. This is the number that should drive alerting urgency, as covered in the discussion of multi-window burn rate alerts in this site’s SLO article.
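A small sketch of the burn rate formula and the time-to-exhaustion arithmetic it implies. The function names and the `remaining_fraction` parameter are assumptions introduced for this example; `remaining_fraction` is the share of the window's budget still unspent (1.0 for a full budget).

```python
import math


def burn_rate(bad_fraction: float, slo: float) -> float:
    """Observed bad-event fraction divided by the allowed fraction (1 - SLO)."""
    return bad_fraction / (1.0 - slo)


def days_until_exhausted(rate: float, window_days: float,
                         remaining_fraction: float = 1.0) -> float:
    """Days until the remaining budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return math.inf  # nothing is being consumed
    return remaining_fraction * window_days / rate


# Hypothetical: 1% of requests failing against a 99.9% SLO.
rate = burn_rate(0.01, 0.999)
days_full = days_until_exhausted(rate, 28)
days_half = days_until_exhausted(rate, 28, remaining_fraction=0.5)

print(f"burn rate {rate:.1f}: {days_full:.1f} days from a full budget, "
      f"{days_half:.1f} days if half is already spent")
```

Scaling by the remaining fraction is what ties burn rate back to the remaining budget: the same burn rate is more urgent when less budget is left.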
Worked example at multiple SLO levels
| SLO | Budget per 28 days | Budget per 30 days |
|---|---|---|
| 99% | ~6.7 hours | ~7.2 hours |
| 99.9% | ~40.3 min | ~43.2 min |
| 99.95% | ~20.2 min | ~21.6 min |
| 99.99% | ~4.0 min | ~4.3 min |
Each additional “nine” of reliability shrinks the allowed budget by a factor of ten (the 99.95% row is a half step, cutting the 99.9% budget in half). That is why the operational cost of each additional nine (tooling, on-call rigor, architectural redundancy) tends to grow disproportionately too. The point is worth making explicitly when stakeholders propose a stricter SLO without appreciating what it costs to sustain.
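The table values can be regenerated with a short loop, which is also a convenient way to extend it to other SLO levels or window lengths. This is a sketch; `budget_minutes` is a name chosen for the example, and all budgets are printed in minutes (the 99% row in the table above is shown in hours).

```python
WINDOW_DAYS = (28, 30)
SLOS = (0.99, 0.999, 0.9995, 0.9999)


def budget_minutes(slo: float, days: int) -> float:
    """Error budget in minutes: (1 - SLO) x minutes in the window."""
    return (1.0 - slo) * days * 24 * 60


for slo in SLOS:
    cells = [f"{budget_minutes(slo, d):8.1f} min / {d}d" for d in WINDOW_DAYS]
    print(f"{slo:.2%}", *cells)
```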
Applying this to request-based, not just time-based, SLIs
For an SLI like “percentage of requests completing within 300ms,” the same budget math applies, denominated in request counts rather than clock time: error_budget_requests = (1 - SLO) × total_requests_in_window. The two framings are not interchangeable: an outage of a given length consumes far more of a request-based budget at peak traffic than off-peak, while a time-based budget charges both the same. Measure consumption in the same unit as the SLO’s denominator.
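A sketch of the request-denominated budget, with a hypothetical comparison of the same 10-minute outage at two traffic levels. The traffic figures, the function names, and the simplifying assumption that every request fails during the outage are all made up for illustration. The SLO is passed as a string so that `Decimal` can represent it exactly and the allowed count is not thrown off by floating-point noise.

```python
from decimal import Decimal


def request_budget(slo: str, total_requests: int) -> int:
    """Allowed bad requests in the window: (1 - SLO) x total requests."""
    allowed = (Decimal(1) - Decimal(slo)) * total_requests
    return int(allowed)  # truncate: a partial request cannot be spent


def failed_requests(outage_minutes: int, requests_per_second: int) -> int:
    """Bad requests from an outage, assuming every request fails during it."""
    return outage_minutes * 60 * requests_per_second


# Hypothetical window: 1 billion requests at a 99.9% SLO.
budget = request_budget("0.999", 1_000_000_000)

peak = failed_requests(10, 2_000)     # 10-minute outage at 2,000 req/s
off_peak = failed_requests(10, 200)   # same outage at 200 req/s

print(f"budget: {budget:,} bad requests")
print(f"peak outage uses {peak / budget:.0%}, off-peak uses {off_peak / budget:.0%}")
```

Under a time-based SLO both outages would cost the same 10 minutes; counted in requests, the peak-hour outage alone exceeds the window's budget in this example.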
A practical takeaway for tooling
This site’s error-budget calculator (in development; see the tools page for current calculators) automates exactly this calculation. The value of doing it by hand at least once, as walked through here, is building the intuition for why each additional nine leaves ten times less budget headroom and costs disproportionately more to sustain, which a calculator alone won’t necessarily convey.
Takeaway: the error budget calculation itself is simple arithmetic — the real skill is in choosing an appropriate measurement window and then using burn rate, not just raw remaining budget, to drive both alerting urgency and ship/freeze policy decisions.