Toil Reduction: Identifying and Eliminating Operational Toil

What SRE means by 'toil,' how to identify it systematically, and a practical framework for deciding what to automate first.

By perf-test.com Editorial · AI-assisted
Tags: toil, automation, sre

In SRE terminology, toil is operational work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly with service growth. It is not just “work you don’t enjoy” but a specific category with a specific fix: automate it away, or stop doing it.

The defining characteristics

Google’s SRE books define toil by several traits, and the more of them a piece of work matches, the more likely it is to be toil: it’s manual (a human has to do it directly), repetitive (the same actions recur), automatable (a machine could do it, even if none currently does), tactical (interrupt-driven and reactive, not strategic), has no enduring value (completing it doesn’t make the system permanently better), and scales linearly with service size or traffic (more users or services means proportionally more toil, not a fixed cost).

Why the “scales linearly” trait matters most for prioritization

Toil that scales with growth becomes a growing tax over time even if it’s individually small today — a manual provisioning step that takes 10 minutes per new service is fine at 5 services and crushing at 200. Engineering work that doesn’t scale this way (a one-time migration, a fixed architecture decision) isn’t toil in this specific sense, even if it’s tedious — the distinction matters because it changes the urgency of automating it.
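To make the arithmetic concrete, here is a minimal sketch of how a per-service manual step grows with service count. The 10-minute figure comes from the example above; the function name and the assumption that each service needs the step exactly once are illustrative only.

```python
def provisioning_toil_hours(services: int, minutes_per_service: float = 10.0) -> float:
    """Total hands-on time in hours, assuming one manual step per service."""
    return services * minutes_per_service / 60.0


# The cost grows in direct proportion to the number of services.
for count in (5, 50, 200):
    hours = provisioning_toil_hours(count)
    print(f"{count:>3} services: {hours:.1f} hours of manual provisioning")
```

Going from 5 to 200 services multiplies the manual time by 40, which is the linear growth the definition describes. If the step recurs (for example on every redeploy), the total grows further still.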

A practical toil audit

Track time spent on different categories of work for a few weeks (many teams use a simple tagging system on tickets/tasks) and look specifically for: manual deploy steps, manual scaling/capacity adjustments, repetitive incident-response actions that are always the same regardless of the specific incident, and recurring manual data fixes or backfills. The goal isn’t perfect measurement — it’s surfacing the highest-volume recurring categories worth automating first.
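As a rough sketch of what the audit output could look like, the snippet below totals logged toil time per category and ranks the categories by volume. The `WorkEntry` shape, the tag names, and the sample numbers are all hypothetical; a real audit would read these from whatever ticketing or time-tracking system the team already uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorkEntry:
    category: str   # hypothetical ticket tag, e.g. "manual-deploy"
    minutes: float  # hands-on time logged against the ticket
    is_toil: bool   # whether the team tagged this entry as toil


def toil_minutes_by_category(entries):
    """Sum toil minutes per category, largest total first."""
    totals = defaultdict(float)
    for entry in entries:
        if entry.is_toil:
            totals[entry.category] += entry.minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data standing in for a few weeks of tagged work.
sample_entries = [
    WorkEntry("manual-deploy", 30, True),
    WorkEntry("manual-deploy", 45, True),
    WorkEntry("capacity-adjustment", 20, True),
    WorkEntry("data-backfill", 90, True),
    WorkEntry("design-review", 120, False),  # project work, excluded
]

for category, minutes in toil_minutes_by_category(sample_entries):
    print(f"{category}: {minutes:.0f} min")
```

The ranking does not need to be precise. It only needs to show which recurring categories consume the most time.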

Setting an explicit toil budget

Some SRE teams explicitly cap toil at a target percentage of total work time. A commonly cited reference point is Google’s goal of keeping toil below roughly 50% of each SRE’s time, though the right number depends on team and context. Making toil visible as a tracked metric, rather than leaving it as an invisible background tax, is what creates organizational pressure to invest in automating it instead of tolerating it indefinitely.
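A budget check can be as simple as dividing toil hours by total hours and comparing against the cap. The sketch below assumes a 50% cap and invented weekly numbers; the function names and the per-engineer breakdown are illustrative, not a prescribed method.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a value between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def is_over_budget(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when the toil share exceeds the cap (assumed 50% by default)."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical weekly figures: (toil hours, total hours worked).
weekly_hours = {"engineer-a": (26, 40), "engineer-b": (12, 40)}

for name, (toil, total) in weekly_hours.items():
    status = "over" if is_over_budget(toil, total) else "within"
    print(f"{name}: {toil_fraction(toil, total):.0%} toil, {status} budget")
```

Tracking the fraction per person as well as per team helps catch cases where the team average looks healthy but one engineer absorbs most of the toil.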

The trap of automating the wrong thing first

Automating a rare, low-volume task feels productive but doesn’t move the needle. Prioritize by total time cost (frequency × duration), not by which task is most annoying in the moment or easiest to automate. A toil audit’s main value is that it bases this prioritization on data rather than on gut feel skewed toward whatever went wrong most recently.
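The frequency × duration rule is easy to apply once the audit data exists. The following sketch ranks candidate tasks by monthly time cost; the task names and numbers are invented for illustration.

```python
from typing import NamedTuple


class ToilTask(NamedTuple):
    name: str
    runs_per_month: float
    minutes_per_run: float


def monthly_minutes(task: ToilTask) -> float:
    """Total time cost per month: frequency multiplied by duration."""
    return task.runs_per_month * task.minutes_per_run


def rank_by_time_cost(tasks):
    """Order tasks so the largest monthly time cost comes first."""
    return sorted(tasks, key=monthly_minutes, reverse=True)


# Hypothetical candidates: one rare but painful task, two frequent ones.
candidates = [
    ToilTask("certificate rotation", runs_per_month=1, minutes_per_run=120),
    ToilTask("restart stuck worker", runs_per_month=60, minutes_per_run=5),
    ToilTask("manual capacity bump", runs_per_month=8, minutes_per_run=20),
]

for task in rank_by_time_cost(candidates):
    print(f"{task.name}: {monthly_minutes(task):.0f} min/month")
```

In this made-up data the five-minute restart costs more per month than the two-hour rotation, even though the rotation is the one people are likely to remember as painful.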

Automation has its own maintenance cost

Automating toil away doesn’t make the underlying work cost zero — it converts recurring manual cost into a one-time build cost plus an ongoing (usually much smaller) maintenance cost for the automation itself. Be honest about this when justifying automation investment: the payoff is real, but it’s not literally free forever.
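One hedged way to put numbers on this trade-off is a break-even estimate: divide the one-time build cost by the net monthly saving (manual time removed minus maintenance time added). The function and the figures below are illustrative assumptions, not a standard formula from the SRE literature.

```python
from typing import Optional


def breakeven_months(
    build_hours: float,
    manual_hours_per_month: float,
    maintenance_hours_per_month: float,
) -> Optional[float]:
    """Months until the automation has paid back its build cost.

    Returns None when maintenance costs at least as much as the manual
    work it replaces, in which case the automation never pays off.
    """
    net_saving = manual_hours_per_month - maintenance_hours_per_month
    if net_saving <= 0:
        return None
    return build_hours / net_saving


# Hypothetical numbers: 40 hours to build, replacing 10 hours/month of
# manual work, with 2 hours/month of upkeep for the automation itself.
print(breakeven_months(40, 10, 2))
```

With these assumed numbers the automation pays for itself after five months. A result of `None` signals that the proposed automation would cost more to maintain than the toil it removes.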

Connecting toil reduction to on-call health

Toil and on-call burnout (covered elsewhere on this site) are closely linked. A large share of painful on-call experience is often exactly the toil a proper audit would surface, such as manual, repetitive incident actions and manual scaling. Reducing toil is therefore frequently one of the highest-leverage on-call quality-of-life investments available.

Takeaway: toil is a specific, definable category of work (manual, repetitive, automatable, tactical, without enduring value, and scaling with growth). Treating it as a tracked metric with an explicit budget, rather than an ambient background annoyance, is what drives sustained investment in eliminating it.
