Chaos Engineering: Testing Reliability by Breaking Things on Purpose

What chaos engineering is, how to run a safe first experiment, and how it connects to error budgets and SLOs.

By perf-test.com Editorial · AI-assisted
Tags: chaos-engineering, sre, reliability

Chaos engineering is the practice of deliberately injecting failure into a system — killing a process, adding network latency, exhausting a resource — to verify that the system actually handles it the way you assume it does, rather than waiting to find out during a real incident.

Why “assume” is the operative word

Most architecture diagrams show retries, failovers, and circuit breakers as boxes and arrows that imply correctness. Whether they actually work under real failure conditions is an empirical question, not something a diagram can answer. Chaos engineering turns that assumption into a tested fact.

Forming a hypothesis, not just causing chaos

A good experiment starts with a falsifiable hypothesis: “if we kill one instance of service X, p99 latency for endpoint Y stays under 500ms because traffic fails over to remaining instances within 2 seconds.” This framing forces you to define what “working” means before the experiment, so the result is unambiguous rather than a vague “seemed okay.”
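
One way to keep the result unambiguous is to write the hypothesis down as data and check the measurements against it mechanically. The sketch below is illustrative only: the Hypothesis class, the p99 and evaluate helpers, and the idea that you already have a list of latency samples and a measured failover time are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable claim about behaviour while a fault is active."""
    description: str
    p99_limit_ms: float      # latency threshold stated in the hypothesis
    failover_limit_s: float  # longest acceptable failover time


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def evaluate(hypothesis, samples_ms, observed_failover_s):
    """Return (holds, details) so the verdict is explicit, not 'seemed okay'."""
    observed_p99 = p99(samples_ms)
    holds = (
        observed_p99 < hypothesis.p99_limit_ms
        and observed_failover_s <= hypothesis.failover_limit_s
    )
    return holds, {"p99_ms": observed_p99, "failover_s": observed_failover_s}
```

For the example hypothesis above, you would construct Hypothesis("kill one instance of service X", p99_limit_ms=500.0, failover_limit_s=2.0) before the experiment and call evaluate with the samples collected while the fault was active. Because the thresholds are fixed in advance, a failed check cannot be explained away afterwards.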

Blast radius: start small

The first rule of running experiments safely is to bound the blast radius. Start in a non-production environment, or in production against a tiny percentage of traffic or a single non-critical instance, and have a clear, fast way to abort if things go worse than expected. Expand scope gradually as confidence builds (more traffic, more critical services, eventually production-wide), not on day one.
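
The two controls described here, a small slice of traffic and a fast abort, can be sketched in a few lines. This is a minimal illustration under stated assumptions: in_blast_radius and run_bounded_experiment are hypothetical helper names, and inject, rollback, and is_healthy stand in for whatever your environment uses to start a fault, undo it, and check steady state.

```python
import time
import zlib


def in_blast_radius(request_id, percent):
    """Deterministically select roughly `percent` percent of request IDs."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < percent * 100


def run_bounded_experiment(inject, rollback, is_healthy,
                           duration_s=60.0, poll_s=1.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, poll a health check, and always roll back.

    Returns "completed" if the health check passed for the whole window,
    or "aborted" if it failed early.
    """
    try:
        inject()
        deadline = clock() + duration_s
        while clock() < deadline:
            if not is_healthy():
                return "aborted"  # abort condition: stop at the first bad check
            sleep(poll_s)
        return "completed"
    finally:
        # Runs on completion, abort, and exceptions alike, so rollback
        # must be safe to call even if injection failed partway.
        rollback()
```

Putting the rollback in a finally block means the abort path cannot be forgotten, and passing clock and sleep as parameters lets the loop be tested without waiting in real time. Purpose-built tools offer more robust versions of both controls.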

Common experiment types

  • Resource exhaustion — CPU, memory, or disk pressure on an instance.
  • Network faults — added latency, packet loss, or full network partition between services.
  • Instance/process termination — killing a service instance to test failover.
  • Dependency failure — simulating a downstream API or database being unavailable or slow.
  • Clock skew — testing behavior when system clocks drift, relevant for distributed systems relying on time synchronization.
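
As a rough illustration of the dependency-failure type, the sketch below wraps an ordinary callable and makes it slow or unreliable at the application level. FaultyDependency and its parameters are hypothetical names invented for this example; real tools usually inject these faults at the network or infrastructure layer instead.

```python
import random
import time


class FaultyDependency:
    """Wrap a callable and inject extra latency and/or errors into it."""

    def __init__(self, call, error_rate=0.0, extra_latency_s=0.0,
                 rng=None, sleep=time.sleep):
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be between 0 and 1")
        if extra_latency_s < 0:
            raise ValueError("extra_latency_s must not be negative")
        self._call = call
        self._error_rate = error_rate
        self._extra_latency_s = extra_latency_s
        self._rng = rng or random.Random()
        self._sleep = sleep

    def __call__(self, *args, **kwargs):
        # Simulate a slow dependency before doing anything else.
        if self._extra_latency_s > 0:
            self._sleep(self._extra_latency_s)
        # Simulate an unavailable dependency for a fraction of calls.
        if self._rng.random() < self._error_rate:
            raise ConnectionError("injected dependency failure")
        return self._call(*args, **kwargs)
```

In a test environment you might wrap a client function, for example FaultyDependency(fetch_profile, error_rate=0.2, extra_latency_s=0.3), where fetch_profile is a placeholder for your own code, and then confirm that the caller's timeouts, retries, and fallbacks behave as the hypothesis predicts.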

Game days: the team-level version

Beyond automated experiments, a game day is a scheduled, deliberate exercise in which a team simulates an incident to practice incident response. Sometimes the incident is real, triggered by an actual injected fault; sometimes it is a tabletop exercise. Either way, a game day tests not just the system’s resilience but also the team’s runbooks, communication, and decision-making under pressure, which automated chaos tooling alone doesn’t exercise.

Connecting to error budgets

Plan chaos experiments with the same error-budget thinking covered in this site’s SLO article. Running a risky experiment when you’ve already burned most of your error budget for the period is poor judgment. The disciplined version of the practice is to run it when the budget is healthy and to treat any budget the experiment itself consumes as a deliberate, tracked spend.
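
The arithmetic behind that judgment can be made explicit. The sketch below assumes a request-based availability SLO. The function names and the 50% reserve threshold are illustrative choices for this example, not a standard, and the expected cost of an experiment is something you would have to estimate yourself.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the period's error budget still unspent (0.0 to 1.0).

    With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are
    allowed; 250 failures so far leaves roughly 75% of the budget.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% target, or no traffic, leaves nothing to spend
    return max(0.0, 1.0 - failed_requests / allowed_failures)


def may_run_experiment(remaining_fraction, expected_cost_fraction,
                       min_reserve=0.5):
    """Allow the experiment only if the budget left after it stays above
    the reserve. The 0.5 default is an assumed policy, not a rule."""
    return remaining_fraction - expected_cost_fraction >= min_reserve
```

Recording expected_cost_fraction before the experiment, and the actual budget consumed after it, is what turns the spend into a tracked decision.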

Tooling

Purpose-built chaos engineering tools (Gremlin, Chaos Mesh, AWS Fault Injection Service, formerly Fault Injection Simulator, and others) provide safer, more controllable fault injection than ad hoc scripts. They are particularly valuable for blast-radius controls and automatic abort conditions, which a hand-rolled script is unlikely to implement as carefully.

Takeaway: chaos engineering isn’t about causing chaos for its own sake — it’s about replacing an assumption (“failover works”) with a tested fact, starting small enough that a wrong assumption doesn’t become a real incident while you’re finding out.
