The RED Method: Rate, Errors, Duration for Service Monitoring
How the RED method gives a simple, consistent framework for monitoring any request-driven service, and how it complements the USE method.
The RED method, popularized by Weaveworks for monitoring microservices, proposes that almost any request-driven service can be adequately monitored with just three metrics per service: Rate, Errors, and Duration. The framework is deliberately minimal and consistent, which makes it easy to apply uniformly across many services.
Rate
The number of requests per second the service is handling. This is the basic throughput signal, useful both for capacity awareness and as context for interpreting the other two metrics: a 1% error rate means about 0.1 failed requests per second at 10 requests/second, but about 100 failed requests per second at 10,000 requests/second.
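The arithmetic behind that comparison can be sketched in a few lines of Python. The function name failed_per_second is hypothetical and only illustrates why an error percentage needs the request rate next to it:

```python
def failed_per_second(requests_per_second: float, error_ratio: float) -> float:
    """Absolute number of failing requests per second for a given rate and error ratio."""
    return requests_per_second * error_ratio


# The same 1% error ratio at two very different request rates.
low_traffic = failed_per_second(10, 0.01)
high_traffic = failed_per_second(10_000, 0.01)

print(f"{low_traffic:.1f} failed requests/s at 10 req/s")
print(f"{high_traffic:.1f} failed requests/s at 10,000 req/s")
```

The error ratio is identical in both cases, but the number of affected clients differs by three orders of magnitude, which is why a RED dashboard typically shows Rate alongside Errors.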
Errors
The rate (or percentage) of requests that result in an error. This maps directly onto the SLI that drives the error budget, covered in this site’s SLO article. For the metric to be meaningful and comparable across services, “error” has to be defined consistently: 5xx responses, specific application-level failure indicators, or both.
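One way to keep the definition consistent is to put it in a single shared function that every service uses. The following is a minimal sketch, assuming the “both” option (5xx responses plus an application-level failure flag); the names is_error, error_ratio, and the app_error field are hypothetical:

```python
from typing import Iterable, Optional, Tuple


def is_error(status_code: int, app_error: Optional[str] = None) -> bool:
    """Shared definition: a 5xx response or any application-level failure counts as an error."""
    return 500 <= status_code <= 599 or app_error is not None


def error_ratio(outcomes: Iterable[Tuple[int, Optional[str]]]) -> float:
    """Fraction of requests classified as errors; 0.0 when there were no requests."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    errors = sum(1 for status, app_error in outcomes if is_error(status, app_error))
    return errors / len(outcomes)


# (status code, application-level failure indicator or None)
sample = [(200, None), (200, "payment_declined"), (503, None), (404, None)]
print(error_ratio(sample))  # 2 of the 4 requests count as errors
```

Under this particular definition a 404 is not an error, while a 200 that carries an application failure is. Whether that is the right choice depends on the service; the point is that every service applies the same rule.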
Duration
Request latency, ideally measured as a distribution (so you can extract percentiles) rather than just an average. The percentiles-over-averages principle emphasized throughout this site applies directly here: a RED dashboard showing only average duration is missing the metric’s most important dimension, the tail latency experienced by the slowest requests.
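A small, made-up sample shows how an average can hide the tail. The sketch below uses a simple nearest-rank percentile over raw samples; the latency values are invented for illustration, and real monitoring systems usually estimate percentiles from histogram buckets instead:

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 95 fast requests (20 ms) and 5 very slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

print("mean:", statistics.mean(latencies_ms))  # 119
print("p50: ", percentile(latencies_ms, 50))   # 20
print("p95: ", percentile(latencies_ms, 95))   # 20
print("p99: ", percentile(latencies_ms, 99))   # 2000
```

The mean of 119 ms describes no request that actually occurred: most clients saw 20 ms and a few saw 2 seconds. Only the high percentile reveals the slow tail.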
Why exactly these three, and not more
The RED method’s appeal is its minimalism: three consistent metrics, instrumented the same way across every service, make it straightforward to build one dashboard template that applies to any new service without bespoke design effort each time. This consistency has real organizational value. An engineer unfamiliar with a specific service can still understand its basic health at a glance, because every service’s dashboard follows the same RED structure.
What RED deliberately doesn’t cover
RED describes request-driven services from the outside, that is, what a client experiences. It doesn’t cover internal resource utilization (CPU, memory, disk), which is what the complementary USE method (covered in this site’s dedicated article) addresses. The two are meant to be used together: RED for service-level, client-facing health; USE for the underlying resource health that often explains why RED metrics are degrading.
Implementing RED with Prometheus and Grafana
RED metrics map naturally onto the Prometheus metric types covered in this site’s Prometheus/Grafana article: Rate from a counter (rate(requests_total[5m])), Errors from the same kind of counter filtered to error status codes, and Duration from a histogram queried via histogram_quantile() over the rate of its bucket series. This is a standard, well-trodden implementation pattern with widely available dashboard templates (e.g. via Grafana’s dashboard library), so you don’t need to design it entirely from scratch.
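Because the three queries have the same shape for every service, they can be generated from a template. The sketch below builds the PromQL strings in Python. The metric names http_requests_total and http_request_duration_seconds_bucket, and the service and status labels, are assumptions about how a service might be instrumented, not names taken from this article; only rate(), sum(), and histogram_quantile() are standard PromQL functions.

```python
def red_queries(service: str, window: str = "5m") -> dict:
    """Build the three RED PromQL queries for one service.

    Metric and label names are hypothetical; adjust them to your instrumentation.
    """
    selector = f'service="{service}"'
    requests = f"sum(rate(http_requests_total{{{selector}}}[{window}]))"
    errors = f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
    buckets = f"rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])"
    return {
        # Rate: requests per second, summed across instances.
        "rate": requests,
        # Errors: share of requests that returned a 5xx status.
        "errors": f"{errors} / {requests}",
        # Duration: 99th percentile estimated from histogram buckets.
        "duration_p99": f"histogram_quantile(0.99, sum by (le) ({buckets}))",
    }


for name, query in red_queries("checkout").items():
    print(f"{name}: {query}")
```

Calling red_queries() with a different service name yields the same three panels for that service, which is the uniform dashboard structure the method relies on.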
RED as the basis for SLOs
The Errors and Duration metrics map almost directly onto the SLI definitions covered in this site’s SLO article. A service instrumented with RED already has the raw data needed to define availability and latency SLOs without additional instrumentation work. That makes RED a practical, low-effort starting point for organizations that haven’t yet formalized SLOs but want a consistent baseline first.
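As a rough illustration of how RED data feeds SLO calculations, the sketch below derives an availability SLI, a latency SLI, and the remaining error budget from request counts. The function names, the 99.9% target, the 0.3-second threshold, and all counts are hypothetical values chosen for the example:

```python
def availability_sli(total: int, failed: int) -> float:
    """Fraction of requests that succeeded (from the Errors metric)."""
    return 1.0 if total == 0 else (total - failed) / total


def latency_sli(cumulative_buckets: dict, threshold_s: float) -> float:
    """Fraction of requests at or under threshold_s (from the Duration histogram).

    cumulative_buckets maps an upper bound in seconds to the cumulative request count,
    with float("inf") holding the total, mirroring Prometheus-style buckets.
    """
    total = cumulative_buckets[float("inf")]
    return 1.0 if total == 0 else cumulative_buckets[threshold_s] / total


def error_budget_remaining(total: int, failed: int, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the budget is exceeded."""
    allowed_failures = (1 - slo_target) * total
    return 0.0 if allowed_failures == 0 else 1 - failed / allowed_failures


total, failed = 1_000_000, 400
buckets = {0.1: 900_000, 0.3: 990_000, float("inf"): 1_000_000}

print(f"availability SLI: {availability_sli(total, failed):.4f}")
print(f"latency SLI (<= 0.3 s): {latency_sli(buckets, 0.3):.2f}")
print(f"error budget remaining at a 99.9% target: {error_budget_remaining(total, failed, 0.999):.0%}")
```

Every input here is something a RED-instrumented service already exports, so defining the SLO is a matter of choosing targets rather than adding new instrumentation.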
Takeaway: the RED method’s value lies in its deliberate minimalism and consistency. Three metrics, instrumented the same way everywhere, together cover what a client experiences and provide the raw data most SLOs are built from.