Monitoring vs Observability: A Practical Distinction
What actually separates monitoring from observability beyond the buzzword, and why the distinction matters for debugging unknown failure modes.
“Observability” became a popular replacement for “monitoring” largely through marketing, which has made the underlying distinction harder to see. There is still a real, useful difference between the two, and it is more than a rebrand.
Monitoring: answering known questions
Monitoring, in the traditional sense, is built around predefined metrics and dashboards answering questions you already knew to ask: CPU usage, request rate, error rate, a specific business metric. It’s excellent for known failure modes — you set a threshold, you get paged when it’s crossed. Its limitation is exactly that it requires you to have anticipated the question in advance; an entirely novel failure mode that doesn’t trip any predefined metric can go undetected by monitoring alone.
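As a minimal sketch of that threshold model, the Python below evaluates a fixed set of rules against a metrics snapshot. The rule names, thresholds, and metric values are hypothetical and chosen only for illustration; real alerting systems add evaluation windows, deduplication, and routing.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """A predefined question: is this metric above this value?"""
    metric: str
    threshold: float


def evaluate(rules, metrics):
    """Return the names of metrics whose current value exceeds their threshold."""
    return [r.metric for r in rules if metrics.get(r.metric, 0.0) > r.threshold]


# Hypothetical rules, written in advance for failure modes we already know about.
rules = [ThresholdRule("cpu_percent", 90.0), ThresholdRule("error_rate", 0.05)]

# Known failure mode: the error rate crosses its threshold, so an alert fires.
known = evaluate(rules, {"cpu_percent": 40.0, "error_rate": 0.12})

# Novel failure mode (say, one customer receiving wrong data with HTTP 200):
# no predefined metric moves, so nothing fires.
novel = evaluate(rules, {"cpu_percent": 41.0, "error_rate": 0.01})

print(known)
print(novel)
```

The second snapshot is the limitation in miniature: the system is misbehaving, but none of the anticipated questions covers it.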
Observability: answering questions you didn’t anticipate
Observability, as the term is used meaningfully (versus as marketing), is the property of a system being explorable after the fact. It means having enough rich, high-cardinality, high-dimensionality data (detailed traces, structured logs with many attributes) that you can ask a question you didn’t anticipate when the system was instrumented and actually get an answer, without shipping new code and waiting for the next occurrence.
The cardinality test
A practical way to tell whether you have real observability versus just monitoring: can you ask “show me all requests from customer X, on app version Y, that hit error Z, broken down by which database shard they used” — a highly specific, multi-dimensional, ad hoc query — and get an answer from your existing instrumentation? If your tooling can only answer pre-aggregated questions (average latency by endpoint, error count by service) and can’t slice arbitrarily by high-cardinality dimensions like customer ID or individual request ID, you have monitoring, not observability, regardless of what the tool’s marketing calls itself.
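The cardinality test can be sketched in a few lines of Python, assuming each request was recorded as one wide event with its attributes kept intact. The field names (`customer_id`, `app_version`, `error`, `db_shard`), the sample values, and the `breakdown` helper are all hypothetical; a real system would run the equivalent query against a trace or event store rather than an in-memory list.

```python
from collections import Counter

# Hypothetical wide events: one record per request, high-cardinality fields preserved.
events = [
    {"request_id": "r1", "customer_id": "cust-42", "app_version": "2.3.1", "error": "TIMEOUT", "db_shard": "shard-a"},
    {"request_id": "r2", "customer_id": "cust-42", "app_version": "2.3.1", "error": "TIMEOUT", "db_shard": "shard-b"},
    {"request_id": "r3", "customer_id": "cust-42", "app_version": "2.3.1", "error": "TIMEOUT", "db_shard": "shard-a"},
    {"request_id": "r4", "customer_id": "cust-42", "app_version": "2.3.0", "error": "TIMEOUT", "db_shard": "shard-a"},
    {"request_id": "r5", "customer_id": "cust-7", "app_version": "2.3.1", "error": "TIMEOUT", "db_shard": "shard-a"},
    {"request_id": "r6", "customer_id": "cust-42", "app_version": "2.3.1", "error": None, "db_shard": "shard-b"},
]


def breakdown(events, group_by, **filters):
    """Count events matching every filter, grouped by an arbitrary attribute."""
    matching = (e for e in events if all(e.get(k) == v for k, v in filters.items()))
    return Counter(e[group_by] for e in matching)


# The ad hoc question: customer X, version Y, error Z, broken down by shard.
result = breakdown(
    events,
    "db_shard",
    customer_id="cust-42",
    app_version="2.3.1",
    error="TIMEOUT",
)
print(result)
```

Nothing about this query had to be decided at instrumentation time. Any attribute can serve as a filter or as the grouping key, which is exactly what pre-aggregated data cannot offer.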
The three pillars, and why they’re not equally important
The commonly cited “three pillars” (metrics, logs, and traces) tend to be treated as equally weighted, but distributed traces with rich, high-cardinality attributes are usually what provides the “ask a question you didn’t anticipate” capability. Metrics are typically pre-aggregated, which loses the per-request detail needed for novel exploration, and traditional free-text logs are often hard to query flexibly at scale. Both remain useful for their own purposes.
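To illustrate how pre-aggregation discards per-request detail, here is a small Python sketch with hypothetical request records and made-up latency numbers. It compares what survives in an average-by-endpoint metric with what the raw events can still answer.

```python
from statistics import mean

# Hypothetical per-request records, as they exist before any aggregation.
requests = [
    {"endpoint": "/checkout", "customer_id": "cust-1", "latency_ms": 100},
    {"endpoint": "/checkout", "customer_id": "cust-2", "latency_ms": 100},
    {"endpoint": "/checkout", "customer_id": "cust-3", "latency_ms": 100},
    {"endpoint": "/checkout", "customer_id": "cust-4", "latency_ms": 1700},
]

# What a pre-aggregated metric stores: one number per endpoint.
avg_by_endpoint = {
    endpoint: mean(r["latency_ms"] for r in requests if r["endpoint"] == endpoint)
    for endpoint in {r["endpoint"] for r in requests}
}

# What only the raw events can answer: which customer is actually affected?
slowest = max(requests, key=lambda r: r["latency_ms"])["customer_id"]

print(avg_by_endpoint)
print(slowest)
```

The average shows that `/checkout` is slow, but once only the aggregate is kept, the fact that a single customer accounts for the entire regression is gone.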
OpenTelemetry as the unifying instrumentation layer
OpenTelemetry (covered in more depth in this site’s dedicated article) has become the standard way to instrument applications consistently across all three pillars. Vendor-neutral, consistent instrumentation is a prerequisite for flexible, ad hoc exploration later: instrumentation tightly coupled to one vendor’s proprietary format limits you to the kinds of exploration that vendor’s tooling anticipated.
You need both, not one instead of the other
Observability doesn’t replace monitoring’s value — predefined dashboards and threshold-based alerting remain the right tool for known, well-understood failure modes, since they’re faster to act on than ad hoc exploration. Observability is what you reach for when an alert fires for a reason nobody anticipated, or when investigating a genuinely novel incident — the two are complementary, not competing, approaches.
Takeaway: the real distinction isn’t “old buzzword vs. new buzzword” — it’s “can you only answer questions you anticipated in advance, or can you explore and answer genuinely new questions after the fact.” Both capabilities matter, for different situations.