Observability

OpenTelemetry, Prometheus, Grafana, Datadog, Dynatrace, New Relic.

A practical introduction to OpenTelemetry's traces, metrics, and logs, and how to instrument a service for meaningful performance analysis.

How Prometheus's pull-based metrics model and PromQL work, and how to build Grafana dashboards that actually answer performance questions.

How the RED method gives a simple, consistent framework for monitoring any request-driven service, and how it complements the USE method.
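As a rough illustration of what RED measures, the sketch below computes rate, error ratio, and a duration percentile from a batch of request records. The `Request` type, the `red_summary` helper, and the choice to count only 5xx statuses as errors are assumptions made for this example. In practice these numbers usually come from a metrics backend rather than from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # wall-clock time spent serving the request
    status: int        # HTTP-style status code


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarise a window of requests as Rate, Errors, Duration."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float edge cases.
    rank = (95 * n + 99) // 100
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n,
        "p95_s": durations[rank - 1],
    }
```

A percentile is used for duration instead of a mean because a handful of slow requests can hide behind a healthy-looking average.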

How distributed tracing works under the hood (spans, trace context propagation, and sampling strategies), explained from first principles.
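To make trace context propagation concrete, here is a minimal sketch built on the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`). The helper names are hypothetical. A real service would normally let a tracing SDK create and parse this header rather than doing it by hand.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"  # lowest bit marks the trace as sampled
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call made while handling `incoming`.

    The trace ID and sampling flags are carried over unchanged, and only the
    span ID is replaced, which is what links spans across services.
    """
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because the sampling flag travels with the header, a decision made at the edge of the system can be honoured by every downstream service.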

Why structured logging (key-value fields, not free text) matters for debugging at scale, and practical conventions worth adopting.
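One possible shape for key-value logging, sketched with only the Python standard library. The `JsonFormatter` class, the `make_logger` helper, and the `fields` key passed through `extra=` are conventions invented for this example, not an established standard.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream=sys.stderr) -> logging.Logger:
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


log = make_logger("checkout")
log.info("payment_failed", extra={"fields": {"order_id": "o-123", "latency_ms": 412}})
```

With fields such as `order_id` emitted as separate keys, a log backend can filter and aggregate on them directly instead of pattern-matching free text.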

How Brendan Gregg's USE method systematically checks system resources for performance bottlenecks, and how it pairs with the RED method.

A practical comparison of how Datadog, Dynatrace, and New Relic approach instrumentation, AI-assisted root-cause analysis, and pricing.

How to design an SLO dashboard that informs the ship/freeze decisions error budgets are meant to enable, rather than just displaying pretty graphs.
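The arithmetic behind an error budget is small enough to sketch. The example assumes a request-based SLO (good requests divided by total requests over a window). The function name and the sample numbers are illustrative only.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means no budget has been used, 0.0 means it is exhausted, and a
    negative value means the SLO has already been missed for this window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# A 99.9% target over 1,000,000 requests allows about 1,000 failures,
# so 250 failures leave roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
print(f"{remaining:.0%} of the error budget remains")
```

A dashboard built around this one number, and how fast it is shrinking, maps more directly onto a ship/freeze decision than a wall of raw latency and error graphs.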

How synthetic monitoring and real user monitoring complement each other for understanding production performance, and when to rely on each.