Distributed Tracing Explained: Spans, Context, and Sampling

How distributed tracing actually works under the hood — spans, trace context propagation, and sampling strategies — explained from first principles.

By perf-test.com Editorial · AI-assisted
Tags: distributed-tracing, observability, microservices

Distributed tracing reconstructs a single request’s full path across multiple services into one coherent timeline. Understanding the underlying mechanics (covered at a higher level in this site’s OpenTelemetry article) helps when debugging tracing gaps or designing instrumentation for a new service.

Spans: the basic unit

A span represents one unit of work — typically one service’s handling of one operation (an HTTP request, a database query, a function call). Each span has a start time, duration, a name, a set of key-value attributes, and a reference to its parent span (if any) — the parent/child relationships across all spans in a request form the trace’s tree structure.
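As a rough illustration of the fields listed above, a span can be modeled as a small record. This is a minimal sketch, not any SDK's actual type: the class name `Span`, its field names, and the millisecond units are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record; real SDKs carry more fields."""
    name: str
    start_ms: int                      # start time (illustrative unit: ms)
    duration_ms: int                   # how long the unit of work took
    attributes: dict = field(default_factory=dict)
    parent: Optional["Span"] = None    # None marks the root of the trace

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms

    def depth(self) -> int:
        """Number of ancestors between this span and the root."""
        depth, node = 0, self.parent
        while node is not None:
            depth += 1
            node = node.parent
        return depth


# One service handling an HTTP request, with a database query as a child span.
root = Span("GET /checkout", start_ms=0, duration_ms=120)
query = Span("SELECT orders", start_ms=15, duration_ms=40,
             attributes={"db.table": "orders"}, parent=root)
```

Following the `parent` references from any span leads back to the root, which is what gives a trace its tree shape.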

Trace ID and span ID: linking everything together

Every span carries a shared trace ID (identifying which overall request/trace it belongs to) and its own unique span ID, plus a reference to its parent span ID. A tracing backend reconstructs the full trace tree by collecting all spans sharing a trace ID and assembling them according to their parent/child references — this is the actual mechanism behind what looks like a unified “waterfall” view in tracing UIs.
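The reconstruction step described above can be sketched in a few lines. This is an assumption-laden toy rather than how any particular backend is implemented: the span dictionaries, their key names, and the functions `build_trace` and `render` are hypothetical.

```python
from collections import defaultdict

# Hypothetical flat span records, arriving in arbitrary order from several services.
spans = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "name": "SELECT orders"},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "name": "GET /health"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "orders-service"},
]


def build_trace(spans, trace_id):
    """Collect the spans sharing trace_id and index them by parent span ID."""
    in_trace = [s for s in spans if s["trace_id"] == trace_id]
    known_ids = {s["span_id"] for s in in_trace}
    children = defaultdict(list)
    roots = []
    for s in in_trace:
        if s["parent_id"] in known_ids:
            children[s["parent_id"]].append(s)
        else:
            # Either the true root, or an orphan whose parent span never arrived.
            roots.append(s)
    return roots, children


def render(roots, children, depth=0):
    """Return the tree as indented lines, a text stand-in for a waterfall view."""
    lines = []
    for s in roots:
        lines.append("  " * depth + s["name"])
        lines.extend(render(children[s["span_id"]], children, depth + 1))
    return lines


roots, children = build_trace(spans, "t1")
print("\n".join(render(roots, children)))
```

Treating a span with an unknown parent as an extra root is one possible design choice; it keeps partially collected traces visible instead of dropping them.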

Context propagation: how the trace ID crosses service boundaries

For a trace to remain connected across an inter-service call (Service A calls Service B over HTTP), Service A must include the current trace ID and its own span ID in the outgoing request. That span ID becomes the parent of whatever span Service B creates. The context typically travels as HTTP headers, standardized by the W3C Trace Context specification (the traceparent header). If this propagation step is missing or broken on any call path, the trace silently fragments into disconnected pieces at that boundary. This is one of the most common real-world tracing gaps, especially on less common communication paths: message queues, gRPC with custom interceptors, and third-party SDKs that don’t propagate headers automatically.
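To make the header mechanics concrete, here is a minimal sketch of writing and reading a traceparent value by hand. The layout `version-traceid-parentid-flags` comes from the W3C Trace Context specification; the helper names `inject`, `extract`, and `new_id` are hypothetical. The sketch accepts only version 00 and assumes lowercase header keys in a plain dict, both simplifications. In practice a tracing library would normally do this for you.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_id(n_bytes):
    """Random lowercase hex ID: 16 bytes for a trace ID, 8 for a span ID."""
    return secrets.token_hex(n_bytes)


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: attach the current trace context to outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: return the caller's context, or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":                      # simplification: version 00 only
        return None
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None                          # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A starts a span and calls Service B.
trace_id, span_a = new_id(16), new_id(8)
outgoing = inject({}, trace_id, span_a)

# Service B continues the same trace, with A's span as the parent.
ctx = extract(outgoing)
span_b = {"trace_id": ctx["trace_id"], "span_id": new_id(8),
          "parent_id": ctx["parent_span_id"]}
```

When `extract` returns None, the callee has no choice but to start a fresh trace, which is exactly the fragmentation described above.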

Span attributes: what makes a trace actually useful

A span with just a name and duration tells you that something took time; span attributes (key-value metadata — a specific customer ID, a query’s table name, an HTTP status code, a cache hit/miss flag) are what let you later ask the high-cardinality, ad hoc questions covered in this site’s monitoring vs observability article. Deciding what attributes to attach during manual instrumentation is one of the highest-leverage instrumentation design decisions — generic spans with minimal attributes provide much less diagnostic value than the same spans enriched with the specific details you’ll actually want to filter and group by later.
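A small sketch of why attributes matter: once spans carry them, an ad hoc question such as "which customers saw slow cart requests on a cache miss?" reduces to a filter and a group-by. The span records, the attribute keys (`customer.id`, `cache.hit`), and the helper `group_durations` are all hypothetical and stand in for whatever query language a tracing backend offers.

```python
from collections import defaultdict

# Hypothetical spans; the last one was emitted without any attributes.
spans = [
    {"name": "GET /cart", "duration_ms": 900,
     "attributes": {"customer.id": "c-42", "cache.hit": False}},
    {"name": "GET /cart", "duration_ms": 35,
     "attributes": {"customer.id": "c-42", "cache.hit": True}},
    {"name": "GET /cart", "duration_ms": 850,
     "attributes": {"customer.id": "c-7", "cache.hit": False}},
    {"name": "GET /cart", "duration_ms": 40, "attributes": {}},
]


def group_durations(spans, key, where=lambda attrs: True):
    """Group span durations by one attribute, keeping spans that pass `where`."""
    groups = defaultdict(list)
    for span in spans:
        attrs = span["attributes"]
        if key in attrs and where(attrs):
            groups[attrs[key]].append(span["duration_ms"])
    return dict(groups)


# Cache misses only, grouped per customer.
misses = group_durations(spans, "customer.id",
                         where=lambda a: a.get("cache.hit") is False)
print(misses)
```

The span without attributes cannot take part in either the filter or the grouping, which is the practical cost of generic, unenriched spans.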

Sampling strategies and their trade-offs

  • Head-based sampling: the decision to keep or discard a trace is made at the very start (often randomly, at a configured rate), before the trace’s outcome is known. This is simple and cheap, but it can discard the rare slow or failed traces that matter most diagnostically, purely by chance.
  • Tail-based sampling: the decision is deferred until the full trace completes, which allows rules like “always keep traces with an error” or “always keep traces slower than X”. This is much better at preserving diagnostically valuable traces, but span data must be buffered until the trace completes, which makes it more complex and resource-intensive to implement (typically handled at the Collector level, as covered in this site’s OpenTelemetry article).
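The difference between the two strategies shows up clearly in what each decision function is allowed to look at. The sketch below is illustrative only: the function names, the span dictionary keys, and the thresholds are assumptions, and a production tail sampler also has to handle buffering, timeouts, and spans that arrive late.

```python
import random


def head_sample(rate, rng=random):
    """Head-based: decide at the start, with no knowledge of the outcome."""
    return rng.random() < rate


def tail_sample(trace_spans, slow_ms, baseline_rate, rng=random):
    """Tail-based: decide once every span of the trace has been buffered."""
    # Rule 1: always keep traces containing an error.
    if any(span.get("error") for span in trace_spans):
        return True
    # Rule 2: always keep traces slower than the threshold.
    start = min(span["start_ms"] for span in trace_spans)
    end = max(span["end_ms"] for span in trace_spans)
    if end - start > slow_ms:
        return True
    # Otherwise keep a small random share of ordinary traces.
    return rng.random() < baseline_rate


failed = [{"start_ms": 0, "end_ms": 20, "error": True}]
slow = [{"start_ms": 0, "end_ms": 1500}, {"start_ms": 10, "end_ms": 1400}]
fast = [{"start_ms": 0, "end_ms": 30}]

for name, trace in [("failed", failed), ("slow", slow), ("fast", fast)]:
    print(name, tail_sample(trace, slow_ms=1000, baseline_rate=0.0))
```

`head_sample` takes no trace data at all, so at a 1% rate a failed request has the same 1% chance of being kept as any other. `tail_sample` needs the complete list of spans, which is where the buffering cost comes from.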

Trace context across asynchronous boundaries

Asynchronous processing (a message placed on a queue, processed later by a different process) breaks the simple synchronous parent-child span model. Propagating trace context across these boundaries means embedding the trace ID and the producer’s span ID in the queued message itself and having the consumer create a child span that references them. This requires deliberate instrumentation effort, and it is a common place where tracing coverage quietly degrades if not specifically addressed.
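One way to picture this is a message envelope that carries the trace context next to the payload. The sketch below uses an in-process `queue.Queue` as a stand-in for a real broker; the envelope layout and the `publish` and `consume` helpers are hypothetical, and real messaging systems often carry this context in message headers or metadata instead of the body.

```python
import json
import queue
import secrets


def publish(q, payload, trace_id, span_id):
    """Producer: wrap the payload with the current trace context."""
    envelope = {
        "trace": {"trace_id": trace_id, "parent_span_id": span_id},
        "payload": payload,
    }
    q.put(json.dumps(envelope))


def consume(q):
    """Consumer: create a span that continues the producer's trace if possible."""
    envelope = json.loads(q.get())
    ctx = envelope.get("trace")
    if ctx:
        span = {"trace_id": ctx["trace_id"], "parent_id": ctx["parent_span_id"]}
    else:
        # No context in the message: the consumer's work becomes a new, disconnected trace.
        span = {"trace_id": secrets.token_hex(16), "parent_id": None}
    span["span_id"] = secrets.token_hex(8)
    span["name"] = "process message"
    return span, envelope["payload"]


q = queue.Queue()
producer_trace, producer_span = secrets.token_hex(16), secrets.token_hex(8)
publish(q, {"order_id": 123}, producer_trace, producer_span)

consumer_span, payload = consume(q)
print(consumer_span["trace_id"] == producer_trace)
```

The fallback branch in `consume` is the quiet degradation mentioned above: nothing fails, but the consumer's spans no longer join the original trace.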

Takeaway: distributed tracing’s diagnostic value comes from the combination of reliable context propagation (so traces don’t fragment) and rich span attributes (so traces support genuinely ad hoc exploration) — both require deliberate instrumentation effort beyond whatever a framework’s automatic instrumentation provides by default.
