Prompt Caching and KV Cache: Why Repeated Context Gets Cheaper

How prompt/KV caching reduces cost and latency for repeated context in LLM applications, and when it actually helps versus doesn't.

By perf-test.com Editorial · AI-assisted
llm · caching · kv-cache · cost

If your application repeatedly sends the same long system prompt, few-shot examples, or document context with only the final user message changing, prompt caching can meaningfully cut both cost and time-to-first-token. Predicting when it will and won’t help requires understanding what is actually being cached, and where its limits are.

What gets cached: the KV cache, not the output

LLM inference’s expensive “prefill” phase computes attention key/value (KV) tensors for every token in the input. Prompt caching stores these KV tensors for a previously seen prefix, so a subsequent request sharing that exact prefix can skip recomputing it — reusing the cached KV state and only running prefill on the new, different suffix (typically the user’s specific message). This is purely a prefill-phase optimization; it doesn’t change the decode/generation phase at all.

Why it requires an exact prefix match

KV cache values are positionally and contextually specific: each token’s keys and values depend on every token before it. Caching therefore only works if the cached prefix matches the start of the new request token for token, which in practice means the text must be identical down to whitespace. Any change invalidates everything from the point of divergence onward, so a single-character edit near the top of a system prompt, or a reordered few-shot example, can wipe out nearly all of the reuse for that request. This is why prompt caching is most valuable for applications with a large, genuinely static shared prefix (a fixed system prompt, a fixed long document being repeatedly queried) rather than for highly variable prompts.
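As a rough illustration of the prefix rule, the sketch below counts how many leading tokens two requests share. The token IDs are made up and stand in for a real tokenizer’s output; `shared_prefix_len` is a hypothetical helper, not part of any provider’s API.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Toy token IDs (assumption: a real system would get these from its tokenizer).
system = [101, 7, 7, 42, 9, 13]

req_a = system + [55, 56]              # system prompt + user message A
req_b = system + [77]                  # same system prompt + user message B
req_c = [101, 8] + system[2:] + [77]   # system prompt edited at position 1

reusable_ab = shared_prefix_len(req_a, req_b)  # whole system prompt is reusable
reusable_ac = shared_prefix_len(req_a, req_c)  # only the token before the edit

print(reusable_ab, reusable_ac)
```

Only the tokens before the first difference count, which is why an edit near the start of the prompt is far more damaging than one near the end.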

Provider-side automatic caching vs explicit cache control

Some hosted LLM APIs cache automatically based on detecting repeated prefixes across requests (often with a cache lifetime measured in minutes), while others require explicit cache breakpoints/markers in the request to control what gets cached and reused. Check your specific provider’s documentation — the cost and latency savings only materialize if requests are actually structured to hit the cache, and providers differ meaningfully in how automatic versus explicit this is.
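To make the cache-lifetime point concrete, here is a toy simulation of a time-limited prefix cache. The 300-second lifetime, the refresh-on-use behavior, and the `TtlPrefixCache` class are all assumptions for illustration; real providers differ, so treat this as a way to reason about request spacing rather than a model of any specific API.

```python
from typing import Dict


class TtlPrefixCache:
    """Toy model of a provider-side prefix cache with a fixed lifetime."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._expires_at: Dict[str, float] = {}

    def request(self, prefix_key: str, now_s: float) -> bool:
        """Return True on a cache hit; write or refresh the entry either way."""
        hit = self._expires_at.get(prefix_key, float("-inf")) > now_s
        # Assumption: every use (hit or miss) restarts the lifetime.
        self._expires_at[prefix_key] = now_s + self.ttl_s
        return hit


cache = TtlPrefixCache(ttl_s=300)      # hypothetical 5-minute lifetime
arrivals_s = [0, 60, 200, 600, 650]    # request times in seconds
hits = [cache.request("system-v1", t) for t in arrivals_s]

print(hits)
```

Under these assumptions, the gap between the third and fourth request is long enough for the entry to expire, so steady traffic hits the cache while sparse traffic keeps paying for a cold prefill.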

The latency benefit is concentrated in TTFT

Since caching only affects the prefill phase, its latency benefit shows up specifically in Time To First Token (TTFT). A request with a long cached prefix and a short new suffix can see dramatically lower TTFT than one that processes the full prompt from scratch. Decode-phase latency for generating the output tokens (TPOT, covered in this site’s LLM inference metrics article) is unaffected either way.
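A back-of-the-envelope model can help set expectations. The sketch below assumes TTFT is a fixed overhead plus prefill time that scales linearly with the number of uncached tokens. That is a simplification (real prefill cost is not perfectly linear, and new tokens still attend over the cached ones), and the throughput and overhead figures are invented for illustration.

```python
def estimate_ttft_s(
    prompt_tokens: int,
    cached_tokens: int,
    prefill_tok_per_s: float,
    overhead_s: float = 0.0,
) -> float:
    """Estimate TTFT under a linear prefill model (a deliberate simplification)."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return overhead_s + uncached / prefill_tok_per_s


# Hypothetical numbers: 20k-token prompt, 10k tokens/s prefill, 50 ms overhead.
cold_ttft = estimate_ttft_s(20_000, 0, 10_000, overhead_s=0.05)
warm_ttft = estimate_ttft_s(20_000, 19_800, 10_000, overhead_s=0.05)

print(f"cold: {cold_ttft:.2f}s  warm: {warm_ttft:.2f}s")
```

Measure on your own stack before relying on figures like these; the model is only meant to show that the saving scales with the cached share of the prompt.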

Cost implications

Providers offering prompt caching typically bill cached input tokens at a significantly reduced rate versus fresh input tokens, and some charge a premium on the request that first writes the cache. For applications with large repeated contexts (a long document being queried many times, an agent with an extensive fixed system prompt and tool definitions), this can be one of the largest concrete cost levers available. For the input-token line of the bill specifically, it can sometimes be a bigger win than model selection itself.
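The arithmetic is simple enough to sketch. The price and the cached-token multiplier below are placeholders, not any provider’s actual rates, and the function ignores cache-write surcharges and output tokens; substitute the numbers from your provider’s pricing page.

```python
def input_cost_usd(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_mtok: float,
    cached_multiplier: float,
) -> float:
    """Input-token cost when cached tokens are billed at a fraction of the base price."""
    fresh_tokens = prompt_tokens - cached_tokens
    billable = fresh_tokens + cached_tokens * cached_multiplier
    return billable * price_per_mtok / 1_000_000


# Hypothetical pricing: 2.00 USD per million input tokens, cached tokens at 25%.
PRICE = 2.00
CACHED_MULTIPLIER = 0.25

cost_uncached = input_cost_usd(10_000, 0, PRICE, CACHED_MULTIPLIER)
cost_cached = input_cost_usd(10_000, 9_500, PRICE, CACHED_MULTIPLIER)

print(f"uncached: ${cost_uncached:.5f}  cached: ${cost_cached:.5f}")
```

With these made-up rates, a request whose prompt is 95% cached costs a bit under a third of the uncached input price; the actual ratio depends entirely on the provider’s discount and your hit rate.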

Self-hosted serving and KV cache reuse

For self-hosted inference (vLLM, TGI, and similar), the same underlying KV cache reuse concept applies. It is closely related to the continuous batching and PagedAttention techniques covered in this site’s continuous batching article. Efficient KV cache management, across both prompt caching and dynamic batch composition, is a major engineering focus of modern inference servers precisely because it affects cost and latency so directly.
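One common way to implement prefix reuse on top of a paged KV cache is to hash fixed-size blocks of tokens, chaining each block’s hash with the previous one so that a hash identifies the entire prefix up to that block. The sketch below illustrates that idea only; it is not the actual implementation of vLLM, TGI, or any other server, and `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # tokens per KV block (assumption; real servers use larger blocks)


def block_hashes(tokens: Sequence[int]) -> List[str]:
    """Hash each full block, chained with the previous block's hash."""
    hashes: List[str] = []
    prev = ""
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full_len, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Maps chained block hashes to (placeholder) KV blocks."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        for h in block_hashes(tokens):
            self._blocks.setdefault(h, "kv-placeholder")  # real code would store tensors

    def lookup(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens already have cached KV blocks."""
        hit_tokens = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break
            hit_tokens += BLOCK_SIZE
        return hit_tokens


cache = PrefixCache()
cache.insert(list(range(10)))  # caches blocks [0..3] and [4..7]; 8, 9 form a partial block

same_prefix = cache.lookup(list(range(8)) + [99, 100, 101, 102])
diverges_mid = cache.lookup([0, 1, 2, 3, 9, 9, 9, 9])
diverges_first = cache.lookup([5, 1, 2, 3, 4, 5, 6, 7])

print(same_prefix, diverges_mid, diverges_first)
```

Because each hash folds in its predecessor, a change in the first block changes every later hash too, which reproduces the prefix-only reuse described earlier. It also shows a side effect of block granularity: tokens in a trailing partial block are not reused.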

When caching won’t help

Highly dynamic, mostly-unique prompts (no significant shared prefix across requests) see little to no benefit. The same goes for workloads dominated by long outputs, since the decode phase is untouched. Don’t expect caching to meaningfully improve cost or latency for workloads structurally lacking the repeated-prefix pattern it depends on.

Takeaway: prompt caching is a prefill-specific optimization that pays off specifically for applications with large, genuinely static shared context — structuring prompts to maximize a stable, reusable prefix (fixed instructions first, variable content last) is the practical lever for actually capturing the benefit.
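The ordering advice can be checked with a small experiment. The sketch below builds the same prompt two ways, with a per-request timestamp placed either after or before the fixed instructions, and compares how much text two consecutive requests share. The prompt contents and function names are invented, and character-level overlap is used only as a stand-in for the token-level prefix a real cache would match on.

```python
from os.path import commonprefix  # works on any list of strings, not just paths

SYSTEM = "You are a support assistant for ExampleCo."
EXAMPLES = ["Q: How do I reset my password?\nA: Use the account page."]


def build_prompt(user_message: str, request_time: str) -> str:
    """Fixed instructions first, per-request content last."""
    parts = [SYSTEM, *EXAMPLES, f"Current time: {request_time}", f"User: {user_message}"]
    return "\n\n".join(parts)


def build_prompt_unstable(user_message: str, request_time: str) -> str:
    """Anti-pattern: a per-request value at the very start of the prompt."""
    parts = [f"Current time: {request_time}", SYSTEM, *EXAMPLES, f"User: {user_message}"]
    return "\n\n".join(parts)


stable_overlap = commonprefix(
    [build_prompt("Where is my order?", "10:00"), build_prompt("Cancel my plan", "10:05")]
)
unstable_overlap = commonprefix(
    [
        build_prompt_unstable("Where is my order?", "10:00"),
        build_prompt_unstable("Cancel my plan", "10:05"),
    ]
)

print(len(stable_overlap), len(unstable_overlap))
```

In the first layout the instructions and examples are shared in full; in the second, the overlap ends inside the timestamp, so nothing after it could be reused.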
