Continuous Batching: How Modern LLM Servers Achieve High Throughput

How continuous batching differs from static batching, why it's central to vLLM and TGI's throughput advantage, and what it costs individual requests.

By perf-test.com Editorial · AI-assisted
Tags: llm, inference, batching, vllm

Static batching is the traditional approach: group a fixed set of requests, run them through the model together, and wait for the entire batch to finish before starting the next. For LLM inference this wastes a large share of GPU capacity, because requests in a batch finish generating at very different times depending on output length. Continuous batching fixes this, and is the single biggest reason modern inference servers (vLLM, TGI, and similar) achieve much higher throughput than naive batching implementations.

The problem with static batching

Suppose you batch 8 requests together, and one needs 500 output tokens while another needs only 20. The short request is done after 20 decode steps, but with static batching its GPU slot sits idle for the remaining 480 steps, because the batch composition is fixed until the longest request finishes. At scale, this wastes a large fraction of available compute.
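The waste in that example can be estimated with a few lines of arithmetic. In the sketch below, only the 500-token and 20-token lengths come from the example; the other six output lengths are made up for illustration, and every decode step is assumed to cost the same regardless of how many slots are doing useful work.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that produce a token when the whole batch
    runs until its longest request finishes."""
    steps = max(output_lengths)                # batch runs this many decode steps
    useful = sum(output_lengths)               # slot-steps that emit a real token
    total = steps * len(output_lengths)        # slot-steps the GPU pays for
    return useful / total


# 500 and 20 are from the example above; the rest are hypothetical.
lengths = [500, 20, 120, 60, 300, 45, 80, 210]
utilization = static_batch_utilization(lengths)
print(f"useful slot-steps: {utilization:.1%}")  # prints "useful slot-steps: 33.4%"
```

With this particular (invented) mix, roughly two thirds of the slot-steps are spent waiting on the longest request. The exact figure depends entirely on the spread of output lengths.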

How continuous batching works

Continuous batching (sometimes called “in-flight batching”) lets the server add new requests to a running batch as soon as any slot frees up. When the short request finishes, a waiting request takes its place in the very next forward pass, without waiting for the rest of the batch to complete. The batch composition changes continuously, request by request, rather than being fixed for an entire generation cycle.
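A small simulation shows the difference in scheduling. This is a sketch under simplifying assumptions, not how any particular server is implemented: it ignores prefill cost, treats every decode step as equally expensive, and assumes all requests are already queued. The function names and the example output lengths are hypothetical.

```python
from collections import deque


def static_steps(lengths, slots):
    """Decode steps needed when each batch of `slots` requests runs
    until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Decode steps needed when a freed slot is refilled before the
    next forward pass."""
    waiting = deque(lengths)
    running = []  # tokens still to generate, one entry per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: each running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [8, 2, 2, 2, 6, 1, 3, 4]  # hypothetical output lengths, in tokens
print("static:    ", static_steps(lengths, slots=4))
print("continuous:", continuous_steps(lengths, slots=4))
```

For this input the static schedule takes 14 decode steps (8 for the first batch, 6 for the second), while the continuous schedule takes 8, which is as low as it can go because the longest request alone needs 8 steps.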

Why this matters specifically for autoregressive generation

LLM text generation is inherently sequential: each new token depends on all previous tokens, so output is computed one decode step at a time. That creates a natural, frequent scheduling point. After every decode step, the server can check whether any request has finished and slot a new one in. Continuous batching exploits this fine-grained opportunity. Traditional ML batching, designed for non-autoregressive workloads where each request is a single forward pass, is coarser-grained and has no equivalent per-step boundary to exploit.

The trade-off: per-request latency variability

Continuous batching maximizes aggregate throughput, but an individual request’s inter-token latency (time per output token, or TPOT, covered in this site’s LLM inference metrics article) can vary with how many other requests are sharing the batch at any given moment. A request might decode quickly while the batch is lightly loaded and slow down as more requests join. This is the real-world form of the throughput-vs-per-request-latency trade-off covered elsewhere on this site: continuous batching explicitly optimizes for the throughput side.
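A toy cost model can make the shape of that trade-off concrete. The sketch below assumes a decode step costs a fixed amount plus a small increment per request in the batch. Both constants (20 ms and 0.5 ms) are invented for illustration and are not measurements of any server, model, or GPU; real step time is not necessarily linear in batch size.

```python
def step_time_ms(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Hypothetical time for one decode step, which is also the TPOT
    every request in the batch experiences for that step."""
    return fixed_ms + per_request_ms * batch_size


def tokens_per_second(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Aggregate throughput: one token per request per decode step."""
    return batch_size * 1000.0 / step_time_ms(batch_size, fixed_ms, per_request_ms)


for batch_size in (1, 8, 32, 64):
    print(
        f"batch={batch_size:>2}  "
        f"TPOT={step_time_ms(batch_size):5.1f} ms  "
        f"throughput={tokens_per_second(batch_size):7.1f} tok/s"
    )
```

Under these assumed constants, growing the batch raises aggregate throughput substantially while each individual request sees a longer gap between tokens. Where a real deployment sits on that curve has to be measured.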

KV cache management is the other half of the story

Continuous batching’s effectiveness depends heavily on efficient KV cache management. Each request’s attention key/value cache must be kept and reused across that request’s decode steps, and the memory holding it must not fragment excessively as requests join and leave the batch. This is the problem vLLM’s PagedAttention technique addresses. It treats KV cache allocation much as an OS treats virtual memory pages, handing out fixed-size blocks on demand, which avoids the fragmentation that naive contiguous KV cache allocation would suffer under constantly changing batch composition.
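The paging idea can be illustrated with a toy block allocator. This is a minimal sketch of the general concept only, not vLLM’s implementation or API: the class name, method names, and block size are all hypothetical, and it tracks block ownership without storing any actual key/value tensors.

```python
class PagedKVAllocator:
    """Toy allocator: each request owns a list of fixed-size blocks
    that need not be contiguous in physical memory."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}   # request id -> list of physical block ids
        self.token_counts = {}   # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new block only
        when the request's last block is full."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(request_id, []).append(block)
        self.token_counts[request_id] = count + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    allocator.append_token("a")   # 20 tokens -> 2 blocks
for _ in range(30):
    allocator.append_token("b")   # 30 tokens -> 2 blocks, pool now empty
blocks_of_a = set(allocator.block_tables["a"])

allocator.release("a")            # request "a" finishes and leaves the batch
for _ in range(17):
    allocator.append_token("c")   # a new request reuses the freed blocks
print(allocator.block_tables)
```

Because a request only ever holds whole blocks, any freed block can serve any later request, so requests joining and leaving never leave unusable gaps between allocations. The cost is at most one partially filled block per request.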

What this means for load testing LLM endpoints

This site’s article on measuring LLM inference performance recommends testing throughput and latency as a coupled function of concurrency, not independently. Continuous batching is precisely why: both the achievable throughput and the latency any individual request experiences depend on how many concurrent requests are in flight. Concurrency is therefore the variable a proper load test needs to sweep to find the actual operating curve.
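A concurrency sweep might be structured like the sketch below. It is a harness outline under stated assumptions, not a complete load-testing tool: `fake_request` is a hypothetical stand-in whose latency grows with the number of in-flight calls, and its token count and sleep times are made up. In a real test it would be replaced with a call to the endpoint under test.

```python
import asyncio
import time


async def fake_request(state):
    """Hypothetical stand-in for a real inference call. Latency grows
    with the number of requests currently in flight."""
    state["in_flight"] += 1
    try:
        await asyncio.sleep(0.01 + 0.001 * state["in_flight"])
        return 16  # made-up number of generated tokens
    finally:
        state["in_flight"] -= 1


async def run_level(concurrency, requests_per_worker, send):
    """Run `concurrency` workers in parallel, each sending requests
    back to back, and report throughput and mean latency together."""
    samples = []  # (latency_seconds, tokens) per completed request

    async def worker():
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            tokens = await send()
            samples.append((time.perf_counter() - start, tokens))

    started = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    elapsed = time.perf_counter() - started
    return {
        "concurrency": concurrency,
        "tokens_per_s": sum(t for _, t in samples) / elapsed,
        "mean_latency_s": sum(l for l, _ in samples) / len(samples),
    }


async def sweep(levels, send, requests_per_worker=5):
    """Measure each concurrency level in turn."""
    return [await run_level(c, requests_per_worker, send) for c in levels]


state = {"in_flight": 0}
results = asyncio.run(sweep([1, 4, 16], lambda: fake_request(state)))
for row in results:
    print(row)
```

Reporting throughput and latency in the same row for each concurrency level is the point of the design: plotted together, the rows trace the operating curve rather than two unrelated numbers. A real harness would also want percentiles rather than only a mean, plus warm-up and a longer run per level.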

Takeaway: continuous batching is what separates a production-grade LLM inference server from a naive implementation. It is the mechanism that lets aggregate throughput scale well with concurrency, at the cost of making individual-request latency depend on current system load rather than stay constant.
