Throughput vs Latency: Why You Usually Can't Maximize Both

Why throughput and latency often trade off against each other through batching, and how to decide where to sit on that trade-off curve.

By perf-test.com Editorial · AI-assisted
Tags: throughput, latency, concepts

Throughput (work completed per unit time) and latency (time for one unit of work to complete) sound like independent goals: raise one, lower the other. In many real systems, though, they trade off directly against each other. Understanding the mechanism behind that trade-off lets you choose deliberately where to sit on the curve rather than being surprised by it.

The core mechanism: batching

The most common reason for this trade-off is batching. Processing multiple units of work together is often more efficient per unit, which improves throughput. But a batch cannot start until it has accumulated, so each item waits for the batch to fill (or for a batch timeout to fire) before any processing begins, and that wait is added latency. This site’s continuous batching article covers this trade-off in the specific context of LLM inference serving: larger batches improve aggregate token throughput but can increase individual request latency under load.
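The fill-or-timeout rule described above can be sketched as a small simulation. This is an illustrative model, not any particular system's batcher: the function name `simulate_batcher` and its parameters are hypothetical, and it only measures the time items spend waiting for their batch to flush, not the processing time afterwards.

```python
def simulate_batcher(arrival_times, max_batch, timeout):
    """Return how long each item waits before its batch is flushed.

    Assumed rule (hypothetical, for illustration): a batch flushes when it
    holds max_batch items, or `timeout` after its first item arrived,
    whichever comes first. arrival_times must be sorted ascending.
    """
    waits = []
    batch = []  # arrival times of the items in the currently open batch
    for t in arrival_times:
        # If the open batch timed out before this arrival, flush it first.
        if batch and t >= batch[0] + timeout:
            flush_at = batch[0] + timeout
            waits.extend(flush_at - a for a in batch)
            batch = []
        batch.append(t)
        # A full batch flushes immediately on the arrival that fills it.
        if len(batch) == max_batch:
            waits.extend(t - a for a in batch)
            batch = []
    if batch:
        # The final partial batch flushes when its timeout fires.
        flush_at = batch[0] + timeout
        waits.extend(flush_at - a for a in batch)
    return waits


arrivals = [0, 1, 2, 3]  # one item per time unit
for size in (1, 2, 4):
    waits = simulate_batcher(arrivals, max_batch=size, timeout=10)
    print(size, waits, sum(waits) / len(waits))
```

With a steady arrival stream, the first item in each batch always waits longest, and the mean wait grows with the batch size. A batch size of 1 is the unbatched case, where no item waits at all.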

A simple example: database writes

A database that batches writes (accumulating several writes and committing them together) achieves higher write throughput per unit of disk I/O than one that commits each write individually. The cost is that any single write now waits for the batch to fill or for a timeout, which adds latency compared to an immediate, unbatched commit. Tuning batch size and batch timeout is directly tuning where you sit on the throughput/latency curve.
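A rough cost model makes the shape of that curve visible. The sketch below assumes each commit pays a fixed cost (for example, a disk flush) plus a small per-write cost, and that writes arrive evenly spaced. The function name and all the numbers are made up for illustration, not measurements from a real database.

```python
def batch_tradeoff(batch_size, arrival_rate, fixed_cost_s, per_item_cost_s):
    """Return (max_throughput_per_s, mean_fill_wait_s) for a fixed batch size.

    Assumptions (illustrative only): every commit costs fixed_cost_s once
    plus per_item_cost_s per write, and writes arrive evenly spaced at
    arrival_rate per second. Queueing behind earlier commits is ignored.
    """
    # The fixed cost is paid once per commit, so larger batches amortize it.
    commit_time_s = fixed_cost_s + batch_size * per_item_cost_s
    max_throughput = batch_size / commit_time_s
    # The first write in a batch waits for batch_size - 1 more arrivals and
    # the last waits for none, so the mean is half of that span.
    mean_fill_wait_s = (batch_size - 1) / (2 * arrival_rate)
    return max_throughput, mean_fill_wait_s


# Hypothetical numbers: 5 ms fixed commit cost, 0.1 ms per write, 150 writes/s.
for size in (1, 10, 100):
    throughput, wait = batch_tradeoff(size, 150.0, 0.005, 0.0001)
    print(f"batch={size:3d}  capacity={throughput:8.1f}/s  mean fill wait={wait * 1000:6.1f} ms")
```

Under these assumptions, commit capacity rises steeply with batch size because the fixed cost is shared across more writes, while the mean time a write spends waiting for its batch to fill rises in proportion to the batch size.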

Network-level batching: Nagle’s algorithm as a classic example

Nagle’s algorithm in TCP coalesces small outgoing writes: while previously sent data is still unacknowledged, it holds back small segments so they can be combined into fewer, larger packets. That improves network efficiency and throughput at the cost of added latency for small, latency-sensitive messages. It is a frequently cited real-world example of this trade-off, and the reason many latency-sensitive applications explicitly disable it with the TCP_NODELAY socket option despite its throughput benefits in other contexts.
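As a minimal sketch, this is how the TCP_NODELAY option can be set from Python's standard `socket` module. The helper name `make_low_latency_socket` is hypothetical. Whether disabling Nagle's algorithm actually helps depends on your message sizes and write pattern, so treat it as something to measure rather than a default.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value disables Nagle's algorithm, so small writes are sent
    # immediately instead of being held back for coalescing.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back to confirm it took effect.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
print("TCP_NODELAY enabled:", nodelay_enabled)
```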

Why you usually can’t simply have both at their individual best

If your system can increase throughput (by batching more aggressively, or by running more work concurrently) without affecting latency, that is not a trade-off at all. It is an unambiguous improvement, and you should make it. The trade-off becomes real when the same lever (batch size, concurrency level) improves throughput while worsening latency. That is the common case for resource-constrained systems operating near their capacity ceiling, as covered in this site’s queueing theory article.
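To illustrate why operating near the capacity ceiling matters, the sketch below uses the textbook M/M/1 queue (a single server with random arrivals and random service times), whose mean time in the system is 1 / (service rate − arrival rate). Real systems rarely match this model exactly, and the choice of M/M/1 is an assumption made here for illustration, so read the numbers as showing the shape of the curve only.

```python
def mm1_mean_latency_s(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue, in seconds.

    arrival_rate is the offered throughput and service_rate is the
    capacity ceiling, both in requests per second.
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrival_rate must be below service_rate for a stable queue")
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # hypothetical capacity: 100 requests per second
for utilization in (0.5, 0.9, 0.99):
    latency = mm1_mean_latency_s(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"utilization={utilization:.0%}  mean latency={latency * 1000:7.1f} ms")
```

In this model, pushing sustained throughput from 50% to 90% of capacity raises mean latency from about 20 ms to about 100 ms, and going to 99% raises it to about 1 s. The last few percent of throughput are the most expensive in latency terms.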

Deciding where to sit on the curve

The right point on the throughput/latency curve depends on what your use case needs. A batch analytics pipeline processing overnight reports should usually maximize throughput, since no human is waiting on any individual item’s latency. An interactive user-facing API should usually prioritize latency and accept somewhat lower aggregate throughput per server in exchange. It can then scale throughput by adding servers or capacity rather than by batching more aggressively at the cost of per-request latency.
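One way to make that decision concrete is to state a latency budget and pick the highest-throughput operating point that fits within it. The sketch below assumes you already have measured points for a few batch sizes. The function name and the sample numbers are hypothetical.

```python
import math


def pick_operating_point(points, latency_budget_s):
    """Return the highest-throughput point whose latency fits the budget.

    points is a list of (setting, throughput_per_s, latency_s) tuples.
    Returns None if no point fits the budget.
    """
    within_budget = [p for p in points if p[2] <= latency_budget_s]
    if not within_budget:
        return None
    return max(within_budget, key=lambda p: p[1])


# Made-up measurements: (batch size, throughput per second, p99 latency in seconds).
measured = [(1, 200.0, 0.005), (8, 1300.0, 0.012), (64, 5000.0, 0.090)]

interactive_choice = pick_operating_point(measured, latency_budget_s=0.020)
overnight_choice = pick_operating_point(measured, latency_budget_s=math.inf)
print("interactive API:", interactive_choice)
print("overnight pipeline:", overnight_choice)
```

With these sample numbers, the interactive budget of 20 ms selects batch size 8, while the overnight pipeline, which has no meaningful latency budget, selects batch size 64.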

Measuring both together, not in isolation

This site’s LLM API load testing article recommends measuring throughput and latency together across a concurrency sweep, rather than each in isolation, because of this trade-off. A single throughput number or a single latency number, without the corresponding value of the other at the same operating point, doesn’t tell you where on the curve you are or whether that is the right place for your use case.
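A concurrency sweep of this kind can be sketched in a few lines. This is not the harness from the load testing article: `fake_request` is a hypothetical stand-in for a real client call, simulating a backend that serves only two requests at a time, and the concurrency levels and request counts are arbitrary.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical backend: at most two requests in service at once, ~10 ms each.
_backend_slots = threading.Semaphore(2)


def fake_request():
    """Simulated request; returns its latency in seconds, including queueing."""
    start = time.perf_counter()
    with _backend_slots:
        time.sleep(0.01)
    return time.perf_counter() - start


def run_level(concurrency, total_requests):
    """Run total_requests with a fixed worker count; return (throughput, p95)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: fake_request(), range(total_requests)))
    elapsed = time.perf_counter() - started
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=20)[-1]
    return total_requests / elapsed, p95


def sweep(levels, total_requests=40):
    """Measure throughput and p95 latency together at each concurrency level."""
    return [(c, *run_level(c, total_requests)) for c in levels]


results = sweep([1, 2, 4, 8])
for concurrency, throughput, p95 in results:
    print(f"concurrency={concurrency}  throughput={throughput:6.1f}/s  p95={p95 * 1000:6.1f} ms")
```

Because each row reports both numbers at the same operating point, the output shows where throughput stops improving and latency starts climbing, which is the information a single number hides.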

When the trade-off doesn’t apply

Not every system exhibits this trade-off. Some bottlenecks are purely about insufficient capacity, and adding resources improves both throughput and latency together: that is under-provisioning, not a trade-off. To recognize the difference, look at what happens when you raise concurrency or batch size. If throughput goes up and latency gets worse because of batching or queueing dynamics, not simply because the system is now over capacity, that is the genuine trade-off this article describes.

Takeaway: throughput and latency trade off against each other through mechanisms like batching and queueing. Recognizing when you face a genuine trade-off (versus simple under-provisioning) lets you tune deliberately toward whichever one your use case needs, rather than chasing both past the point where that is possible.
