Measuring LLM Inference Performance: Latency, Throughput, and Cost
The metrics that actually matter for LLM serving — TTFT, TPOT, tokens/sec, and cost per request — how they trade off, and how to load-test an inference endpoint.
Load-testing an LLM endpoint is not like load-testing a REST API. A traditional request has one latency number; an LLM request streams, its cost scales with output length, and the server’s throughput depends heavily on how requests are batched. Here are the metrics that matter and how they interact.
The four numbers
- TTFT — Time To First Token. How long before the user sees anything. Dominated by prompt processing (the “prefill”) plus queueing. This is your perceived-latency metric.
- TPOT — Time Per Output Token (a.k.a. inter-token latency). How fast tokens stream once
generation starts. 1 / TPOT is the per-request decode speed in tokens/sec.
- End-to-end latency. TTFT + (output_tokens × TPOT). Long outputs are dominated by the decode term, not prefill.
- Throughput. Total output tokens/sec across all concurrent requests. This is the server-level number that drives cost-efficiency.
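As a rough illustration of how the per-request numbers above can be captured on the client, here is a minimal Python sketch. It assumes the response arrives as a lazy iterable of tokens and that the request is effectively sent when iteration begins. The names `StreamTimings` and `time_stream` are hypothetical and not part of any client library.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTimings:
    ttft_s: float       # time from request start to the first token
    tpot_s: float       # mean gap between consecutive output tokens
    total_s: float      # end-to-end latency up to the last token
    output_tokens: int  # number of tokens received


def time_stream(tokens: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTimings:
    """Consume a token stream and record first-token and inter-token timing."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token defines TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    # TPOT averages the gaps between tokens, so it needs at least two tokens.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return StreamTimings(first - start, tpot, last - start, count)
```

In use, you would pass whatever token iterator your client returns to `time_stream` and collect one `StreamTimings` per request. Measured this way, the total is TTFT plus (output_tokens − 1) gaps, which is why the formula above is a close approximation for long outputs.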
The trap: per-request latency and server throughput pull in opposite directions. Bigger batches raise throughput (cheaper per token) but raise each request’s TPOT (slower for the individual user). Tuning an inference service is navigating that trade-off.
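The arithmetic behind that trade-off can be sketched in a few lines of Python. In steady-state decode, each step emits one token for every request in the batch, so server throughput is roughly batch size divided by TPOT. The batch sizes and TPOT values below are made-up illustrations, not measurements from any system.

```python
def aggregate_tokens_per_sec(batch_size: int, tpot_s: float) -> float:
    """Steady-state decode throughput: one token per request per decode step."""
    return batch_size / tpot_s


# Hypothetical (batch_size, TPOT in seconds) pairs, for illustration only.
scenarios = [(1, 0.020), (8, 0.025), (32, 0.040)]

for batch_size, tpot_s in scenarios:
    per_request = 1 / tpot_s  # what a single user experiences
    server = aggregate_tokens_per_sec(batch_size, tpot_s)
    print(f"batch={batch_size:>2}  per-request={per_request:5.1f} tok/s  "
          f"server={server:6.1f} tok/s")
```

With these assumed numbers, going from batch 1 to batch 32 halves the speed each user sees while multiplying server throughput sixteenfold.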
Why averages lie even more here
Under load, requests queue behind the current batch. A request that arrives just after a large batch starts must wait at least one decode step before it’s admitted, and with static batching it waits for the whole batch to finish. The result is a latency distribution with a long, fat tail, so p95 and p99 TTFT are the numbers your users feel. Reporting mean latency for an LLM service is close to malpractice.
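To see how far the mean can drift from what users feel, here is a small Python sketch using a nearest-rank percentile. The TTFT samples are invented for illustration: 90 requests admitted immediately and 10 that queued behind a batch.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds, for illustration only.
ttft_samples = [0.2] * 90 + [3.0] * 10

print(f"mean = {mean(ttft_samples):.2f}s")
print(f"p50  = {percentile(ttft_samples, 50):.2f}s")
print(f"p95  = {percentile(ttft_samples, 95):.2f}s")
```

For this sample the mean lands at about 0.48 s, a latency that no request actually experienced, while p50 is 0.2 s and p95 is 3.0 s.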
Cost is a performance metric
Unlike a CPU-bound API, every token has a price. With prices quoted per million tokens, cost per request is:
cost = input_tokens/1e6 × input_price
     + output_tokens/1e6 × output_price
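The same formula as a small Python function, swept across a few output lengths. The prices here are placeholders chosen for illustration and are not any provider's actual rates; `cost_per_request` is a hypothetical helper name.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost of one request, with prices given per million tokens."""
    return (input_tokens / 1e6 * input_price
            + output_tokens / 1e6 * output_price)


# Hypothetical prices in dollars per million tokens, for illustration only.
INPUT_PRICE = 1.00
OUTPUT_PRICE = 4.00

for output_tokens in (128, 512, 2048):
    cost = cost_per_request(4000, output_tokens, INPUT_PRICE, OUTPUT_PRICE)
    print(f"4,000 in / {output_tokens:>5} out -> ${cost:.6f}")
```

Multiplying the result by expected requests per day gives a quick budget check for a given output cap.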
Output tokens are typically billed several times higher than input, and they’re also the term that drives latency. So the two highest-leverage optimizations are the same:
- Cap output length. max_tokens is both a cost control and a tail-latency control.
- Raise decode throughput. Continuous batching, speculative decoding, quantization, and KV-cache reuse all increase tokens/sec per dollar.
Estimate your own numbers with the LLM Cost & Latency Estimator.
How to actually load-test it
- Use realistic prompts. Prefill cost scales with input length; a 100-token prompt and a 4,000-token prompt are completely different workloads. Replay production-shaped traffic.
- Pin output length per scenario. Variable max_tokens makes results unrepeatable. Test fixed buckets (e.g. 128 / 512 / 2,048 output tokens) separately.
- Ramp concurrency, watch the knee. Plot throughput (tokens/sec) and p95 TTFT against concurrent requests. Throughput climbs, then flattens; latency stays flat, then explodes. The knee is your usable capacity.
- Measure streaming, not just total time. A client that waits for the full response hides TTFT entirely — instrument first-token and inter-token timing explicitly.
- Separate prefill- and decode-bound scenarios. Long-prompt/short-output stresses prefill; short-prompt/long-output stresses decode. They scale differently and fail differently.
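One way to turn the concurrency ramp into a number is sketched below in Python. It is a simple heuristic, not a standard algorithm: it walks the ramp in order and stops at the last step where p95 TTFT is still within budget and throughput still grew by a minimum fraction. `RampPoint`, `find_knee`, the 5% threshold, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RampPoint:
    concurrency: int       # concurrent requests at this step
    tokens_per_sec: float  # aggregate output throughput
    p95_ttft_s: float      # p95 time to first token, in seconds


def find_knee(points: list[RampPoint], ttft_budget_s: float,
              min_gain: float = 0.05) -> RampPoint:
    """Return the last step before throughput flattens or p95 TTFT exceeds the budget."""
    ordered = sorted(points, key=lambda p: p.concurrency)
    if not ordered:
        raise ValueError("no ramp points")
    if ordered[0].p95_ttft_s > ttft_budget_s:
        raise ValueError("lowest concurrency level already misses the TTFT budget")
    knee = ordered[0]
    for prev, cur in zip(ordered, ordered[1:]):
        # Relative throughput gain from the previous step to this one.
        gain = (cur.tokens_per_sec - prev.tokens_per_sec) / prev.tokens_per_sec
        if cur.p95_ttft_s > ttft_budget_s or gain < min_gain:
            break
        knee = cur
    return knee


# Hypothetical ramp results, for illustration only.
ramp = [
    RampPoint(1, 50.0, 0.20),
    RampPoint(4, 190.0, 0.22),
    RampPoint(16, 640.0, 0.30),
    RampPoint(32, 820.0, 0.45),
    RampPoint(64, 850.0, 2.10),
    RampPoint(128, 855.0, 6.50),
]

print(find_knee(ramp, ttft_budget_s=1.0).concurrency)
```

Run it once per scenario (each output-length bucket, and the prefill-bound and decode-bound cases separately), since each will have its own knee.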
The takeaway
An LLM service has a latency budget (TTFT plus output_tokens × TPOT), a throughput ceiling, and a per-token cost, and they’re all coupled through batch size. Treat it like any other capacity-planning problem: find the knee of the throughput/latency curve, set SLOs on p95 TTFT, and let cost per request keep the optimization honest.