How do you estimate LLM inference latency?

End-to-end latency ≈ time-to-first-token (TTFT) + (output tokens ÷ decode speed in tokens/sec). TTFT covers prompt processing/queueing; the decode term dominates for long outputs.
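As a rough illustration of the latency formula above, here is a minimal Python sketch. The function name and the sample values (0.5 s TTFT, 500 output tokens, 50 tokens/sec) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def estimate_latency_s(ttft_s: float, output_tokens: int, decode_tok_per_s: float) -> float:
    """Approximate end-to-end latency in seconds: TTFT plus decode time."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode speed must be positive")
    # Decode time is the number of output tokens divided by per-request decode speed.
    return ttft_s + output_tokens / decode_tok_per_s


# Hypothetical workload: 0.5 s TTFT, 500 output tokens, 50 tokens/sec.
latency = estimate_latency_s(ttft_s=0.5, output_tokens=500, decode_tok_per_s=50.0)
print(f"{latency:.1f} s")  # 10.5 s
```

Here the decode term (10 s) is twenty times the TTFT, which is the sense in which it dominates for long outputs.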

How is LLM token cost calculated?

Cost per request = (input_tokens ÷ 1,000,000 × input_price) + (output_tokens ÷ 1,000,000 × output_price), with both prices quoted in $ per 1M tokens. Multiply by monthly request volume for the monthly bill. Output tokens are usually priced several times higher per token than input tokens.
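The cost formula above can be sketched in Python as follows. The function names are illustrative, and the prices ($3 and $15 per 1M tokens) and volume (100,000 requests/month) are made-up example values, not any provider's actual rates.

```python
TOKENS_PER_MILLION = 1_000_000


def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in dollars, with prices given in $ per 1M tokens."""
    input_cost = input_tokens / TOKENS_PER_MILLION * input_price_per_m
    output_cost = output_tokens / TOKENS_PER_MILLION * output_price_per_m
    return input_cost + output_cost


def cost_per_month(per_request: float, requests_per_month: int) -> float:
    """Monthly bill in dollars: per-request cost times request volume."""
    return per_request * requests_per_month


# Hypothetical workload: 2,000 input and 500 output tokens per request.
per_request = cost_per_request(2000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
monthly = cost_per_month(per_request, requests_per_month=100_000)
print(f"${per_request:.4f} per request, ${monthly:,.2f} per month")
```

With these example numbers the 500 output tokens cost more ($0.0075) than the 2,000 input tokens ($0.0060), because of the 5× price ratio assumed here.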

LLM Cost & Latency Estimator

Estimate per-request and monthly token cost plus end-to-end latency (TTFT + decode) for an LLM workload. Free, instant, no signup.

Inputs: input tokens / request (tok), output tokens / request (tok), input price ($ / 1M), output price ($ / 1M), requests / month (req), decode speed (tok / sec), time to first token.

Outputs: cost / request, cost / month, and approximate average latency, with a breakdown of input cost / request, output cost / request, and generation time (output ÷ decode speed).

The model

Two independent estimates, both driven by token counts:

cost_per_request = input_tokens/1e6 * input_price
                 + output_tokens/1e6 * output_price
cost_per_month   = cost_per_request * requests_per_month

latency ≈ TTFT + output_tokens / decode_speed

Output tokens often dominate both estimates: they're typically priced several times higher than input, and they're the term that scales latency through the decode loop. Two practical levers follow. Capping output length lowers cost and latency at once. Raising decode throughput (batching, faster hardware, smaller or quantized models) lowers latency, and lowers cost as well when you pay for the serving hardware rather than a fixed per-token price.
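To compare the two levers, the sketch below runs the model's formulas on a baseline and two variants. It assumes fixed per-token pricing, so a faster decode changes latency but not the price; under self-hosted, time-based costs that would differ. All names and numbers are hypothetical.

```python
def estimate(input_tokens, output_tokens, input_price, output_price, ttft_s, decode_speed):
    """Return (cost_per_request_usd, latency_s) from the two formulas in the model."""
    cost = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
    latency = ttft_s + output_tokens / decode_speed
    return cost, latency


# Hypothetical baseline: prices in $ per 1M tokens, decode speed in tokens/sec.
base = dict(input_tokens=2000, output_tokens=1000,
            input_price=3.0, output_price=15.0,
            ttft_s=0.5, decode_speed=50.0)

scenarios = {
    "baseline": base,
    "cap output at 500": {**base, "output_tokens": 500},
    "double decode speed": {**base, "decode_speed": 100.0},
}

results = {name: estimate(**params) for name, params in scenarios.items()}
for name, (cost, latency) in results.items():
    print(f"{name:<20} ${cost:.4f}  {latency:.1f} s")
```

In this example both variants bring latency from 20.5 s down to 10.5 s, but only the output cap also reduces the per-request cost (from $0.0210 to $0.0135).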

Caveats

This is a single-request, steady-state estimate. Real p95/p99 latency rises sharply under load as requests queue behind the batch.
Decode speed (tokens/sec) is per-request; server throughput across concurrent requests can be far higher with continuous batching.
Prompt caching, speculative decoding, and KV-cache reuse can change both cost and TTFT materially.
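As one example of how a caveat can shift the estimate, the sketch below adjusts the input cost for prompt caching. It assumes a simple scheme in which a fraction of input tokens is served from cache and billed at a discounted multiple of the normal input price. Both parameters (cached_fraction, cached_price_multiplier) are hypothetical; actual discounts and billing rules vary by provider.

```python
def input_cost_with_cache(input_tokens: int, input_price_per_m: float,
                          cached_fraction: float, cached_price_multiplier: float) -> float:
    """Input cost in dollars when part of the prompt is billed at a cached rate.

    cached_fraction and cached_price_multiplier are illustrative assumptions,
    not any provider's actual billing parameters.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = input_tokens * cached_fraction
    uncached_tokens = input_tokens - cached_tokens
    # Uncached tokens pay full price; cached tokens pay the discounted price.
    dollars_times_1m = (uncached_tokens * input_price_per_m
                        + cached_tokens * input_price_per_m * cached_price_multiplier)
    return dollars_times_1m / 1_000_000


# Hypothetical: half of a 2,000-token prompt is cached and billed at 10% of list price.
full = input_cost_with_cache(2000, 3.0, cached_fraction=0.0, cached_price_multiplier=0.1)
cached = input_cost_with_cache(2000, 3.0, cached_fraction=0.5, cached_price_multiplier=0.1)
print(f"no cache: ${full:.4f}, with cache: ${cached:.4f}")
```

Only the input term changes in this sketch; the output term and the decode part of latency are unaffected.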
