LLM Cost & Latency Estimator
Estimate per-request and monthly token cost plus end-to-end latency (TTFT + decode) for an LLM workload. Free, instant, no signup.
- Input cost / req: —
- Output cost / req: —
- Generation time: — (output ÷ decode speed)
The model
Two independent estimates, both driven by token counts:
cost_per_request = input_tokens/1e6 * input_price
+ output_tokens/1e6 * output_price
cost_per_month = cost_per_request * requests_per_month
latency ≈ TTFT + output_tokens / decode_speed
Output tokens tend to dominate both estimates: they are typically priced several times higher than input tokens, and they are the term that scales latency through the decode loop. Two practical levers follow from this. Capping output length lowers cost and latency together. Raising decode throughput (batching, faster hardware, smaller/quantized models) shortens generation time, and also lowers cost when you pay for hardware time rather than per token.
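The three formulas above can be sketched in Python as follows. This is a minimal illustration rather than the estimator's actual source: the `Workload` class, its field names, and the example prices and speeds are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical container for the estimator's inputs."""
    input_tokens: int        # prompt tokens per request
    output_tokens: int       # generated tokens per request
    input_price: float       # USD per 1M input tokens
    output_price: float      # USD per 1M output tokens
    requests_per_month: int
    ttft_s: float            # time to first token, in seconds
    decode_speed: float      # tokens/sec seen by a single request


def cost_per_request(w: Workload) -> float:
    """Token cost of one request in USD."""
    return (w.input_tokens / 1e6 * w.input_price
            + w.output_tokens / 1e6 * w.output_price)


def cost_per_month(w: Workload) -> float:
    """Monthly cost in USD, assuming every request looks the same."""
    return cost_per_request(w) * w.requests_per_month


def latency_s(w: Workload) -> float:
    """Single-request latency in seconds: TTFT plus decode time."""
    if w.decode_speed <= 0:
        raise ValueError("decode_speed must be positive")
    return w.ttft_s + w.output_tokens / w.decode_speed


# Illustrative numbers only; real prices and speeds vary by provider and model.
example = Workload(
    input_tokens=2000, output_tokens=500,
    input_price=3.0, output_price=15.0,
    requests_per_month=100_000,
    ttft_s=0.5, decode_speed=50.0,
)

if __name__ == "__main__":
    print(f"cost / request: ${cost_per_request(example):.4f}")
    print(f"cost / month:   ${cost_per_month(example):,.2f}")
    print(f"latency:        {latency_s(example):.1f} s")
```

With these made-up inputs, the 500 output tokens cost more than the 2,000 input tokens ($0.0075 versus $0.0060) and account for 10 of the 10.5 seconds of latency, which is the pattern the paragraph above describes.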
Caveats
- This is a single-request, steady-state estimate. Real p95/p99 latency rises sharply under load as requests queue behind the batch.
- Decode speed (tokens/sec) is per-request; server throughput across concurrent requests can be far higher with continuous batching.
- Prompt caching, speculative decoding, and KV-cache reuse can change both cost and TTFT materially.
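To illustrate how much the prompt-caching caveat can move the input-cost term, here is a rough Python sketch. It assumes a simple billing scheme in which a fraction of the prompt is read from cache at a reduced rate; `cached_fraction` and `cached_price_ratio` are hypothetical parameters, and actual provider billing rules (cache-write surcharges, minimum prefix lengths, expiry) differ and are not modeled here.

```python
def input_cost_with_cache(
    input_tokens: int,
    input_price: float,         # USD per 1M uncached input tokens
    cached_fraction: float,     # share of the prompt served from cache (0..1)
    cached_price_ratio: float,  # cached-token price as a fraction of input_price
) -> float:
    """Input cost per request in USD under the simplified caching assumption."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    return (fresh_tokens * input_price
            + cached_tokens * input_price * cached_price_ratio) / 1e6


if __name__ == "__main__":
    # Illustrative numbers only.
    uncached = input_cost_with_cache(2000, 3.0, 0.0, 0.1)
    cached = input_cost_with_cache(2000, 3.0, 0.75, 0.1)
    print(f"uncached input cost: ${uncached:.5f}")
    print(f"75% cached:          ${cached:.5f}")
```

Under these assumed numbers the input cost per request falls from $0.00600 to $0.00195. The output-cost term is unchanged, so the overall saving depends on how prompt-heavy the workload is.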