Token Economics 101: Understanding LLM API Cost Structure
How LLM API pricing actually works — input vs output token pricing, why output costs more, and the practical levers for controlling cost.
Understanding token-based pricing is the foundation for any serious cost planning around LLM-powered features — and a few non-obvious aspects of how it works trip up teams budgeting for the first time.
Tokens, not words or characters
LLM pricing and context limits are denominated in tokens, sub-word units produced by the model’s tokenizer — roughly 4 characters or three-quarters of a word in English on average, though this varies by language (some languages tokenize considerably less efficiently than English) and by content type (code, especially with unusual formatting, often tokenizes less efficiently than prose). Estimating cost from word count alone is directionally fine for rough English-prose estimates but can be meaningfully off for code-heavy or non-English content.
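As a rough illustration of the 4-characters-per-token rule of thumb, here is a minimal sketch of a character-based estimator. The function name `estimate_tokens` and the `chars_per_token` parameter are hypothetical, and the result is only a heuristic: an exact count requires running the specific model's tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; a heuristic, not a tokenizer."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


prose = "Estimate tokens from characters, not words."
print(estimate_tokens(prose))                       # default English-prose ratio
print(estimate_tokens(prose, chars_per_token=3.0))  # assumed ratio for denser content
```

Lowering `chars_per_token` is a crude way to budget for code or non-English text. The right value for a given workload is best measured on a sample of real inputs.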
Why output tokens cost more than input tokens
Most providers price output tokens at several times the rate of input tokens (commonly 3-5x, varying by provider and model). This reflects the underlying compute asymmetry covered in this site’s LLM inference metrics article: input tokens are processed once, in parallel, during prefill; output tokens are generated one at a time, sequentially, during decode — each output token requires its own full forward pass through the model, making it inherently more expensive to produce than to merely read.
The practical implication: output length is your biggest cost lever
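Because output tokens dominate both cost and latency (also covered in this site’s LLM inference metrics article), capping max_tokens appropriately for your actual use case is usually the single highest-leverage cost control available. For most applications it is more impactful than switching to a marginally cheaper model, and it helps latency at the same time. Keep in mind that max_tokens is a hard cap: generation simply stops when the limit is reached, so a response can be cut off mid-sentence. Pair the cap with prompt instructions that ask for concise output.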
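To make the arithmetic concrete, the sketch below computes the cost of one call and the share of that cost coming from output tokens. The prices (1.00 per million input tokens, 4.00 per million output tokens) are made-up placeholders chosen to sit inside the 3-5x range, not any provider's real rates, and both function names are hypothetical.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one call, with prices quoted per million tokens."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


def output_cost_share(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Fraction of a call's cost that comes from output tokens."""
    total = call_cost(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok)
    if total == 0:
        return 0.0
    return (output_tokens * output_price_per_mtok / 1_000_000) / total


# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
IN_PRICE, OUT_PRICE = 1.0, 4.0

long_answer = call_cost(500, 1000, IN_PRICE, OUT_PRICE)
short_answer = call_cost(500, 500, IN_PRICE, OUT_PRICE)
print(f"1000 output tokens: {long_answer:.4f}")
print(f" 500 output tokens: {short_answer:.4f}")
print(f"output share: {output_cost_share(500, 1000, IN_PRICE, OUT_PRICE):.0%}")
```

With these assumed numbers, halving the output length removes a large part of the per-call cost. The effect is smaller when input is very large relative to output (for example, a long document with a short answer), so it is worth checking the share for your own traffic.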
Context window cost accumulation in multi-turn conversations
In a multi-turn chat application, each new turn typically resends the entire conversation history as input tokens, since most LLM APIs are stateless between calls. Per-call input tokens therefore grow with every turn, even when the user only typed one short new message, and absent any history truncation the cumulative input tokens billed over a conversation grow roughly quadratically with the number of turns. This is a commonly underestimated cost driver: teams budgeting based on “average single message length” rather than “average full conversation context size at the point of each call” frequently underestimate real costs substantially.
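The accumulation effect can be sketched with a small simulation. It assumes a stateless API where the full history is resent on every call, with fixed token counts per system prompt, user message, and reply. The function name and all numbers are hypothetical.

```python
def input_tokens_per_turn(turns: int, system_tokens: int,
                          user_tokens: int, reply_tokens: int) -> list[int]:
    """Input tokens billed on each call when the full history is resent."""
    history = system_tokens
    per_call = []
    for _ in range(turns):
        history += user_tokens    # the new user message joins the history
        per_call.append(history)  # the whole history is sent as input
        history += reply_tokens   # the reply is part of the next call's input
    return per_call


per_call = input_tokens_per_turn(turns=3, system_tokens=100,
                                 user_tokens=20, reply_tokens=80)
print(per_call)       # [120, 220, 320]
print(sum(per_call))  # 660 input tokens billed across three calls
print(3 * 20)         # 60 if you only count the new user messages
```

Even in this three-turn example, billed input is an order of magnitude larger than the text the user actually typed, and the gap widens with each additional turn.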
Why prompt caching matters for cost, not just latency
This site’s dedicated prompt caching article covers the latency angle; the cost angle is equally significant — cached input tokens are typically billed at a steep discount, meaning applications with large, stable shared context (long system prompts, large reference documents) can cut a meaningful fraction of their input-token cost simply by structuring prompts to maximize cacheable, stable prefixes.
Estimating cost before building
Before committing to a feature’s design, estimate: average input tokens per call (including accumulated conversation history if multi-turn), average output tokens per call, expected call volume, and the specific model’s per-token pricing — multiplying these out gives a real cost estimate worth sanity-checking against your budget before the feature ships, not after the first surprising bill. This site’s LLM cost & latency estimator does exactly this calculation.
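The multiplication described above can be written out as a short sketch. The function name, traffic figures, and prices are all hypothetical placeholders. Substitute your own measurements and the per-token pricing of the model you plan to use.

```python
def estimate_monthly_cost(avg_input_tokens: float, avg_output_tokens: float,
                          calls_per_month: int,
                          input_price_per_mtok: float,
                          output_price_per_mtok: float) -> float:
    """Monthly cost from average tokens per call, call volume, and per-million prices."""
    per_call = (avg_input_tokens * input_price_per_mtok
                + avg_output_tokens * output_price_per_mtok) / 1_000_000
    return per_call * calls_per_month


# Hypothetical feature: 2,000 input tokens (history included) and 300 output
# tokens per call, 500,000 calls per month, at placeholder prices.
monthly = estimate_monthly_cost(
    avg_input_tokens=2_000,
    avg_output_tokens=300,
    calls_per_month=500_000,
    input_price_per_mtok=1.0,
    output_price_per_mtok=4.0,
)
print(f"estimated monthly cost: {monthly:,.2f}")
```

Running the same function with pessimistic inputs (longer conversations, higher volume) gives a quick upper bound to compare against the budget.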
Model selection as a cost lever, with real trade-offs
Smaller/cheaper models cost less per token but may need more output tokens to reach an equivalent quality answer (more verbose reasoning, more retries on failure) or may simply produce lower-quality output for complex tasks — a true cost comparison needs to account for quality-adjusted cost, not just sticker price per token, especially for tasks where a cheaper model’s higher failure/retry rate could erase its per-token price advantage.
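One simple way to put a number on quality-adjusted cost is to divide the cost of an attempt by the probability that the attempt succeeds. The sketch below assumes each attempt succeeds independently with a fixed probability and that failures are retried until one succeeds, so the expected number of attempts is 1 / success rate. The costs and success rates are invented for illustration, and the function name is hypothetical.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost of one successful result when failures are retried.

    Assumes independent attempts with a fixed success probability,
    so the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Invented numbers: the cheaper model costs a third as much per attempt
# but succeeds far less often on this (hypothetical) task.
cheap = cost_per_success(cost_per_attempt=0.001, success_rate=0.30)
strong = cost_per_success(cost_per_attempt=0.003, success_rate=0.95)
print(f"cheap model:  {cheap:.5f} per successful result")
print(f"strong model: {strong:.5f} per successful result")
```

With these made-up inputs the cheaper model ends up slightly more expensive per successful result. Real success rates have to be measured on your own task, and this model ignores the latency and user-experience cost of retries.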
Takeaway: token volume is usually the dominant and most controllable cost factor in LLM-powered applications, more so than model selection alone. On the output side it is driven by response length (bounded by max_tokens); on the input side, by accumulated conversation history in multi-turn applications.