How LLM inference cost is calculated

Almost every dollar you will ever spend on LLM inference reduces to one small piece of arithmetic: tokens multiplied by a per-token price. Get that math right and the rest — monthly budgets, provider comparisons, the self-hosting question — follows cleanly. This guide walks through exactly how inference cost is calculated, why input and output are priced differently, and how to turn a vague sense of "we send a lot of text" into a defensible monthly number.

The unit of cost is the token, not the request

LLM APIs do not bill per request, per user, or per minute. They bill per token: a sub-word fragment the tokenizer produces from your text. As a rule of thumb, one token is about four characters of English, or roughly three-quarters of a word. A 500-word email is therefore about 670 tokens (500 ÷ 0.75); this guide is a few thousand. The exact count depends on the tokenizer and the language (code and non-English text tokenize less efficiently), but the heuristic is good enough for budgeting.
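The rule of thumb above can be sketched in a few lines of Python. The function names are illustrative and not part of any library, and the ratios (about 4/3 tokens per word, about 4 characters per token) are the same English-text heuristics, so a real tokenizer will return somewhat different counts.

```python
def estimate_tokens_from_words(words: int) -> int:
    """Rough token count: about 4/3 tokens per English word."""
    return round(words * 4 / 3)


def estimate_tokens_from_chars(chars: int) -> int:
    """Rough token count: about one token per 4 characters."""
    return round(chars / 4)


print(estimate_tokens_from_words(500))   # 667
print(estimate_tokens_from_chars(2600))  # 650
```

For budgeting, either estimate is close enough; for an exact count, run the text through the tokenizer of the model you actually plan to use.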

Because the token is the unit, the first thing to internalize is that length drives cost linearly. Double the prompt, double the input cost. Ask for an answer twice as long, double the output cost. There is no volume floor and no rounding up to a "plan" — which is liberating at small scale and dangerous at large scale, because nothing stops the meter from running.

Two meters: input and output

The single most important fact about LLM pricing is that there are two meters, not one. The input meter counts every token in your prompt: the system message, the conversation history, any retrieved documents (RAG context), and the user's new question. The output meter counts every token the model writes back. Each has its own price, and the two are simply added.

These two prices are rarely equal. Output is typically priced 3–5× higher than input. The reason is mechanical: input tokens are consumed in a single parallel "prefill" pass, but output tokens are generated autoregressively, one token at a time, each one requiring its own forward pass through the model. Generation is the slow, sequential part of inference, and the price reflects it.

The core formula.
Input cost = input tokens ÷ 1,000,000 × input price
Output cost = output tokens ÷ 1,000,000 × output price
Total = input cost + output cost
Blended price per 1M = (input tokens × input price + output tokens × output price) ÷ (input tokens + output tokens)

The token cost calculator applies exactly this formula and breaks out where the money goes. The split matters: two teams pushing identical total token volume can receive wildly different invoices — one summarizing long documents into short notes (input-heavy, cheap), the other drafting long articles from short briefs (output-heavy, expensive).
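As a minimal sketch of that arithmetic, the function below applies the two-meter formula and returns the breakdown. The names `CostBreakdown` and `inference_cost` are illustrative, and the two teams, their token volumes, and the $3.00 / $15.00 price pair are assumed figures chosen only to show how the split changes the invoice.

```python
from dataclasses import dataclass


@dataclass
class CostBreakdown:
    input_cost: float
    output_cost: float
    total: float
    blended_per_million: float


def inference_cost(input_tokens, output_tokens, input_price, output_price):
    """Apply the two-meter formula; prices are in dollars per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    total_tokens = input_tokens + output_tokens
    # Blended price weights each per-million price by its share of the tokens.
    blended = (
        (input_tokens * input_price + output_tokens * output_price) / total_tokens
        if total_tokens
        else 0.0
    )
    return CostBreakdown(input_cost, output_cost, input_cost + output_cost, blended)


# Two hypothetical teams, each pushing 1,000,000 tokens at $3 in / $15 out.
summarizer = inference_cost(900_000, 100_000, 3.00, 15.00)  # input-heavy
drafter = inference_cost(100_000, 900_000, 3.00, 15.00)     # output-heavy

print(f"summarizer: ${summarizer.total:.2f}")  # summarizer: $4.20
print(f"drafter:    ${drafter.total:.2f}")     # drafter:    $13.80
```

Same total volume, same price list, and the output-heavy team pays more than three times as much.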

A worked example

Suppose you send 1,000,000 input tokens and generate 500,000 output tokens in a month, on a model priced at $3.00 per million input and $15.00 per million output — a typical mid-tier mix where output is 5× input.

  • Input: 1,000,000 ÷ 1,000,000 × $3.00 = $3.00
  • Output: 500,000 ÷ 1,000,000 × $15.00 = $7.50
  • Total: $3.00 + $7.50 = $10.50
  • Blended price: (1,000,000 × $3.00 + 500,000 × $15.00) ÷ 1,500,000 = $7.00 per 1M

Notice that the half-million output tokens cost 2.5 times as much as the full million input tokens. That single observation explains most "why is our bill so high?" surprises: the usual culprit is generation length, not the prompt. The blended figure of $7.00 per million is the number to carry forward when you compare providers or estimate self-hosting break-even, because it already accounts for your real 2:1 input-to-output mix.

The $/1M model: one number to rule comparisons

Pricing pages quote two numbers per model (input and output), which makes head-to-head comparison awkward. The blended $/1M tokens figure collapses both into one rate tied to your traffic shape. It is the right basis for almost every downstream decision:

  • Provider comparison. A provider with cheap input but expensive output may lose to a "pricier" rival once your output-heavy mix is applied. Always blend before you rank.
  • Budgeting. Monthly cost is just blended price × monthly token volume ÷ 1,000,000 — the basis of the monthly API spend calculator.
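  • Self-hosting break-even. The crossover with a fixed-cost GPU is found by dividing the GPU's monthly cost by your blended price; the result is the monthly volume, in millions of tokens, at which API spend equals the GPU's cost. This is covered in the break-even guide.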
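A minimal sketch of the provider-comparison and break-even points, using made-up numbers: the two providers, their prices, the traffic mix, and the GPU cost below are all hypothetical placeholders, not quotes from any real price list. The break-even line also assumes the GPU can actually serve that volume, which a fuller analysis would need to check.

```python
def blended_price(input_tokens, output_tokens, input_price, output_price):
    """Blended dollars per 1M tokens for a given input/output mix."""
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / (input_tokens + output_tokens)


# Hypothetical (input price, output price) pairs in dollars per 1M tokens.
providers = {
    "provider_a": (1.00, 10.00),  # cheap input, expensive output
    "provider_b": (2.50, 5.00),   # pricier input, cheaper output
}

# Hypothetical output-heavy monthly mix: 200k input tokens, 800k output tokens.
input_tokens, output_tokens = 200_000, 800_000

blended = {
    name: blended_price(input_tokens, output_tokens, in_price, out_price)
    for name, (in_price, out_price) in providers.items()
}
for name, price in sorted(blended.items(), key=lambda item: item[1]):
    print(f"{name}: ${price:.2f} per 1M")
# provider_b: $4.50 per 1M
# provider_a: $8.20 per 1M

# Break-even volume, in millions of tokens per month, against a fixed GPU cost.
gpu_monthly_cost = 1_800.00  # hypothetical
cheapest = min(blended.values())
break_even_millions = gpu_monthly_cost / cheapest
print(f"break-even: {break_even_millions:.0f}M tokens/month")
# break-even: 400M tokens/month
```

The provider with the lower input price loses here because the mix is 80% output, which is the reason to blend before ranking.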

One caution: the blended price is only valid for the mix it was computed on. Change your input:output ratio and it moves. If you run several distinct workloads (a chat product and a batch summarization job, say), blend each separately rather than averaging the whole account.
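To see why account-wide averaging misleads, here is a small sketch with two hypothetical workloads on the same $3.00 / $15.00 price pair used in the worked example. The workload names and token volumes are invented for illustration.

```python
def blended_price(input_tokens, output_tokens, input_price=3.00, output_price=15.00):
    """Blended dollars per 1M tokens for a given input/output mix."""
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / (input_tokens + output_tokens)


# Hypothetical monthly volumes as (input tokens, output tokens).
workloads = {
    "chat": (1_000_000, 1_000_000),
    "batch_summarization": (9_000_000, 1_000_000),
}

per_workload = {
    name: blended_price(tokens_in, tokens_out)
    for name, (tokens_in, tokens_out) in workloads.items()
}
total_in = sum(tokens_in for tokens_in, _ in workloads.values())
total_out = sum(tokens_out for _, tokens_out in workloads.values())
account_wide = blended_price(total_in, total_out)

for name, price in per_workload.items():
    print(f"{name}: ${price:.2f} per 1M")
print(f"account-wide: ${account_wide:.2f} per 1M")
# chat: $9.00 per 1M
# batch_summarization: $4.20 per 1M
# account-wide: $5.00 per 1M
```

The account-wide $5.00 matches neither workload, so projecting growth in chat traffic from it would understate the cost.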

From a single call to a monthly bill: estimating volume

The formula is trivial; the hard part is estimating monthly token volume before you have a bill to look at. Work from the request, not the month:

  • Tokens per request. Estimate the average prompt size (system prompt + context + user message) and the average completion size. The token estimator converts words or characters to tokens.
  • Requests per month. Multiply requests per active user per month by your active user count, or take a measured request rate in requests per second and multiply by the seconds in a month (about 2.6 million for 30 days).
  • Multiply and split. Monthly input tokens = avg input per request × requests; monthly output tokens = avg output per request × requests. Feed both into the formula.
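The three steps above can be sketched as follows. Every figure (users, requests per user, average token sizes, request rate, and the $3.00 / $15.00 prices) is a hypothetical placeholder, and `monthly_volume` is an illustrative helper, not an existing API.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 for a 30-day month


def monthly_volume(avg_input_per_request, avg_output_per_request, requests_per_month):
    """Return (monthly input tokens, monthly output tokens)."""
    return (
        avg_input_per_request * requests_per_month,
        avg_output_per_request * requests_per_month,
    )


# Route 1: requests per active user x user count (hypothetical figures).
requests_per_month = 150 * 2_000  # 300,000 requests

# Route 2 (alternative): a measured rate in requests per second.
requests_from_rate = 0.5 * SECONDS_PER_MONTH  # 1,296,000 requests

# Hypothetical averages: 1,200 input tokens and 300 output tokens per request.
monthly_in, monthly_out = monthly_volume(1_200, 300, requests_per_month)
cost = monthly_in / 1_000_000 * 3.00 + monthly_out / 1_000_000 * 15.00
print(f"{monthly_in:,} in, {monthly_out:,} out -> ${cost:,.2f}")
# 360,000,000 in, 90,000,000 out -> $2,430.00
```

Keeping the per-request averages as explicit inputs makes it easy to rerun the estimate once real traffic data replaces the guesses.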

Two adjustments catch people out. First, conversation history compounds: in a multi-turn chat, every prior turn is re-sent as input on the next call, so the input per call grows linearly with each turn and the total input billed over the conversation grows quadratically with its length unless you truncate or summarize. Second, retrieval-augmented generation can dwarf the user's actual question: a few retrieved documents can be thousands of input tokens per call. Both inflate the input side specifically, which is the cheap side, but at scale even cheap tokens add up.
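The history effect is easy to simulate. The sketch below assumes a simplified chat in which every turn has the same user-message and reply size and the full history is re-sent on each call; the function name and token sizes are illustrative.

```python
def input_tokens_per_call(turns, system_tokens, user_tokens, reply_tokens):
    """Input tokens billed on each call when the full history is re-sent."""
    per_call = []
    history = 0  # tokens from all earlier user messages and replies
    for _ in range(turns):
        per_call.append(system_tokens + history + user_tokens)
        history += user_tokens + reply_tokens
    return per_call


ten = input_tokens_per_call(10, system_tokens=200, user_tokens=50, reply_tokens=150)
twenty = input_tokens_per_call(20, system_tokens=200, user_tokens=50, reply_tokens=150)

print(ten[0], ten[-1])        # 250 2050
print(sum(ten), sum(twenty))  # 11500 43000
```

Each call's input grows by a fixed 200 tokens, but the running total does not scale in proportion: doubling the conversation from 10 to 20 turns raises total billed input by roughly 3.7×, not 2×.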

Watch the input side at scale. Because output is priced higher, teams instinctively guard generation length and ignore the prompt. But long system prompts, full chat history, and large RAG context are sent on every single call. At high request volume, a bloated-but-cheap prompt can quietly become the largest line on the bill. Prompt caching exists precisely to attack this.

Where the simple model stops

The token formula is exact for straight, list-price API usage, and that covers the majority of real spend. A few factors sit on top of it:

  • Caching and batch discounts reduce the effective price for reused prompts or offline jobs — see the cached & batch discount calculator.
  • Reasoning / "thinking" tokens on some models are billed as output even though the user never sees them, which can multiply the output side.
  • Price changes. Per-million prices move; the bundled defaults on this site are dated and sourced, and every price is an editable input so the math stays correct even when a default goes stale.

None of these change the underlying arithmetic — they adjust the price you plug into it. The discipline is always the same: count tokens on both meters, multiply by their prices, add. For the precise definitions and rounding rules behind every figure on this site, see the methodology; to put the formula to work, start with the token cost calculator for a single workload and the monthly API spend calculator to project it across a month of traffic.
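As a sketch of how those adjustments plug into the same arithmetic, the function below treats reasoning tokens as extra output, bills cached input at a separate rate, and applies an optional flat batch discount. Every rate and volume here is a hypothetical placeholder, and how caching and batch discounts combine differs by provider, so treat the stacking shown as an assumption.

```python
def adjusted_cost(
    fresh_input_tokens,
    cached_input_tokens,
    visible_output_tokens,
    reasoning_tokens,
    input_price,
    cached_input_price,
    output_price,
    batch_discount=0.0,
):
    """Dollars for one workload; prices are per 1M tokens, discount is 0..1."""
    input_cost = (
        fresh_input_tokens / 1_000_000 * input_price
        + cached_input_tokens / 1_000_000 * cached_input_price
    )
    # On some models, hidden reasoning tokens are billed on the output meter.
    billed_output = visible_output_tokens + reasoning_tokens
    output_cost = billed_output / 1_000_000 * output_price
    # Assumption: a flat batch discount applies to the whole bill.
    return (input_cost + output_cost) * (1.0 - batch_discount)


# Hypothetical workload: 1M input tokens (80% served from cache), 100k visible
# output tokens, and 300k reasoning tokens the user never sees.
cost = adjusted_cost(
    fresh_input_tokens=200_000,
    cached_input_tokens=800_000,
    visible_output_tokens=100_000,
    reasoning_tokens=300_000,
    input_price=3.00,
    cached_input_price=0.30,  # hypothetical cached-input rate
    output_price=15.00,
)
print(f"${cost:.2f}")  # $6.84
```

In this example caching cuts the input side from $3.00 to $0.84, while the hidden reasoning tokens quadruple the output side from $1.50 to $6.00.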

Frequently asked questions

How is the cost of an LLM API call calculated?

Token-priced commercial APIs meter two separate quantities: the tokens you send (input/prompt) and the tokens the model generates (output/completion). Each has its own per-million price, and your bill is simply (input tokens ÷ 1,000,000 × input price) + (output tokens ÷ 1,000,000 × output price). On standard pay-as-you-go pricing there is no seat fee or monthly minimum; you pay strictly for the tokens that flow through.

What is a token, in practical terms?

A token is a sub-word chunk the model reads or writes — on average about ¾ of an English word, or roughly 4 characters. So one million tokens is on the order of 750,000 words. To estimate from words, multiply by about 1.33; from characters, divide by about 4. These are heuristics; the exact count depends on the tokenizer and language. The token estimator does the conversion for you.

Why are output tokens more expensive than input tokens?

Output is the compute-heavy half of inference. Input tokens are processed in a single parallel prefill pass, while output tokens are produced one at a time, each requiring a full forward pass through the network. Providers therefore typically price output 3–5× higher than input, which is why an output-heavy workload costs far more per request than an input-heavy one of the same total size.

What is the blended price per million tokens?

The blended price is the single average rate you actually pay across your specific input/output mix: (input tokens × input price + output tokens × output price) ÷ (input tokens + output tokens). It already weights each side by how much of it you use, so it is the most honest single number for comparing one workload across providers.

Do these figures include caching or batch discounts?

No. The core token math is list-price and undiscounted. Prompt caching can cut the input portion when you reuse a large fixed prompt, and offline batch processing often earns a flat discount. Model those separately with the cached & batch discount calculator.

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).