Batching, caching & throughput: cutting $/token

Francesco Zinghinì | Updated Jun 25, 2026 | 8 min read

The sticker price per million tokens is rarely what you actually pay. Three levers — batch processing, prompt caching, and raw throughput — can move your effective cost per token by 2–10× without changing the model you use. This guide explains the math behind each, how they stack, and which calculators on this site let you plug in your own numbers.

Why "effective $/token" is the only number that matters

Every cost decision on this site reduces to one figure: the effective dollars you spend per million tokens for a given workload. On the API side that starts from the blended list price — input and output prices weighted by your token mix. On the self-hosting side it comes from a GPU's hourly rate divided by the tokens it produces. Both numbers are before optimization. Batching, caching, and throughput are the three ways to drive the effective figure below the sticker figure.
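As a rough illustration of the blended list price, the sketch below weights input and output prices by token volume. The function name and the sample numbers are illustrative assumptions, not any provider's API or price list.

```python
def blended_price_per_million(input_tokens, output_tokens,
                              input_price, output_price):
    """Token-weighted list price in dollars per 1M tokens.

    input_price and output_price are list prices in $ per 1M tokens.
    """
    total_tokens = input_tokens + output_tokens
    if total_tokens <= 0:
        raise ValueError("token counts must sum to a positive number")
    # Each side contributes in proportion to its share of the token mix.
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / total_tokens


# Hypothetical mix: 200M input and 40M output tokens at $3.00 / $15.00 per 1M.
price = blended_price_per_million(200_000_000, 40_000_000, 3.00, 15.00)
print(f"${price:.2f} per 1M")  # prints "$5.00 per 1M"
```

Because output is usually priced several times higher than input, a small shift in the output share moves the blended figure more than the same shift in input.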

The reason they are worth their own guide is that they don't show up in a naive estimate. If you multiply your monthly tokens by a provider's headline price, you will overstate API spend, sometimes by a factor of two or more, and you will badly understate what a well-tuned self-hosted GPU can do. Getting the effective number right is what makes the API vs self-hosting comparison honest.

Batch processing: about half off, if you can wait

Most large API providers offer an asynchronous batch tier. You submit a job, the provider runs it when it has spare capacity, and you collect results later — often within minutes, with a ceiling commonly around 24 hours. In exchange, the per-token price is discounted, typically on the order of 50% for both input and output.

The math is the easy part. If your blended list price is $3.00 per 1M tokens, a 50% batch discount makes the effective price $1.50 per 1M. Across 100M tokens a month that is the difference between $300 and $150 — pure margin for jobs that were never time-sensitive in the first place.

Great fits: bulk classification and tagging, embeddings generation, offline evaluation runs, document or transcript processing, synthetic data generation, nightly report summarization.
Poor fits: anything a human is waiting on — chat, autocomplete, interactive agents.

Batch discount math.
Effective price = list price × (1 − discount)
At a 50% discount: $3.00/1M → $1.50/1M
Monthly saving = monthly tokens ÷ 1M × (list − effective)
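The same two formulas can be written as a minimal Python sketch. The function names are made up for illustration, and the 50% discount is an input to check against your provider rather than a guaranteed rate.

```python
def batch_effective_price(list_price, discount):
    """Effective $ per 1M tokens after a batch discount (0.5 means 50% off)."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return list_price * (1 - discount)


def monthly_saving(monthly_tokens, list_price, discount):
    """Dollars saved per month by moving the whole volume to the batch tier."""
    effective = batch_effective_price(list_price, discount)
    return monthly_tokens / 1_000_000 * (list_price - effective)


print(batch_effective_price(3.00, 0.5))        # prints 1.5
print(monthly_saving(100_000_000, 3.00, 0.5))  # prints 150.0
```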

The practical catch is operational, not arithmetic: you have to architect the workload to tolerate delay, handle partial failures, and reconcile results out of band. That is engineering time, but it is one-time, and for a steady offline pipeline the payback is immediate.

Prompt caching: pay full price once, a fraction thereafter

Prompt caching attacks a different part of the bill. Many workloads send the same large prefix on every request — a long system prompt, a tool schema, a few-shot block, a fixed document. Without caching you pay full input price for those tokens every single call. With caching, the provider stores the processed prefix and bills it at a steep discount on subsequent hits, commonly 10–25% of the normal input rate, with the first write sometimes carrying a small premium.

The savings depend entirely on your cache hit ratio: the share of input tokens that are served from a cached prefix rather than billed as fresh input. Consider a request with a 9,000-token fixed system prefix and a 1,000-token unique user turn:

Uncached input cost ∝ 10,000 tokens at 100% of input price.
Cached: 9,000 tokens at ~10% + 1,000 tokens at 100% ≈ the cost of 1,900 tokens — roughly an 81% cut on input for that request.
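The arithmetic above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: every request hits a live cache, the first-write premium is ignored, and the function name is hypothetical.

```python
def cached_input_fraction(prefix_tokens, tail_tokens, cached_rate):
    """Input cost on a cache hit, as a fraction of the uncached input cost.

    cached_rate is the cached-token price relative to the normal input
    price (0.10 means cached tokens bill at 10%).
    Assumes a cache hit and ignores any cache-write premium.
    """
    total = prefix_tokens + tail_tokens
    if total <= 0:
        raise ValueError("prompt must contain at least one token")
    # Cached prefix bills at the reduced rate; the unique tail bills in full.
    billed_equivalent = prefix_tokens * cached_rate + tail_tokens
    return billed_equivalent / total


fraction = cached_input_fraction(9_000, 1_000, 0.10)
print(f"{fraction:.0%} of uncached input cost")  # prints "19% of uncached input cost"
print(f"{1 - fraction:.0%} saved on input")      # prints "81% saved on input"
```

With real traffic some requests miss the cache, so the realized saving sits somewhere between this figure and zero.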

Two things to keep straight: output tokens are never cached (generation is always billed in full), and caches expire, so sparse or bursty traffic gets fewer hits than a steady stream. The cached & batch discount calculator lets you enter your cache hit ratio and discount rate and see the blended effect, including stacking with batch.

Throughput: the self-hosting version of a discount

Self-hosting has no list price to discount — it has a GPU that bills by the hour regardless of how busy it is. There your "discount" is throughput. The relationship is exact:

Self-hosted $/1M from throughput.
$/1M = ($/hour ÷ 3,600) ÷ (tok/s × utilization) × 1,000,000
Example: $1.50/hr at 2,000 tok/s and 100% utilization ≈ $0.21 per 1M.
At 25% utilization the same GPU costs ≈ $0.83 per 1M — 4× more.
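A minimal Python version of the formula, for plugging in your own measurements. The function name is an illustrative assumption, and the tok/s value should be a sustained aggregate you have measured yourself, not a vendor figure.

```python
def self_hosted_price_per_million(hourly_rate, tokens_per_second, utilization=1.0):
    """$ per 1M tokens for a GPU billed by the hour.

    tokens_per_second is aggregate throughput while the GPU is busy;
    utilization is the fraction of billed time spent generating (0 to 1].
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    dollars_per_second = hourly_rate / 3600
    effective_tokens_per_second = tokens_per_second * utilization
    return dollars_per_second / effective_tokens_per_second * 1_000_000


print(f"{self_hosted_price_per_million(1.50, 2000, 1.00):.2f}")  # prints 0.21
print(f"{self_hosted_price_per_million(1.50, 2000, 0.25):.2f}")  # prints 0.83
```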

Two variables move this: how many tokens per second the serving stack produces, and what fraction of the clock it spends producing them. Server-side continuous batching is what makes tok/s large — by running many requests through the same forward pass, modern inference frameworks reach throughput an order of magnitude above single-stream decode. A GPU that decodes 60 tok/s for one user might sustain 2,000+ tok/s aggregate under batched load.

Utilization is the other half. A GPU that only generates half the time produces half the tokens for the same bill, so its cost per token doubles. This is why the cheapest self-hosting setups are steady, high-volume pipelines, not bursty ones. Plug a measured tok/s and utilization into the throughput cost calculator to see the $/1M, and use the GPU cloud cost calculator to translate an hourly rate into a monthly bill.

Putting the levers together

The three levers act on different inputs, so they compose rather than compete:

Batch discounts the whole API job — best when latency is flexible.
Caching discounts the repeated input — best when prompts share a large fixed prefix.
Throughput lowers self-hosted $/1M — best when you can keep the GPU saturated.

A realistic API pipeline can stack the first two: a cached prefix billed at 10% of input, then submitted as a batch at 50% off, lands that prefix near 5% of its original price. A realistic self-hosting plan leans on the third: invest in batching and steady load until the $/1M falls below the (already discounted) API price.
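Both halves of that paragraph reduce to one-line calculations, sketched here in Python. The function names are hypothetical, the first function assumes your provider allows cache and batch discounts to stack, and the second simply inverts the throughput formula from the previous section.

```python
def stacked_prefix_fraction(cached_rate, batch_discount):
    """Price of a cached prefix run through batch, relative to list input price.

    Assumes the provider applies both discounts to the same tokens.
    """
    return cached_rate * (1 - batch_discount)


def break_even_tokens_per_second(hourly_rate, api_price_per_million, utilization):
    """Aggregate tok/s a GPU must sustain to match a given API $ per 1M."""
    if api_price_per_million <= 0 or not 0 < utilization <= 1:
        raise ValueError("need a positive API price and utilization in (0, 1]")
    dollars_per_second = hourly_rate / 3600
    return dollars_per_second * 1_000_000 / (api_price_per_million * utilization)


print(f"{stacked_prefix_fraction(0.10, 0.50):.0%}")  # prints 5%

# Illustrative: a $1.50/hr GPU at 50% utilization versus a $1.50 per 1M API price.
print(f"{break_even_tokens_per_second(1.50, 1.50, 0.50):.0f} tok/s")  # prints "556 tok/s"
```

Anything above the break-even throughput makes the GPU cheaper per token; anything below it favors the API.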

That is exactly the comparison the rest of the site is built around. Measure your token mix and cache hit ratio, apply the discounts in the discount calculator, derive a self-hosted $/1M in the throughput cost calculator, and only then compare the two — because the honest comparison is always effective price versus effective price, never sticker versus sticker.

A worked end-to-end example

Suppose an offline document-processing pipeline runs 200M input tokens and 40M output tokens a month. The model lists at $3.00 per 1M input and $15.00 per 1M output, so the naive blended price is about $5.00 per 1M and the naive bill is roughly $1,200 a month. Now apply the levers in order.

Caching. Each request carries an identical 8,000-token instruction-and-schema prefix, and the unique payload averages 2,000 tokens. That makes 80% of input cacheable. Billed at 10% of input rate, the cached prefix costs a tenth of what it did, cutting input spend by roughly 72% — input drops from about $600 to about $170.
Batch. The job is overnight, so the whole thing runs through the batch tier at 50% off. The (already cached) input halves again, and the full-price output is halved for the first time.

The output side — $600 at list — has no cache discount but still gets the 50% batch cut to about $300. The input side, already near $170 after caching, halves to about $85. The monthly bill lands near $385 instead of $1,200 — roughly a 3× reduction, achieved without touching the model or the prompts' content. That is the kind of swing that decides whether the API or a GPU is the cheaper home for a workload.
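To rerun this example with your own numbers, the calculation can be sketched as a single Python function. It is a simplified model with a hypothetical name: it assumes every request hits the cache, ignores any cache-write premium, and assumes the batch discount applies on top of cached pricing.

```python
def monthly_bill(input_tokens, output_tokens, input_price, output_price,
                 cacheable_share=0.0, cached_rate=1.0, batch_discount=0.0):
    """Monthly API bill in dollars; prices are $ per 1M tokens.

    cacheable_share: fraction of input tokens in the repeated prefix.
    cached_rate: cached-token price relative to normal input price.
    batch_discount: fraction taken off the whole job (0.5 means 50% off).
    """
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    # Caching only touches the repeated share of the input.
    input_cost *= cacheable_share * cached_rate + (1 - cacheable_share)
    # Batch applies to everything, input and output alike.
    return (input_cost + output_cost) * (1 - batch_discount)


naive = monthly_bill(200_000_000, 40_000_000, 3.00, 15.00)
tuned = monthly_bill(200_000_000, 40_000_000, 3.00, 15.00,
                     cacheable_share=0.8, cached_rate=0.10, batch_discount=0.5)
print(f"naive ${naive:.0f}, tuned ${tuned:.0f}")  # prints "naive $1200, tuned $384"
```

The unrounded result is $384; the $385 in the text comes from rounding the intermediate input figures to $170 and $85.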

Common mistakes that inflate the effective price

Pricing at list when the workload is offline. If a job can tolerate delay and you aren't batching it, you are leaving the headline discount on the table — and overstating the case for self-hosting.
Assuming a cache hit ratio you don't measure. Bursty traffic lets caches expire between requests; the realized hit ratio can be far below the theoretical one. Measure it before you bank the savings.
Borrowing someone else's tok/s. Self-hosted throughput varies by an order of magnitude across model, GPU, quantization, and framework. A number from a blog post will mislead your $/1M; benchmark your own stack under realistic batched load.
Forgetting utilization. A high tok/s at 25% utilization still produces a high $/1M, because three-quarters of the GPU's billed seconds generate nothing. Throughput and utilization must be modeled together.

Tie each lever back to a tool: estimate API discounts in the cached & batch discount calculator, convert throughput to $/1M in the throughput cost calculator, and turn an hourly rate into a monthly bill with the GPU cloud cost calculator. The discipline is always the same: optimize each effective price first, compare second.

Frequently asked questions

How big is the batch API discount, really?

For the major providers, asynchronous batch processing typically lists at roughly 50% off both the input and output per-token price, in exchange for a relaxed turnaround (often up to 24 hours). It applies to the whole job, so a workload that would cost $3.00 per 1M blended drops to about $1.50 per 1M with no change to the model or the prompts. The pipeline does have to submit work asynchronously, tolerate the delay, and collect results later. Confirm the current discount and turnaround with your provider before you bank on it.

What does prompt caching actually charge for?

Caching reprices the repeated portion of your input. The first time a prefix is sent it is written to cache (sometimes at a small premium); on later requests that cached prefix is billed at a fraction of the normal input rate — commonly in the 10–25% range depending on provider and cache duration. Output tokens are never cached, and the uncached tail of the prompt is billed normally. The win scales with how much of your prompt is identical across requests.

Why does throughput change the self-hosting cost per token?

A rented GPU bills by the clock, not by the token. If you double the tokens it generates per second, you spread the same hourly cost over twice as many tokens and the cost per token halves. That is why server-side batching matters so much for self-hosting: it is the lever that turns a fixed hourly bill into a low $/1M. The throughput cost calculator turns tok/s into $/1M directly.

Can I combine batch and cache discounts?

Often yes, and they stack multiplicatively on the portions they each touch. A cached prefix billed at 10% of input, then run through a batch job at 50% off, lands near 5% of the original input price for that prefix. But discounts apply to different parts of the bill — caching to repeated input, batching to the whole job — so model them against your actual token mix rather than assuming a flat headline number.

Does batching hurt latency?

Asynchronous batch APIs trade latency for price: results can take minutes to hours, so they suit offline jobs (evaluations, bulk classification, embeddings, document processing), not interactive chat. Server-side batching for self-hosting is different — it raises throughput while keeping per-request latency acceptable, because requests in flight at the same time share a forward pass. Know which kind of batching you mean.

Sources & pricing references

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).