LLM API Cost & Self-Hosting TCO Comparator

🛠️ Free tool | Updated Jun 25, 2026 | By Francesco Zinghinì

Answer the only question that matters for inference spend: is it cheaper to call the API or self-host an open-weight model? Enter your monthly token volume and a GPU, and get the monthly cost both ways, the break-even volume, the cost-vs-volume curve, and 12/24/36-month TCO. Numbers update as you type. Prices are as of Jun 25, 2026 (see the sources below); every price is editable.

Monthly workload

Input tokens / month

Output tokens / month

Load profile

Commercial API

Model / provider

Input price ($/1M)

Output price ($/1M)

Self-hosting

Open-weight model

GPU / instance

GPU cost ($/hour)

Utilization (%)

Ops overhead ($/mo)

Currency

API is cheaper at your current volume. API ≈ $60.00/mo vs self-hosting ≈ $348.21/mo. Self-hosting breaks even at 69.6M tokens/month.

API / month: $60.00

Self-hosting / month: $348.21

Break-even volume: 69.6M tokens/month

Blended API price: $5.00 / 1M

[Chart: cost vs. monthly volume. Series: API (grows with volume) and self-hosting (fixed); the break-even point is marked.]

Total cost of ownership
Horizon	API	Self-hosting	Difference (self-hosting minus API)
12 months	$720	$4,179	+$3,459
24 months	$1,440	$8,357	+$6,917
36 months	$2,160	$12,536	+$10,376

TCO assumes the rented GPU at the chosen utilization; owned-hardware amortization is covered by the GPU TCO calculator.
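The table above appears to be the monthly figures multiplied by the horizon. A minimal sketch under that assumption (flat monthly costs, no discounting or price changes); the function name `tco_rows` is hypothetical and not part of the tool:

```python
def tco_rows(api_monthly, self_host_monthly, horizons=(12, 24, 36)):
    """Return (months, api_total, self_host_total, difference) per horizon.

    Assumes both monthly costs stay constant over the whole horizon.
    """
    rows = []
    for months in horizons:
        api_total = api_monthly * months
        self_total = self_host_monthly * months
        rows.append((months, api_total, self_total, self_total - api_total))
    return rows


# Default scenario from this page: $60.00/mo API vs $348.21/mo self-hosting.
for months, api_total, self_total, diff in tco_rows(60.00, 348.21):
    print(f"{months} months\t${api_total:,.0f}\t${self_total:,.0f}\t{diff:+,.0f}")
```

With the defaults, the rounded totals should match the rows of the table.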

Cheaper isn't the whole story. The verdict is about cost only. Self-hosting also affects latency (can be faster or slower), compliance & data residency, privacy, reliability/uptime, and engineering time. Weigh those alongside the dollar figures.

How the comparison works

Two cost structures meet here. The API is pure marginal cost: you pay per token, so cost is a straight line through the origin. Self-hosting is mostly fixed cost: the GPU bills by the hour whether or not you use it, so its line is roughly flat until you saturate capacity. They cross at the break-even volume.
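To make the two shapes concrete, here is a rough sketch that samples both cost lines at a few volumes. It assumes the default figures shown on this page ($5.00 blended API price per 1M tokens, $348.21/month for self-hosting), holds the input/output mix constant, and ignores what happens once the GPU saturates; the function names are illustrative only:

```python
BLENDED_PRICE_PER_MILLION = 5.00  # $ per 1M tokens, default scenario
SELF_HOST_FIXED = 348.21          # $ per month, default scenario


def api_cost(total_tokens):
    # Straight line through the origin: cost scales with every token.
    return total_tokens / 1e6 * BLENDED_PRICE_PER_MILLION


def self_host_cost(total_tokens):
    # Flat line: the same bill regardless of volume (saturation not modeled).
    return SELF_HOST_FIXED


def cheaper(total_tokens):
    return "API" if api_cost(total_tokens) < self_host_cost(total_tokens) else "self-hosting"


for volume in (12e6, 50e6, 100e6):
    print(f"{volume / 1e6:.0f}M tokens: API ${api_cost(volume):.2f}, "
          f"self-hosting ${self_host_cost(volume):.2f} -> {cheaper(volume)}")
```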

Formulas.
API/month = (input ÷ 1M × price_in) + (output ÷ 1M × price_out)
Self-host/month = $/hour × 730 × utilization + overhead
Blended price/1M = (input × price_in + output × price_out) ÷ (input + output)
Break-even tokens = Self-host monthly ÷ blended price/1M × 1,000,000
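The four formulas translate directly into code. The sketch below is one possible transcription, not the tool's actual implementation; the `Scenario` class and function names are assumptions made for illustration, and utilization is expressed as a fraction (0.30 for 30%):

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # the constant used in the formulas above


@dataclass
class Scenario:
    input_tokens: float    # tokens per month
    output_tokens: float   # tokens per month
    price_in: float        # $ per 1M input tokens
    price_out: float       # $ per 1M output tokens
    gpu_hourly: float      # $ per GPU hour
    utilization: float     # fraction between 0 and 1
    overhead: float = 0.0  # ops overhead, $ per month


def api_monthly(s):
    return s.input_tokens / 1e6 * s.price_in + s.output_tokens / 1e6 * s.price_out


def self_host_monthly(s):
    return s.gpu_hourly * HOURS_PER_MONTH * s.utilization + s.overhead


def blended_price_per_million(s):
    # Volume-weighted average price; undefined when total volume is zero.
    total = s.input_tokens + s.output_tokens
    return (s.input_tokens * s.price_in + s.output_tokens * s.price_out) / total


def break_even_tokens(s):
    return self_host_monthly(s) / blended_price_per_million(s) * 1_000_000


defaults = Scenario(10e6, 2e6, 3.0, 15.0, gpu_hourly=1.59, utilization=0.30)
print(f"API: ${api_monthly(defaults):.2f}/mo")
print(f"Self-host: ${self_host_monthly(defaults):.2f}/mo")
print(f"Break-even: {break_even_tokens(defaults) / 1e6:.1f}M tokens/month")
```

Note that the break-even figure depends on the input/output mix through the blended price, so it only holds if the mix stays the same as volume grows.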

A worked example

Using the defaults — 10M input + 2M output tokens/month, Claude Sonnet-class (Anthropic) at $3/$15 per 1M, versus an NVIDIA A100 80GB at $1.59/hour and 30% utilization:

API: (10M ÷ 1M × $3) + (2M ÷ 1M × $15) = $60.00/mo
Self-host: $1.59 × 730 × 0.30 + $0 overhead = $348.21/mo
Blended API price: $5.00 per 1M tokens
Break-even: $348.21 ÷ $5.00 × 1M ≈ 69.6M tokens/month

So at this volume the API is far cheaper; self-hosting only pays off past roughly 69.6M tokens/month — and only if you can keep the GPU that busy. Change any input above to model your own case, then copy the shareable link to send a specific scenario to a teammate.

Frequently asked questions

At what volume does self-hosting an LLM become cheaper than the API?

Self-hosting has a roughly fixed monthly cost (the GPU runs whether you use it or not), while API cost grows with every token. The crossover is the break-even volume: self-hosting monthly cost ÷ blended API price per token. In the default scenario that's about 69.6M tokens/month. Below it the API wins; above it self-hosting wins on raw cost.

Is self-hosting really cheaper than paying for the API?

Only at high, steady volume. At 10M input + 2M output tokens/month, the API costs about $60.00/mo versus $348.21/mo to rent the GPU at 30% utilization — the API is far cheaper. Self-hosting only pays off once you can keep the GPU busy enough to spread its fixed cost across many tokens.

What costs does the self-hosting estimate include?

The rented-GPU estimate is hourly rate × hours in the month × utilization, plus any operational overhead you enter (DevOps time, monitoring, redundancy). It does not include hidden costs like idle capacity, reliability engineering, or latency trade-offs — those are caveats in the verdict, not dollar figures. See the methodology.

Why does the output token price matter so much?

Output tokens are usually 3–5× more expensive than input tokens, and generation is the slow part. A workload that is output-heavy (long completions) costs far more per request than an input-heavy one (long context, short answer), which also shifts the break-even point.
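A quick way to see the effect is to compare two workloads with the same total volume but opposite mixes. The sketch below uses the default $3/$15 prices and the default $348.21/month self-hosting cost from this page; the output-heavy mix (2M in, 10M out) is a hypothetical scenario chosen for contrast, and the function names are illustrative:

```python
def blended_price(input_tokens, output_tokens, price_in=3.0, price_out=15.0):
    """Volume-weighted $ per 1M tokens for a given input/output mix."""
    total = input_tokens + output_tokens
    return (input_tokens * price_in + output_tokens * price_out) / total


def break_even(self_host_monthly, blended):
    """Monthly token volume at which API spend equals the self-hosting cost."""
    return self_host_monthly / blended * 1_000_000


mixes = {
    "input-heavy (10M in, 2M out)": (10e6, 2e6),
    "output-heavy (2M in, 10M out)": (2e6, 10e6),  # hypothetical mix
}

for label, (tokens_in, tokens_out) in mixes.items():
    price = blended_price(tokens_in, tokens_out)
    volume = break_even(348.21, price)
    print(f"{label}: ${price:.2f}/1M blended, break-even {volume / 1e6:.1f}M tokens/month")
```

Under these assumptions the output-heavy mix has a much higher blended price, so the same self-hosting bill is recovered at a lower monthly volume.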

How current are these prices?

The bundled defaults are publicly listed prices verified on Jun 25, 2026, each linked to its source. They are convenience defaults only — every price is an editable input, so the calculator stays correct even if a default goes stale. Always confirm current pricing with the provider.

Sources & pricing references

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).