API vs self-hosting: how break-even actually works

Francesco Zinghinì · Updated Jun 25, 2026 · 9 min

"Should we just self-host?" is the question every team asks once the API invoice gets large enough to notice. The honest answer is a single number — the break-even volume — and a single lever that moves it more than anything else: utilization. This guide explains why the API and self-hosting have fundamentally different cost shapes, derives the crossover from first principles, reproduces it with a concrete worked example, and shows why a GPU you do not keep busy is the fastest way to lose the comparison.

Two different cost shapes

The reason this comparison is interesting at all is that the two options have opposite cost structures. The API is pure marginal cost: you pay per token and nothing otherwise, so plotted against volume its cost is a straight line through the origin. Use zero tokens, pay zero. Use twice as many, pay twice as much.

Self-hosting is mostly fixed cost. A rented GPU bills by the hour regardless of how many tokens pass through it. Plotted against volume, its cost is roughly flat — you pay the same monthly amount whether the GPU serves a thousand tokens or a billion, right up until you saturate its throughput and have to add a second one. A flat line and a rising line cross at exactly one point. That crossing is the break-even volume.

The formulas.
API / month = (input ÷ 1M × price_in) + (output ÷ 1M × price_out)
Self-host / month = $/hour × 730 × utilization + overhead
Blended price / 1M = (input × price_in + output × price_out) ÷ (input + output)
Break-even tokens / month = self-host monthly ÷ blended price/1M × 1,000,000
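
As a minimal sketch, the four formulas above translate directly into Python. The function and parameter names are illustrative choices, not part of any calculator's API; prices are assumed to be in dollars per 1M tokens.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the formulas above


def api_monthly_cost(input_tokens, output_tokens, price_in, price_out):
    """API cost per month; prices are in dollars per 1M tokens."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def self_host_monthly_cost(rate_per_hour, utilization, overhead=0.0):
    """Self-hosting cost per month, following the formula in the text."""
    return rate_per_hour * HOURS_PER_MONTH * utilization + overhead


def blended_price_per_million(input_tokens, output_tokens, price_in, price_out):
    """Volume-weighted average price per 1M tokens for a given mix."""
    total = input_tokens + output_tokens
    return (input_tokens * price_in + output_tokens * price_out) / total


def break_even_tokens(self_host_monthly, blended_price):
    """Monthly token volume at which the API bill equals the GPU bill."""
    return self_host_monthly / blended_price * 1_000_000
```

Keeping each formula as its own function makes it easy to swap in a different monthly-cost model (for example, owned hardware) without touching the break-even step.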

The API vs self-hosting comparator draws both lines and marks the crossover; the break-even volume calculator isolates the threshold itself.

Where the fixed cost really comes from

Look closely at the self-hosting formula: $/hour × 730 × utilization. There are 730 hours in an average month, so a GPU rented at a flat hourly rate costs that rate × 730 if you keep it allocated all month — which you must, because you cannot conjure a GPU into existence the instant a request arrives and dismiss it the instant it finishes.

So why multiply by utilization? Because utilization is not a discount on what you pay; it is the fraction of the hours you paid for that actually did useful work. The honest way to read the formula is this: you pay for 730 hours, and you spread that fixed bill across only the tokens produced during the busy fraction. Low utilization does not lower your bill; it raises your effective cost per token. A GPU at 30% utilization costs the same dollars as one at 100% utilization, but each useful token carries roughly 3.3 times the cost (1 ÷ 0.30).
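
The "same bill, fewer useful tokens" reading can be sketched as a simple multiplier. This assumes the full monthly bill is paid regardless of load and that tokens are produced only during the busy fraction; `cost_multiplier` is a hypothetical helper name.

```python
def cost_multiplier(utilization):
    """How much more each useful token costs versus a fully busy GPU.

    Assumes the full monthly bill is paid regardless of load and that
    tokens are produced only during the busy fraction of the hours.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization


for u in (1.0, 0.6, 0.3):
    print(f"{u:.0%} busy -> {cost_multiplier(u):.2f}x cost per useful token")
```

At 30% the multiplier comes out at about 3.33, which is the penalty described above.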

A worked example: reproducing ~106.7M tokens/month

Take a rented GPU at $1.50/hour run at 30% utilization with no extra overhead, against an API workload whose blended price is $3.00 per million tokens (a typical mid-tier model on a moderately output-weighted mix).

First, the monthly self-hosting cost:

$1.50/hour × 730 hours × 0.30 utilization = $328.50 / month

Now find the volume at which the API costs that same $328.50. Each million tokens costs $3.00 on the API, so:

Break-even = $328.50 ÷ $3.00 × 1,000,000 = ~109.5M tokens / month

Use a slightly lower self-hosting cost of $320/month (a slightly cheaper GPU or rate) and the threshold lands at the canonical figure:

$320 ÷ $3.00 × 1,000,000 ≈ 106.7M tokens / month

Read it like this: below roughly 106–110 million tokens a month, the pay-per-token API is cheaper, because the GPU's $320–$330 fixed cost is more than you would have spent on tokens. Above it, the API line has climbed past the flat GPU line and self-hosting wins — provided the GPU can actually serve that volume at the 30% utilization you assumed.
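
The arithmetic of the worked example can be checked in a few lines. This is only a reproduction of the numbers in the text; the variable names are illustrative.

```python
HOURS_PER_MONTH = 730

rate_per_hour = 1.50   # $/hour, from the example
utilization = 0.30     # 30%, from the example
blended_price = 3.00   # $ per 1M tokens, from the example

monthly_cost = rate_per_hour * HOURS_PER_MONTH * utilization
break_even = monthly_cost / blended_price * 1_000_000
break_even_at_320 = 320 / blended_price * 1_000_000

print(f"monthly cost:       ${monthly_cost:,.2f}")
print(f"break-even:         {break_even / 1e6:.1f}M tokens/month")
print(f"break-even at $320: {break_even_at_320 / 1e6:.1f}M tokens/month")
```

The three printed values are $328.50, 109.5M and 106.7M, matching the figures above.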

Why utilization moves break-even more than price does

The break-even formula has the GPU's monthly cost in the numerator, and that cost scales directly with utilization. Watch what happens to our example as utilization changes, holding the $1.50/hour rate and $3.00 blended price fixed:

At 30%: monthly cost $328.50 → break-even ≈ 109.5M tokens
At 60%: monthly cost $657.00 → break-even ≈ 219M tokens
At 90%: monthly cost $985.50 → break-even ≈ 328.5M tokens

This looks backwards at first — higher utilization raises the break-even volume? It does, because utilization in this formula is the share of paid hours doing work, and the comparator holds the dollars-paid line flat while you fill more of it. The practical insight is the inverse: the question is never "what is the break-even at my utilization?" but "can I sustain enough utilization to make self-hosting's flat cost cheap per token?" A GPU you keep 90% busy serves roughly three times the tokens of one at 30% for the same monthly dollars — so its effective $/1M is a third as much, and that is what actually beats the API.
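
To make the "effective $/1M" point concrete, here is a rough sketch that holds the monthly bill fixed and varies only how busy the GPU is. The peak throughput of 500 tokens per second is a made-up figure for illustration, not a benchmark of any real GPU or model.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_HOUR = 3600


def effective_price_per_million(rate_per_hour, utilization, peak_tokens_per_sec):
    """Dollars per 1M useful tokens when the GPU is billed for every hour."""
    monthly_bill = rate_per_hour * HOURS_PER_MONTH  # paid whether busy or idle
    tokens_served = (
        peak_tokens_per_sec * SECONDS_PER_HOUR * HOURS_PER_MONTH * utilization
    )
    return monthly_bill / tokens_served * 1_000_000


# peak_tokens_per_sec=500 is a hypothetical throughput, not a measurement.
for u in (0.30, 0.60, 0.90):
    price = effective_price_per_million(1.50, u, peak_tokens_per_sec=500)
    print(f"{u:.0%} utilization -> ${price:.2f} per 1M tokens")
```

Whatever throughput you plug in, the price at 90% is one third of the price at 30%, because the bill is constant and the token count triples.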

Bursty traffic is the silent killer. Real traffic is spiky — busy at midday, dead overnight. You must provision for the peak, but you pay for every idle trough. A workload that averages 30% utilization across the day still needs a GPU sized for its peak, so most of its hours are wasted. This is exactly where the API's pay-per-token model shines: it costs nothing when no one is using it. The utilization break-even calculator tells you the minimum sustained utilization that makes self-hosting cheaper at your volume.
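
A small sketch shows how a spiky day turns into low average utilization. The hourly profile below is invented for illustration (quiet nights, a midday peak at full capacity); substitute your own request logs.

```python
# Hypothetical hourly load for one day, as a fraction of the GPU's peak capacity.
hourly_load = [0.05] * 7 + [0.40] * 3 + [1.00] * 4 + [0.40] * 4 + [0.05] * 6

assert len(hourly_load) == 24  # one value per hour of the day

average_utilization = sum(hourly_load) / len(hourly_load)
idle_share = 1 - average_utilization

print(f"peak load:           {max(hourly_load):.0%}")
print(f"average utilization: {average_utilization:.1%}")
print(f"paid-for but idle:   {idle_share:.1%}")
```

With this profile the GPU must be sized for the 100% peak, yet averages only about 31% over the day, so roughly two thirds of the paid hours do no work.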

Reading the verdict honestly

Putting the pieces together gives a clean decision procedure:

1. Estimate your blended price from your real input/output mix. Output-heavy workloads have higher blended prices, which lowers break-even and favors self-hosting sooner.
2. Estimate the sustained utilization you can realistically hold, not the peak. This is usually the number people are most optimistic and most wrong about.
3. Compute the GPU's monthly cost at that utilization, then divide by the blended price to get the break-even token volume.
4. Compare to your actual monthly volume. Comfortably above break-even with steady, high utilization is the green light; near or below it, the API is both cheaper and simpler.
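
The steps above can be sketched as a single function. The name `verdict`, its parameters, and the example prices are all hypothetical; how much headroom counts as "comfortably above" is a judgment call the function does not make for you.

```python
HOURS_PER_MONTH = 730


def verdict(input_tokens, output_tokens, price_in, price_out,
            rate_per_hour, sustained_utilization, overhead=0.0):
    """Return (break_even_tokens, label) for a monthly workload."""
    volume = input_tokens + output_tokens
    # Blended price per 1M tokens from the real input/output mix.
    blended = (input_tokens * price_in + output_tokens * price_out) / volume
    # Monthly GPU cost at the sustained (not peak) utilization.
    gpu_monthly = rate_per_hour * HOURS_PER_MONTH * sustained_utilization + overhead
    break_even = gpu_monthly / blended * 1_000_000
    # Compare actual volume with the threshold.
    label = "self-host cheaper" if volume > break_even else "API cheaper"
    return break_even, label


# Illustrative workload: 150M input + 50M output tokens at $1 / $9 per 1M.
threshold, label = verdict(150_000_000, 50_000_000, 1.0, 9.0, 1.50, 0.30)
print(f"break-even {threshold / 1e6:.1f}M tokens/month -> {label}")
```

The result is cost-only, so treat the label as an input to the decision rather than the decision itself.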

And then remember what the dollar figure leaves out. Break-even is a cost-only crossover. Crossing it tells you self-hosting is cheaper, not that it is the right call — latency targets, uptime and redundancy, data residency and compliance, and the standing engineering time to operate inference all carry costs the formula never sees. Those hidden costs are large enough to deserve their own treatment. Use the break-even volume calculator to find your threshold, the utilization break-even calculator to pressure-test the assumption that makes or breaks it, and the comparator to see both cost lines on one chart before you commit.

Frequently asked questions

What is the break-even volume between API and self-hosting?

It is the monthly token volume at which a fixed-cost self-hosted GPU costs the same as the pay-per-token API. Below it the API is cheaper; above it self-hosting wins on raw cost. The formula is self-hosting monthly cost ÷ blended API price per token. In the worked example above it lands at roughly 106.7M tokens/month.

Why is utilization the most important number?

A rented GPU bills by the hour whether or not it is doing useful work. If it sits idle 70% of the time, you are still paying for 100% of the hours, so your effective cost per useful token more than triples (1 ÷ 0.30 ≈ 3.3×). Self-hosting only beats the API when you can keep the GPU genuinely busy: high, steady utilization is what spreads the fixed cost across enough tokens to win.

Does hitting break-even mean I should self-host?

No — break-even is a cost-only threshold. Crossing it means self-hosting is cheaper, not necessarily better. Latency, reliability, compliance, data residency, and the engineering time to run inference in production are real costs the dollar figure ignores. Treat break-even as a necessary condition, not a sufficient one.

How does the output token price affect break-even?

Output is typically priced several times higher than input (3–5× is common), so an output-heavy workload has a higher blended price per token. A higher blended price means the API gets expensive faster, which lowers the break-even volume, so self-hosting starts paying off sooner. Input-heavy workloads push break-even higher.
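
A quick sketch of how the input/output mix shifts the threshold. The prices ($1.00 input, $4.00 output per 1M tokens) are hypothetical, chosen only so that output costs 4× input; the $328.50 monthly GPU cost is taken from the worked example.

```python
def blended_price(output_share, price_in, price_out):
    """Blended $ per 1M tokens when `output_share` of all tokens are output."""
    return (1 - output_share) * price_in + output_share * price_out


GPU_MONTHLY = 328.50               # from the worked example in this guide
PRICE_IN, PRICE_OUT = 1.00, 4.00   # hypothetical prices, output 4x input

for share in (0.1, 0.5, 0.9):
    blended = blended_price(share, PRICE_IN, PRICE_OUT)
    break_even_m = GPU_MONTHLY / blended  # in millions of tokens
    print(f"{share:.0%} output -> ${blended:.2f}/1M, break-even {break_even_m:.1f}M")
```

As the output share grows, the blended price rises and the break-even volume falls.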

Is owned hardware different from a rented GPU?

Yes. Renting is a pure monthly operating cost. Owning converts most of the spend into up-front capital that you amortize over the hardware's life, plus power, cooling and hosting. The break-even logic is identical, but you compute the monthly figure differently — use the GPU TCO calculator for amortized hardware.
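
As a rough sketch of the owned-hardware case, the monthly figure can be built from amortized capital plus running costs. This assumes simple straight-line amortization, which is one possible choice and not necessarily what the GPU TCO calculator does; every number below is an illustrative placeholder.

```python
def owned_monthly_cost(purchase_price, lifetime_months,
                       power_cooling_monthly, hosting_monthly):
    """Monthly cost of owned hardware under straight-line amortization."""
    amortized = purchase_price / lifetime_months
    return amortized + power_cooling_monthly + hosting_monthly


# All inputs are illustrative placeholders, not quoted prices.
monthly = owned_monthly_cost(
    purchase_price=24_000, lifetime_months=36,
    power_cooling_monthly=120, hosting_monthly=80,
)
break_even = monthly / 3.00 * 1_000_000  # $3.00 blended price from the example
print(f"owned monthly: ${monthly:,.2f}, break-even {break_even / 1e6:.1f}M tokens")
```

Once the monthly figure is known, the break-even step is the same division by the blended price used for a rented GPU.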

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).