When does self-hosting actually pay off?
"Should we self-host?" is rarely a pure cost question, but cost is where it should start. This guide is a six-step decision framework: estimate your volume, price the API, size the GPU, estimate utilization, compute the break-even, and then weigh the things a calculator can't price. The honest answer for most teams, most of the time, is "not yet" — and knowing exactly when that flips is the point.
Step 1 — Estimate your real monthly token volume
Pull a representative month of traffic and add up input and output tokens separately. Two numbers matter beyond the total: the output share (output tokens are typically 3–5× more expensive than input, and slower to generate), and how steady the load is. A workload that runs flat around the clock is a candidate for self-hosting; one that spikes during business hours and goes quiet overnight is not, no matter how large its peak. Use the throughput planner to translate request rate into sustained tokens per second.
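As a rough illustration of the numbers this step asks for, the sketch below computes the total, the output share, the average sustained tokens per second, and a simple peak-to-mean ratio as a steadiness signal. The function names and the peak-to-mean measure are assumptions made for this example, not part of the throughput planner; the 730-hour month matches the figure used elsewhere in this guide.

```python
from statistics import mean

HOURS_PER_MONTH = 730  # same month length the guide uses for GPU pricing


def summarize_volume(input_tokens: int, output_tokens: int) -> dict:
    """Total volume, output share, and average sustained tokens per second."""
    total = input_tokens + output_tokens
    return {
        "total_tokens": total,
        "output_share": output_tokens / total if total else 0.0,
        "avg_tokens_per_sec": total / (HOURS_PER_MONTH * 3600),
    }


def peak_to_mean(hourly_tokens: list[float]) -> float:
    """Rough steadiness measure (an assumption of this sketch):
    1.0 means perfectly flat load, larger values mean spikier load."""
    avg = mean(hourly_tokens)
    return max(hourly_tokens) / avg if avg else float("inf")


# The 80M input + 20M output month used in the worked example later on.
summary = summarize_volume(80_000_000, 20_000_000)
```

A peak-to-mean ratio close to 1 suggests the flat, around-the-clock profile that suits self-hosting; a high ratio suggests the business-hours spike pattern that does not.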
Step 2 — Price the API for that workload
API cost is pure marginal cost: input tokens × input price + output tokens × output price. Apply only the discounts you can realistically capture: batch (~50% off) for offline jobs, and prompt caching (cached input billed at ~10–25% of the normal input price) for repeated prefixes. The result is your monthly API spend, and dividing it by your total tokens gives the blended price per 1M tokens. That spend is the number self-hosting must beat. Be conservative here: overstating the API bill is how teams talk themselves into a GPU they don't need.
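The pricing formula can be sketched as a small function. The per-token prices below ($2.00 input, $7.00 output per 1M) are hypothetical values chosen so the blend works out to $3.00 per 1M for an 80M + 20M workload; they are not any provider's listed prices. How caching and batch discounts combine varies by provider, so the stacking used here is an assumption.

```python
def monthly_api_cost(
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_fraction: float = 0.0,  # share of input tokens served from cache
    cached_ratio: float = 0.25,    # cached input billed at this fraction of input price
    batch_fraction: float = 0.0,   # share of the bill that runs as batch jobs
    batch_discount: float = 0.5,   # ~50% off for batch
) -> float:
    """Monthly API spend in dollars. Assumes caching applies only to input
    tokens and the batch discount applies to the batch share of the bill."""
    input_multiplier = (1 - cached_fraction) + cached_fraction * cached_ratio
    input_cost = input_tokens / 1e6 * input_price_per_m * input_multiplier
    output_cost = output_tokens / 1e6 * output_price_per_m
    return (input_cost + output_cost) * (1 - batch_fraction * batch_discount)


def blended_price_per_m(monthly_cost: float, total_tokens: float) -> float:
    """Blended dollars per 1M tokens across input and output."""
    return monthly_cost / total_tokens * 1e6


# Hypothetical prices, no discounts captured.
api_bill = monthly_api_cost(80e6, 20e6, input_price_per_m=2.00, output_price_per_m=7.00)
blended = blended_price_per_m(api_bill, 100e6)
```

Leaving the discount parameters at zero until you have measured your cache hit rate and batch-eligible share keeps the estimate on the conservative side.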
Step 3 — Size the GPU and confirm the model fits
Decide which open-weight model and quantization you would actually serve, then check it fits in VRAM before pricing anything. VRAM (GB) ≈ params(B) × bytes-per-param × 1.2, where fp16 is 2 bytes, int8 is 1, and int4 is 0.5, and the 1.2 factor adds roughly 20% overhead. A 70B model needs roughly 140GB for fp16 weights alone (about 168GB with overhead, so multiple GPUs) but only about 42GB in int4 with overhead, which can change which GPU, and therefore which hourly rate, you're pricing. The VRAM model-fit calculator does this and tells you the GPU count needed.
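A minimal sketch of the VRAM rule of thumb, using the byte sizes and the 1.2 overhead factor from this step. The 80GB card size in the example is a hypothetical choice for illustration, and the GPU count is a simple ceiling division that ignores how a real serving stack splits a model across devices.

```python
import math

# Bytes per parameter for each precision, as given in this step.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def vram_gb(params_b: float, precision: str, overhead: float = 1.2) -> float:
    """Estimated VRAM in GB: params (billions) x bytes per param x overhead."""
    return params_b * BYTES_PER_PARAM[precision] * overhead


def gpus_needed(params_b: float, precision: str, gpu_vram_gb: float) -> int:
    """Smallest number of identical GPUs whose combined VRAM covers the estimate."""
    return math.ceil(vram_gb(params_b, precision) / gpu_vram_gb)


# 70B model on hypothetical 80GB cards.
fp16_gb = vram_gb(70, "fp16")   # with overhead
int4_gb = vram_gb(70, "int4")   # with overhead
fp16_gpus = gpus_needed(70, "fp16", gpu_vram_gb=80)
int4_gpus = gpus_needed(70, "int4", gpu_vram_gb=80)
```

Multiplying the GPU count by the hourly rate gives the $/hour figure that the later steps price.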
Self-host/month = $/hour × 730 × utilization + overhead
Break-even tokens = self-host monthly ÷ blended $/1M × 1,000,000
Worked: $320/mo ÷ $3.00 per 1M × 1M ≈ 106.7M tokens/month
Step 4 — Estimate realistic utilization
This is the step where optimism kills the business case. Utilization is the fraction of each hour the GPU spends actually generating tokens. Interactive, business-hours, or bursty traffic rarely sustains more than 20–40% without aggressive queuing; only steady offline pipelines approach the high numbers. Because the GPU bills by the clock, halving utilization doubles your cost per token. Model your honest figure, not your hoped-for one.
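To see why halving utilization doubles cost per token, the sketch below prices an always-on GPU that is billed for every hour of the month regardless of load. The peak throughput of 1,000 tokens per second is a hypothetical figure; substitute a measured number for your model and hardware.

```python
HOURS_PER_MONTH = 730


def cost_per_million_tokens(
    gpu_hourly: float,
    utilization: float,
    peak_tokens_per_sec: float,
) -> float:
    """Dollars per 1M tokens for a GPU billed around the clock.

    The bill is fixed; utilization only changes how many tokens it buys.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    monthly_cost = gpu_hourly * HOURS_PER_MONTH
    tokens_generated = peak_tokens_per_sec * utilization * HOURS_PER_MONTH * 3600
    return monthly_cost / tokens_generated * 1_000_000


# Hypothetical 1,000 tokens/sec peak throughput on a $1.50/hour GPU.
at_40_percent = cost_per_million_tokens(1.50, 0.40, peak_tokens_per_sec=1000)
at_20_percent = cost_per_million_tokens(1.50, 0.20, peak_tokens_per_sec=1000)
```

Comparing the two results shows the effective price per token scaling inversely with utilization, which is why the honest figure matters more than the hourly rate.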
Step 5 — Compute the cost break-even
Now combine the pieces. Self-hosting monthly cost is GPU $/hour × 730 × utilization + overhead; divide it by the blended API price per 1M tokens to get the break-even volume. There is a complementary view, the utilization break-even: the utilization at which a fixed API bill equals the GPU rental — API monthly ÷ ($/hour × 730). At $1.50/hour, a $750/month API bill is matched at about 68.5% utilization. If you can't realistically hit that, the API wins.
- If your volume is well above break-even and utilization is high → self-hosting likely wins on cost.
- If volume is below break-even, or utilization is low → the API almost certainly wins on cost.
- If you're near the line → the non-cost factors in step 6 should decide it.
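The two break-even views and the decision rules above can be sketched as follows. The formulas are the ones stated in this step; the function names and the 20% band used to define "near the line" are assumptions of this example, and the verdict looks at volume only.

```python
HOURS_PER_MONTH = 730


def self_host_monthly(gpu_hourly: float, utilization: float, overhead: float = 0.0) -> float:
    """Self-host/month = $/hour x 730 x utilization + overhead."""
    return gpu_hourly * HOURS_PER_MONTH * utilization + overhead


def break_even_tokens(self_host_cost: float, blended_per_m: float) -> float:
    """Monthly token volume at which the API bill equals the self-host cost."""
    return self_host_cost / blended_per_m * 1_000_000


def break_even_utilization(api_monthly: float, gpu_hourly: float) -> float:
    """Utilization at which the GPU rental equals a fixed API bill."""
    return api_monthly / (gpu_hourly * HOURS_PER_MONTH)


def cost_verdict(monthly_tokens: float, break_even: float, near_margin: float = 0.2) -> str:
    """Volume-only verdict; near_margin is a hypothetical width for 'near the line'."""
    if monthly_tokens > break_even * (1 + near_margin):
        return "self-hosting likely wins on cost"
    if monthly_tokens < break_even * (1 - near_margin):
        return "API likely wins on cost"
    return "near the line: let non-cost factors decide"


volume_break_even = break_even_tokens(320, 3.00)          # about 106.7M tokens
utilization_break_even = break_even_utilization(750, 1.50)  # about 0.685
```

Passing a nonzero overhead to self_host_monthly raises the break-even volume, which is usually the more realistic direction to err in.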
Run your own numbers in the API vs self-hosting comparator, find the crossover with the break-even volume calculator, and pressure-test the utilization assumption with the utilization break-even tool.
Step 6 — Weigh the non-cost factors honestly
Cost is one axis, and it is the one a calculator can give you cleanly. The others you have to judge:
- Latency. Self-hosting can cut network hops, but an under-provisioned endpoint can be slower than an autoscaling API. It's a design outcome, not a guarantee.
- Compliance & data residency. Sometimes the deciding factor — if data cannot leave a jurisdiction or your infrastructure, self-hosting may be required regardless of cost.
- Privacy. Keeping prompts and completions in-house can matter for sensitive workloads.
- Reliability. A managed API has an SLA and an on-call team; self-hosting hands you the pager, redundancy, and failover.
- Engineering time. Building and operating a serving stack is real, recurring effort. Price it as overhead, and remember it competes with everything else your team could ship.
For most teams the right sequence is: start on the API, capture the easy discounts, watch the volume, and revisit this framework when your steady baseline crosses break-even. Self-hosting pays off when the math and the requirements agree — not a moment before.
A worked example, both directions
Take a team running a steady 80M input + 20M output tokens a month against a model priced near $3.00 per 1M blended. Their API bill is about $300 a month. They're eyeing a single GPU at $1.50/hour. At a realistic 30% utilization that GPU costs $1.50 × 730 × 0.30 ≈ $329 a month, slightly more than the API before any overhead. The verdict: stay on the API. The GPU bill only comes down to the $300 API bill at $300 ÷ ($1.50 × 730) ≈ 27% utilization, and even then they'd be carrying the operational burden for no cost saving.
Now scale the same team to 400M input + 100M output tokens, five times the volume, at the same blended price. The API bill rises to about $1,500 a month, while the GPU, if they can now keep it busy at 80% utilization, costs $1.50 × 730 × 0.80 ≈ $876. Now self-hosting saves roughly $620 a month, and the break-even volume at that GPU cost ($876 ÷ $3.00 per 1M ≈ 292M tokens) is well behind them. The variable that flipped the answer wasn't the price of anything; it was volume and the utilization that volume makes achievable.
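The two scenarios can be reproduced with a short comparison, shown here as a sketch that uses this guide's cost formula with overhead left at zero. The function name and the returned field names are made up for this example.

```python
HOURS_PER_MONTH = 730


def compare(total_tokens: float, blended_per_m: float,
            gpu_hourly: float, utilization: float, overhead: float = 0.0) -> dict:
    """Monthly API bill vs self-host cost for one scenario."""
    api = total_tokens / 1_000_000 * blended_per_m
    gpu = gpu_hourly * HOURS_PER_MONTH * utilization + overhead
    return {
        "api_monthly": api,
        "self_host_monthly": gpu,
        "cheaper": "api" if api <= gpu else "self-host",
        "monthly_difference": abs(api - gpu),
    }


# 80M input + 20M output at 30% utilization.
small = compare(100e6, blended_per_m=3.00, gpu_hourly=1.50, utilization=0.30)

# 400M input + 100M output at 80% utilization.
large = compare(500e6, blended_per_m=3.00, gpu_hourly=1.50, utilization=0.80)

for name, result in (("small", small), ("large", large)):
    print(name, result["cheaper"], round(result["monthly_difference"], 2))
```

Adding a realistic overhead argument to both calls is a quick way to check how much of the large scenario's saving survives once operations are priced in.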
Reading the result without fooling yourself
- Stress-test utilization. Re-run the break-even at half your assumed utilization. If self-hosting only wins under your most optimistic number, treat the case as "not yet."
- Add real overhead. Put a dollar figure on monitoring, redundancy, updates, and on-call, and feed it into the monthly self-hosting cost. A bare hourly rate flatters self-hosting.
- Separate "cheaper" from "better." The calculators answer cost. Latency, compliance, privacy, and reliability are answered by your requirements, and any one of them can override a cost verdict in either direction.
- Revisit on a schedule. Volume grows, prices change, and quantization can shrink the GPU you need. A decision made at 50M tokens deserves a fresh look at 200M.
Self-hosting is neither a trophy nor a trap — it's a fixed-cost bet that pays off precisely when sustained volume lets you spread that cost across enough tokens, and when the non-cost requirements don't get in the way. Work the six steps with your own numbers in the comparator, and let the crossover, not the hype, make the call.
Frequently asked questions
What volume do I need before self-hosting beats the API?
It depends on your GPU cost and the API price, but the mechanism is fixed: self-hosting has a roughly flat monthly cost while the API grows per token, so they cross at a break-even volume. For a $320/month GPU setup against a $3.00 per 1M blended API price, break-even is about 106.7M tokens/month. Below that the API is cheaper; above it self-hosting pulls ahead — but only if you can keep the GPU busy.
Why does utilization decide everything?
Because the GPU bills by the clock whether or not it is generating tokens. A $1.50/hour GPU left running all month costs about $1,095 whether it is busy 100% of the time or 20% of the time; at 20% you just produce one-fifth the tokens for the same bill. You can also flip the question: at $1.50/hour, a $750/month API bill is matched at about 68.5% utilization. Below that the GPU sits idle too often to beat the API.
Is self-hosting ever worth it below break-even?
Yes — when a non-cost factor dominates. Strict data residency or privacy requirements, the need to run a model no API offers, guaranteed capacity, or full control over the stack can justify self-hosting even when it costs more per token. Just make that an explicit, eyes-open decision rather than an accident of optimistic volume math.
Does self-hosting reduce latency?
Not automatically. A co-located GPU can cut network round-trips, but a saturated or under-provisioned self-hosted endpoint can be far slower than a managed API that autoscales. Latency is a design outcome of your serving stack and headroom, not a freebie of self-hosting. Treat it as one of the non-cost factors to test, not assume.
What hidden costs does the break-even math leave out?
The dollar break-even captures the GPU bill plus any overhead you enter. It does not, by itself, price reliability engineering, redundancy for failover, monitoring, model updates, or the on-call burden. Fold a realistic monthly overhead figure into the calculation, and treat engineering time as a real line item — it is often what tips a marginal case back toward the API.
Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).