The hidden costs of self-hosting LLMs

By Francesco Zinghinì · Updated Jun 25, 2026 · 8 min read

The pitch for self-hosting is seductively simple: a capable GPU rents for a couple of dollars an hour, so surely it beats the API once you scale. The flaw is in the words "an hour." The hourly rate is the one cost that is easy to see, and it is almost never the cost that decides the comparison. This guide walks through the costs that hide behind the sticker rate — idle capacity, engineering time, reliability and redundancy — and shows why the gap between $/hour and the real per-token cost is where most self-hosting business cases quietly fall apart.

The sticker rate is the floor, not the cost

When a cloud GPU is advertised at, say, $1.50/hour, that number describes a single thing: the price of one hour of allocation. It says nothing about how many useful tokens you will get out of that hour, how many hours will be wasted, or what it costs to keep the thing running in production. The API, by contrast, quotes you the all-in price of the only thing you care about — a token served. Comparing an API's per-token price to a GPU's per-hour price is comparing a finished number to a raw ingredient.

To make the comparison fair, you have to build the GPU's hourly rate up into a true monthly total and then divide by the tokens you actually serve. Three categories of cost sit between the two:

Idle capacity — hours you pay for but do not use.
Operations — the engineering time to deploy, monitor, and maintain inference.
Reliability — the redundant capacity needed to match an API's uptime.

Idle GPU time: the largest hidden cost

A GPU bills continuously. Whether it is saturated with requests or sitting at a blinking cursor, the meter runs at the same rate. But real traffic is never flat — it peaks at midday and collapses overnight, spikes with a product launch and idles on weekends. You must size for the peak, yet you pay for every trough.
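To make "size for the peak, pay for every trough" concrete, here is a minimal sketch that estimates utilization from a traffic curve. The hourly shape below is invented for illustration, the function name `peak_sized_utilization` is hypothetical, and the model assumes capacity is provisioned to exactly match the busiest hour with no autoscaling.

```python
def peak_sized_utilization(hourly_load):
    """Fraction of paid capacity doing useful work when capacity equals the peak hour."""
    peak = max(hourly_load)
    if peak <= 0:
        raise ValueError("load must contain at least one positive value")
    # Paid capacity is `peak` for every hour; useful work is the load actually served.
    return sum(hourly_load) / (peak * len(hourly_load))


# Hypothetical weekday shape: relative load per hour, midnight through 23:00.
weekday = [1, 1, 1, 1, 1, 2, 3, 5, 7, 9, 10, 10,
           10, 9, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1]

utilization = peak_sized_utilization(weekday)
print(f"Utilization when sized for the peak: {utilization:.0%}")
```

With this made-up curve the result is about 45%, and that is before weekends or any headroom above the peak are counted, both of which would push it lower.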

The result is that most self-hosted inference runs at low utilization — the fraction of paid hours doing useful work. And utilization maps directly onto cost per token:

Effective cost per useful token.
Self-host $/1M = ($/hour ÷ 3600) ÷ (tokens_per_sec × utilization) × 1,000,000

Example: $1.50/hour, 2,000 tok/s, 100% utilization → ≈ $0.21 / 1M
Same GPU at 30% utilization → ≈ $0.69 / 1M (about 3.3× the full-utilization cost)

The arithmetic is unforgiving: the cost the GPU can theoretically hit assumes it never stops working. Drop to 30% utilization, which is entirely normal for bursty traffic, and the effective cost per token more than triples, because you are spreading the same fixed bill across less than a third of the tokens. The throughput cost calculator turns tokens-per-second and hourly rate into the real $/1M, and the utilization break-even calculator tells you the sustained utilization you must hold for the deal to make sense at all.
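The effective-cost formula translates directly into a few lines of Python. This is a sketch of the arithmetic only, not the calculators' actual implementation; the function name `cost_per_million_tokens` is an assumption made for this example.

```python
def cost_per_million_tokens(hourly_rate, tokens_per_sec, utilization):
    """Effective $ per 1M useful tokens for a GPU billed by the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    cost_per_second = hourly_rate / 3600
    useful_tokens_per_second = tokens_per_sec * utilization
    return cost_per_second / useful_tokens_per_second * 1_000_000


full = cost_per_million_tokens(1.50, 2000, 1.0)     # GPU never idles
bursty = cost_per_million_tokens(1.50, 2000, 0.30)  # 30% utilization

print(f"100% utilization: ${full:.2f} / 1M")
print(f" 30% utilization: ${bursty:.2f} / 1M ({bursty / full:.1f}x)")
```

Because utilization sits in the denominator, the penalty is not linear: going from 30% to 15% utilization doubles the cost per token again.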

This is precisely the cost the API does not have. Pay-per-token means an idle hour costs nothing. The API's "expensive" per-token price already bakes in the provider's own utilization — achieved by pooling thousands of customers onto shared hardware, a scale of load-smoothing a single team cannot replicate.
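Setting the self-hosted cost per token equal to the API price and solving for utilization gives the break-even floor. The sketch below counts GPU rental only (no operations or redundancy), and the API price of $0.60 per 1M tokens is a made-up figure for illustration, not a quote from any provider.

```python
def break_even_utilization(hourly_rate, tokens_per_sec, api_price_per_million):
    """Sustained utilization at which self-hosted $/1M equals the API's $/1M.

    A result above 1.0 means the GPU cannot beat the API even if it never idles.
    """
    full_utilization_cost = hourly_rate / 3600 / tokens_per_sec * 1_000_000
    return full_utilization_cost / api_price_per_million


# Hypothetical API price; substitute the real price of the model you are comparing.
floor = break_even_utilization(1.50, 2000, api_price_per_million=0.60)
print(f"Break-even utilization: {floor:.0%}")
```

Under these assumptions the floor is roughly 35%; any sustained utilization below that and the API is cheaper on rental cost alone.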

Engineering time: a standing cost, not a setup fee

Someone has to stand up the inference server, choose and configure a serving framework, wire up autoscaling, set up monitoring and alerting, and then — forever after — keep it healthy. That last part is the cost people forget. Self-hosted inference needs:

Monitoring and alerting on latency, throughput, GPU memory and error rates.
Driver, CUDA, framework and model-version upgrades, each with its own regression risk.
Capacity planning as traffic grows, and scaling events when it spikes.
Incident response and on-call when a node falls over at 3 a.m.

Engineering time is expensive. A senior engineer's fully loaded cost can run well into five figures per month; even a fraction of one person's time devoted to keeping inference alive can exceed the entire GPU bill at modest scale. This is why the GPU TCO calculator includes an operations overhead input — it is not a rounding error, it is frequently the line item that flips the verdict. The API absorbs all of this work into its price; you write zero monitoring dashboards for a token endpoint.
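A quick back-of-the-envelope check shows how easily operations can outweigh the hardware. The engineer cost ($15,000 per month, fully loaded) and the 10% time share below are illustrative assumptions, not figures from this guide; 730 is the usual hours-per-month convention.

```python
HOURS_PER_MONTH = 730


def monthly_ops_cost(engineer_monthly_cost, fraction_of_time):
    """Monthly cost of the share of one engineer spent keeping inference alive."""
    return engineer_monthly_cost * fraction_of_time


gpu_bill = 1.50 * HOURS_PER_MONTH           # one GPU at $1.50/hour, always on
ops_cost = monthly_ops_cost(15_000, 0.10)   # hypothetical: 10% of one engineer

print(f"GPU bill: ${gpu_bill:,.0f}/month")
print(f"Ops cost: ${ops_cost:,.0f}/month")
```

Here a tenth of one person's time (about $1,500) already exceeds the $1,095 monthly bill for the GPU itself.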

Reliability and redundancy: paying twice to match free uptime

A managed API gives you redundancy, failover, and a published SLA at no visible cost — it is part of the per-token price. A single self-hosted GPU gives you none of that. It is one machine, and when it fails, your inference is down.

Matching the resilience you got for free from the API means running spare capacity: a second GPU on warm standby, a load balancer in front, health checks, and a deployment process that can fail over without dropping requests. Redundancy roughly doubles the fixed cost while serving the same token volume — which roughly doubles the effective cost per token. High-availability setups (multiple replicas across zones) multiply it further. None of this serves a single additional token; it only buys you the uptime the API included by default.
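The effect of spare capacity on per-token cost can be sketched as a simple multiplier. The helper name `cost_with_standby` is hypothetical, and the model assumes standby replicas are billed at the same rate as active ones while serving no traffic.

```python
def cost_with_standby(cost_per_million, active_replicas, standby_replicas):
    """Scale $/1M by the ratio of paid replicas to replicas actually serving tokens."""
    paid = active_replicas + standby_replicas
    return cost_per_million * paid / active_replicas


single = 0.694                                   # $/1M for one GPU at 30% utilization
warm_standby = cost_with_standby(single, 1, 1)   # one active, one idle twin
three_zones = cost_with_standby(single, 1, 2)    # one active, two idle across zones

print(f"${warm_standby:.2f} / 1M with a warm standby")
print(f"${three_zones:.2f} / 1M with two standbys")
```

If standby replicas also take live traffic (active-active), the multiplier shrinks, but only to the extent that the extra capacity is actually needed for load rather than held back for failover.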

The compounding trap. These costs stack. Take the theoretical $0.21/1M at full utilization, drop to 30% utilization (≈ $0.69/1M), add a standby replica for reliability (≈ $1.39/1M), then layer in a slice of engineering time as monthly overhead, and the "cheap" self-hosted token has quietly drawn level with, or passed, the API price you were trying to beat. Each factor alone is modest; together they close the gap.

Closing the gap between $/hour and reality

None of this means self-hosting is a bad idea — at genuinely high, steady volume with strong utilization and amortized hardware, it can be dramatically cheaper, and there are non-cost reasons (data residency, latency control, model customization) that justify it regardless. The point is to compare honestly. A defensible self-hosting estimate is built like this:

1. Start from the hourly rate and convert it to a monthly cost over 730 hours.
2. Apply realistic sustained utilization, not the peak. This captures idle time and is usually the dominant correction.
3. Add operations overhead as a monthly figure for the engineering time inference actually consumes.
4. Add redundancy by multiplying fixed capacity to match your uptime target.
5. Divide by tokens actually served to get the true effective $/1M, and only then compare it to the API.
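The steps above can be sketched as a single function. This is one possible reading of the procedure, not the calculators' implementation; the parameter names, the $1,500 monthly overhead, and the 2× redundancy factor are assumptions chosen for illustration.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_HOUR = 3600


def effective_cost_per_million(hourly_rate, tokens_per_sec, utilization,
                               ops_overhead_monthly=0.0, redundancy_factor=1.0):
    """Fully loaded $ per 1M tokens actually served."""
    # Hourly rate -> monthly cost for the serving capacity.
    gpu_monthly = hourly_rate * HOURS_PER_MONTH
    # Redundancy multiplies the fixed capacity you pay for.
    fixed_monthly = gpu_monthly * redundancy_factor
    # Operations overhead is added as a flat monthly figure.
    total_monthly = fixed_monthly + ops_overhead_monthly
    # Only the utilized share of the serving capacity produces tokens;
    # redundant capacity adds cost but no additional tokens.
    tokens_served = (tokens_per_sec * SECONDS_PER_HOUR
                     * HOURS_PER_MONTH * utilization)
    return total_monthly / tokens_served * 1_000_000


estimate = effective_cost_per_million(
    hourly_rate=1.50,
    tokens_per_sec=2000,
    utilization=0.30,
    ops_overhead_monthly=1500,   # hypothetical slice of engineering time
    redundancy_factor=2.0,       # one warm standby
)
print(f"Fully loaded: ${estimate:.2f} / 1M tokens")
```

With these inputs the fully loaded figure comes to about $2.34 per 1M tokens, more than ten times the $0.21 the same GPU reaches on paper. That fully loaded number, not the sticker rate, is the one to set against the API price.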

Do that, and the comparison stops being "cheap GPU versus expensive API" and becomes an honest contest between two fully loaded numbers. Use the GPU TCO calculator to assemble the full monthly cost including overhead, the throughput cost calculator to convert that into a real per-token figure, and the utilization break-even calculator to find the utilization floor below which the API simply wins.

Frequently asked questions

Why is the GPU hourly rate not the real cost of self-hosting?

The hourly rate is what you pay; the real cost is what each useful token costs after idle time, operations, and redundancy are folded in. A GPU billed at $1.50/hour that sits idle 70% of the time and needs a standby twin already costs roughly 6–7× its full-utilization figure per token actually served (about 3.3× for the idle time, doubled for the standby), before any engineering hours are counted. The rate is the floor, not the figure.

What is the biggest hidden cost of self-hosting?

Idle GPU time, by a wide margin. A rented or owned GPU bills continuously, but real traffic is bursty, so most deployments run far below full utilization. At 30% utilization you pay for more than three hours to get one hour of useful work, which more than triples the effective cost per token before any other overhead is counted.

How much engineering time should I budget for self-hosting?

It is a standing cost, not a one-off. Beyond the initial setup, expect ongoing hours for monitoring, version and driver upgrades, capacity planning, incident response, and on-call. Even a fraction of one engineer's loaded monthly cost can exceed the entire GPU bill at modest scale — which is why it belongs in the TCO, entered as monthly overhead.

Does the comparator include these hidden costs?

Partly. Idle time is captured directly through the utilization input, and you can enter a monthly operations overhead figure for DevOps and redundancy. Latency, reliability and compliance trade-offs are flagged as caveats in the verdict rather than turned into dollar figures, because they are too situational to estimate generically.

How do redundancy and reliability change the math?

A single GPU is a single point of failure. Matching the uptime an API gives you for free usually means a second (or third) GPU on standby, plus a load balancer and health checks. Redundancy can double the fixed cost while serving the same token volume — which roughly doubles the effective cost per token.

Sources & pricing references

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).