Self-Hosting Cost Per Token Calculator

Turn a GPU hourly rate and a measured throughput into the number that actually matters: cost per million tokens. The twist most calculators miss is utilization: a GPU that sits idle half the time produces half the tokens for the same bill, so its cost per token doubles. This tool shows the price at your utilization and at 100%, 50%, and 25% for context. Numbers update as you type. The default rate was verified Jun 25, 2026 (see sources), and every field is editable.

Cost / 1M tokens: $0.2083 (≈ $0.21 rounded)

Cost per 1M tokens by utilization (idle time multiplies cost)

Utilization   Effective tok/s   $/1M tokens
100%          2,000             $0.2083
50%           1,000             $0.4167
25%           500               $0.8333

The context rows use your current rate and throughput, varying only utilization, so you can see exactly how much idle time costs.

Utilization is the hidden multiplier. Halve utilization and you double the cost per token; quarter it and you quadruple it. A GPU that looks cheap at 100% can be more expensive per token than the API once real-world idle time is included. Use the utilization break-even tool to find the duty cycle you actually need to beat the API.

How it works

A GPU charges you for wall-clock time, but you care about tokens. To bridge the two we convert the hourly rate into a cost per second and ask how many tokens that second buys. The number of tokens per second is your throughput — but only while the GPU is actually generating. Across a real workload there are gaps: requests arrive unevenly, batches drain, the model waits on input. Utilization captures that, scaling raw throughput down to effective throughput. Divide the per-second cost by effective throughput and scale to a million tokens, and you have a unit price you can compare directly against any API.
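The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source; the function and parameter names are assumptions chosen for readability.

```python
def cost_per_million_tokens(hourly_rate: float,
                            throughput_tok_s: float,
                            utilization: float = 1.0) -> float:
    """Return the cost in dollars per 1M tokens.

    hourly_rate      -- GPU price in $/hour
    throughput_tok_s -- sustained tokens/second while the GPU is generating
    utilization      -- fraction of wall-clock time spent generating, in (0, 1]
    """
    if hourly_rate < 0:
        raise ValueError("hourly_rate must be non-negative")
    if throughput_tok_s <= 0:
        raise ValueError("throughput_tok_s must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    cost_per_second = hourly_rate / 3600                   # $/s, billed whether busy or idle
    effective_throughput = throughput_tok_s * utilization  # tok/s averaged over wall-clock time
    return cost_per_second / effective_throughput * 1_000_000


# The page defaults: $1.50/hour, 2,000 tok/s, 100% utilization.
print(round(cost_per_million_tokens(1.50, 2000, 1.0), 4))  # 0.2083
```

Rejecting a utilization of zero is a deliberate choice in this sketch: a GPU that never generates has no meaningful cost per token, only a bill.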

The crucial, counter-intuitive consequence is that idleness is expensive in a way that is easy to overlook. The hourly bill does not drop when the GPU is idle — it keeps ticking — but the token output does drop. So every idle second is paid-for capacity producing nothing, and its cost gets loaded onto the tokens you did produce. That is why the same card can post a wonderful cost per token in a benchmark (pinned at 100% with perfect batching) and a dreadful one in production (40% utilization with bursty traffic). The benchmark number is a floor you will rarely touch.

Because utilization matters so much, this tool never reports a single number in isolation. It shows your chosen utilization plus a reference grid at 100%, 50%, and 25%, so the penalty for idle time is impossible to miss. Use the 100% row as the theoretical best case, and pick the row closest to your real duty cycle for budgeting. If you do not yet know your utilization, the honest move is to assume it is lower than you hope and plan accordingly.
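A reference grid like the one described here could be generated as follows. This is a sketch under the page's own formulas; the helper name `utilization_grid` and the output layout are illustrative assumptions.

```python
def utilization_grid(hourly_rate, throughput_tok_s, utilizations=(1.0, 0.5, 0.25)):
    """Return (utilization, effective tok/s, $/1M tokens) for each utilization level."""
    rows = []
    for u in utilizations:
        effective = throughput_tok_s * u                    # tok/s after idle time
        cost = hourly_rate / 3600 / effective * 1_000_000   # $/1M tokens
        rows.append((u, effective, cost))
    return rows


# Page defaults: $1.50/hour at 2,000 tok/s.
for u, effective, cost in utilization_grid(1.50, 2000):
    print(f"{u:>5.0%}  {effective:>8,.0f} tok/s  ${cost:.4f}/1M")
```

To budget against a real duty cycle, pass it alongside the reference levels, for example `utilizations=(1.0, 0.5, 0.4, 0.25)`.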

Formulas.
Effective throughput = throughput (tok/s) × utilization
Cost per second = hourly rate ÷ 3,600
Cost per 1M tokens = (hourly ÷ 3,600) ÷ effective throughput × 1,000,000
Halving utilization doubles the cost per token; quartering it quadruples it.
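
Written as a single expression, the three lines above make the inverse relationship with utilization explicit. The symbols are introduced here only for illustration: R is the hourly rate in $/hour, T is raw throughput in tok/s, and u is utilization as a fraction between 0 and 1.

```latex
C_{1\mathrm{M}}(u) = \frac{R}{3600} \cdot \frac{10^{6}}{T\,u},
\qquad
\frac{C_{1\mathrm{M}}(u/2)}{C_{1\mathrm{M}}(u)} = 2,
\qquad
\frac{C_{1\mathrm{M}}(u/4)}{C_{1\mathrm{M}}(u)} = 4 .
```

Because u appears only in the denominator, any fractional drop in utilization raises the cost per token by the reciprocal of that fraction, regardless of R and T.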

A worked example

Using the defaults (a GPU at $1.50/hour producing 2,000 tokens/second at 100% utilization):

  • Cost per second: $1.50 ÷ 3,600 = $0.0004167/s
  • Effective throughput at 100%: 2,000 tok/s
  • Cost per 1M: $0.0004167 ÷ 2,000 × 1,000,000 = $0.2083/1M ≈ $0.21
  • At 50% utilization: effective 1,000 tok/s → $0.4167/1M
  • At 25% utilization: effective 500 tok/s → $0.8333/1M

The jump from $0.2083 to $0.8333 for the same hardware — a 4× swing — comes entirely from utilization, not from any change in price or speed. That single fact reshapes most self-hosting decisions: the question is rarely "is the GPU fast enough?" and almost always "can I keep it busy enough?". Compare your per-token figure against the blended API price in the API vs self-hosting comparator, find the crossover volume with the break-even volume tool, and get the all-in hourly cost for owned hardware from the GPU TCO calculator before feeding it back in here.

To size how many GPUs a target request rate needs, use the throughput planner; to check a model even fits the card, use the VRAM fit checker. Current rates live in the GPU pricing dataset, model sizes in the open-weight model dataset, and full derivations in the methodology.

Frequently asked questions

How do you get a cost per million tokens from a GPU hourly rate?

Convert the hourly rate to a per-second rate ($/hour ÷ 3,600), then divide by the tokens produced per second. Effective throughput is the raw tok/s times utilization, because idle seconds produce no tokens but still cost money. So $/1M = (hourly ÷ 3,600) ÷ (tok/s × utilization) × 1,000,000. At the defaults that is $0.2083 per 1M tokens ≈ $0.21.

Why does low utilization multiply my cost per token?

Because the GPU bills by the clock, not by the token. If it only spends half its seconds generating, it produces half the tokens for the same hourly cost — so the cost per token doubles. At 100% utilization the defaults give $0.2083/1M; at 50% they become $0.4167/1M, and at 25% they balloon to $0.8333/1M. Idle time is the single biggest hidden cost of self-hosting.

What counts as "throughput" here?

Sustained generation throughput: the tokens per second your serving stack actually produces under realistic load, summed across all concurrent requests (batching dramatically increases it). It is not the single-stream decode speed a user perceives. Measure it on your own model, GPU, quantization, and framework. It can vary by an order of magnitude between setups, so a borrowed number will mislead you.
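
One way to compute that aggregate figure from a load test is sketched below. The `CompletedRequest` record and the sample numbers are hypothetical; substitute whatever your serving stack logs.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    output_tokens: int  # tokens generated for this request (hypothetical log field)


def sustained_throughput(requests, window_seconds):
    """Aggregate tok/s: all tokens generated in the window, divided by its length."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return sum(r.output_tokens for r in requests) / window_seconds


# Hypothetical load test: 8 concurrent requests, 500 output tokens each, over 10 s.
batch = [CompletedRequest(output_tokens=500) for _ in range(8)]
print(sustained_throughput(batch, 10.0))  # 400.0 tok/s aggregate
print(batch[0].output_tokens / 10.0)      # 50.0 tok/s as seen by a single stream
```

Take the measurement while the GPU is kept busy. If the window already includes idle gaps, the result is closer to effective throughput, and applying utilization on top would count the idle time twice.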

How does this compare to the API price per token?

Directly. Once you have a self-hosted $/1M figure, set it beside the blended API price for the same workload. If the self-hosted cost per token is higher, the API wins until your volume and utilization improve. The API vs self-hosting comparator does this side by side, and the break-even volume tool finds the crossover.
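
Setting the self-hosted formula equal to an API price and solving for utilization gives a quick break-even estimate. The sketch below follows from the formulas on this page. It is not necessarily how the linked tools compute their results, and the $0.60/1M API price is a made-up illustration.

```python
def break_even_utilization(hourly_rate, throughput_tok_s, api_price_per_million):
    """Utilization at which self-hosted $/1M equals the API's $/1M.

    A result above 1.0 means the GPU cannot match the API price
    even when it is busy 100% of the time.
    """
    if throughput_tok_s <= 0 or api_price_per_million <= 0:
        raise ValueError("throughput and API price must be positive")
    cost_at_full_utilization = hourly_rate / 3600 / throughput_tok_s * 1_000_000
    return cost_at_full_utilization / api_price_per_million


# Page defaults against a hypothetical blended API price of $0.60 per 1M tokens.
print(round(break_even_utilization(1.50, 2000, 0.60), 4))  # 0.3472
```

In this illustrative case the GPU would need to stay busy roughly 35% of the time to match the API on cost per token.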

Does this include electricity or just the rental rate?

It uses whatever hourly figure you enter. For a rented cloud GPU, that rate already bakes in power, so no separate electricity term is needed. For owned hardware, first compute an all-in effective hourly cost (amortization + electricity + overhead, divided by hours) in the GPU TCO calculator, then bring that number here.
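
For a rough sense of that all-in hourly figure, the sketch below applies the "amortization + electricity + overhead, divided by hours" recipe. It assumes straight-line amortization and a machine that draws power around the clock, which may differ from what the GPU TCO calculator does. All names and input values are hypothetical.

```python
def effective_hourly_cost(purchase_price, lifetime_years, power_kw,
                          electricity_per_kwh, overhead_per_year,
                          hours_per_year=8760):
    """All-in $/hour for owned hardware (a simplified, assumed model)."""
    amortization_per_year = purchase_price / lifetime_years  # straight-line, no resale value
    electricity_per_year = power_kw * electricity_per_kwh * hours_per_year
    total_per_year = amortization_per_year + electricity_per_year + overhead_per_year
    return total_per_year / hours_per_year


# Hypothetical inputs: $30,000 server, 3-year life, 0.7 kW draw,
# $0.12/kWh electricity, $2,000/year overhead.
hourly = effective_hourly_cost(30_000, 3, 0.7, 0.12, 2_000)
print(round(hourly, 2))  # 1.45
```

The resulting figure goes into the hourly rate field in place of a rental price.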

Are these rates current?

The bundled GPU rate was verified on Jun 25, 2026 and links to its source. It is a convenience default only — the hourly field is editable, so the calculator stays correct even when a default goes stale. Always confirm current pricing with the provider and benchmark throughput yourself.

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).