Question 1

How do you get a cost per million tokens from a GPU hourly rate?

Accepted Answer

Convert the hourly rate to a per-second rate ($/hour ÷ 3,600), then divide by the tokens actually produced per second. Effective throughput is the raw tok/s multiplied by utilization, because idle seconds produce no tokens but still cost money. So $/1M = (hourly ÷ 3,600) ÷ (tok/s × utilization) × 1,000,000. At the defaults (100% utilization) that is $0.2083 per 1M tokens, or about $0.21.
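The formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function name is made up, and the inputs ($1.50/hour, 2,000 tok/s) are hypothetical values chosen because they reproduce the $0.2083 figure. The calculator's actual defaults are not stated here.

```python
def cost_per_million_tokens(hourly_usd: float,
                            tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens for a GPU billed by the hour."""
    if hourly_usd < 0:
        raise ValueError("hourly_usd must not be negative")
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    per_second_usd = hourly_usd / 3600            # $/hour -> $/second
    effective_tps = tokens_per_sec * utilization  # idle seconds yield no tokens
    return per_second_usd / effective_tps * 1_000_000


# Hypothetical inputs (assumed for illustration, not the calculator's defaults).
print(f"${cost_per_million_tokens(1.50, 2000):.4f} per 1M tokens")
```

Validating the inputs up front keeps a zero throughput or zero utilization from turning into a division-by-zero error further down.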

Question 2

Why does low utilization multiply my cost per token?

Accepted Answer

Because the GPU bills by the clock, not by the token. If it spends only half its seconds generating, it produces half the tokens for the same hourly cost, so the cost per token doubles. At 100% utilization the defaults give $0.2083/1M; at 50% they become $0.4167/1M, and at 25% they balloon to $0.8333/1M. Cost per token scales with 1 ÷ utilization, which makes idle time one of the largest hidden costs of self-hosting.
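The scaling is easy to check numerically. The sketch below divides a full-utilization cost by the utilization fraction. The baseline is derived from hypothetical inputs ($1.50/hour, 2,000 tok/s) that happen to give $0.2083/1M; they are assumptions for illustration, not the calculator's stated defaults.

```python
def cost_at_utilization(cost_at_full_usd: float, utilization: float) -> float:
    """Scale a 100%-utilization cost per 1M tokens to a lower utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_usd / utilization


# Hypothetical baseline: $1.50/hour at 2,000 tok/s, fully utilized.
baseline = 1.50 / 3600 / 2000 * 1_000_000

for u in (1.0, 0.5, 0.25):
    print(f"{u:>4.0%}: ${cost_at_utilization(baseline, u):.4f} per 1M tokens")
```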

Question 3

What counts as "throughput" here?

Accepted Answer

Sustained generation throughput: the tokens per second your serving stack actually produces under realistic load, summed across all concurrent requests (batching dramatically increases it). It is not the single-stream decode speed a user perceives. Measure it on your own model, GPU, quantization, and framework. It can vary by an order of magnitude between setups, so a borrowed number will mislead you.
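To make the distinction concrete, here is a rough sketch that computes both numbers from a log of completed requests. The `CompletedRequest` record and both function names are hypothetical; a real benchmark would take these timings from your serving framework's own metrics.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    start_s: float       # wall-clock time the request started
    end_s: float         # wall-clock time the last token was emitted
    output_tokens: int   # tokens generated for this request


def aggregate_throughput(requests: list[CompletedRequest]) -> float:
    """Total generated tokens divided by the wall-clock window they span."""
    if not requests:
        raise ValueError("need at least one request")
    window = max(r.end_s for r in requests) - min(r.start_s for r in requests)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    return sum(r.output_tokens for r in requests) / window


def mean_single_stream_speed(requests: list[CompletedRequest]) -> float:
    """Average tok/s seen by one request, i.e. what a single user perceives."""
    return sum(r.output_tokens / (r.end_s - r.start_s) for r in requests) / len(requests)


# Four overlapping requests, each producing 500 tokens over the same 10 seconds.
batch = [CompletedRequest(0.0, 10.0, 500) for _ in range(4)]
print(aggregate_throughput(batch))      # tokens/s across the whole batch
print(mean_single_stream_speed(batch))  # tokens/s seen by one request
```

In this toy batch the aggregate figure is four times the per-stream figure, and the aggregate one is what belongs in the cost formula.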

Question 4

How does this compare to the API price per token?

Accepted Answer

Directly. Once you have a self-hosted $/1M figure, set it beside the blended API price for the same workload. If the self-hosted cost per token is higher, the API wins until your volume and utilization improve. The API vs self-hosting comparator shows the two side by side, and the break-even volume tool finds the crossover.
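One way to frame the comparison is to ask what utilization the GPU needs before it matches the API price. The sketch below solves the cost formula for utilization. It is an illustration only, not the method used by the comparator or break-even tools, and the rate, throughput, and API price are all hypothetical.

```python
def break_even_utilization(hourly_usd: float,
                           tokens_per_sec: float,
                           api_usd_per_million: float) -> float:
    """Utilization at which self-hosted $/1M equals the API's $/1M.

    A result above 1.0 means self-hosting cannot match the API price
    even with the GPU busy every second.
    """
    if tokens_per_sec <= 0 or api_usd_per_million <= 0:
        raise ValueError("throughput and API price must be positive")
    cost_at_full = hourly_usd / 3600 / tokens_per_sec * 1_000_000
    return cost_at_full / api_usd_per_million


# Hypothetical: $1.50/hour GPU, 2,000 tok/s, API at $0.60 per 1M tokens.
needed = break_even_utilization(1.50, 2000, 0.60)
print(f"Self-hosting breaks even at {needed:.1%} utilization")
```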

Question 5

Does this include electricity or just the rental rate?

Accepted Answer

It uses whatever hourly figure you enter. For a rented cloud GPU, that rate already includes power, so the result is complete. For owned hardware, first compute an all-in effective hourly cost (amortization + electricity + overhead, divided by the hours over which those costs are spread) in the GPU TCO calculator, then bring that number here.
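For owned hardware, the effective hourly cost described above can be approximated as follows. This is a simplified sketch, not the GPU TCO calculator's method: it assumes straight-line amortization, a machine powered on for every hour of the year, and made-up figures for price, power draw, and overhead.

```python
def effective_hourly_cost(purchase_usd: float,
                          lifetime_years: float,
                          power_kw: float,
                          usd_per_kwh: float,
                          overhead_usd_per_year: float,
                          hours_per_year: float = 8760) -> float:
    """All-in $/hour: (amortization + electricity + overhead) / hours."""
    if lifetime_years <= 0 or hours_per_year <= 0:
        raise ValueError("lifetime and hours must be positive")
    amortization = purchase_usd / lifetime_years         # straight-line, per year
    electricity = power_kw * usd_per_kwh * hours_per_year
    return (amortization + electricity + overhead_usd_per_year) / hours_per_year


# Hypothetical: $30,000 server over 3 years, 0.7 kW draw, $0.12/kWh,
# $1,500/year for hosting and maintenance.
hourly = effective_hourly_cost(30_000, 3, 0.7, 0.12, 1_500)
print(f"${hourly:.2f} per hour")
```

The result is the number to enter in the hourly field in place of a rental rate.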

Question 6

Are these rates current?

Accepted Answer

The bundled GPU rate was verified on Jun 25, 2026 and links to its source. It is a convenience default only: the hourly field is editable, so the calculator stays correct even when a default goes stale. Always confirm current pricing with the provider and benchmark throughput yourself.

Self-Hosting Cost Per Token Calculator
