LLM Throughput Planner (GPUs Needed)

Plan the GPU fleet behind a self-hosted LLM. Enter your request rate, the tokens generated per request, and the throughput of one GPU, and get the aggregate tokens/second you must produce and the number of GPUs needed to keep up. Sizing is in tokens/second, the unit that actually loads an inference server, so it stays honest as request shapes change. Numbers update as you type. GPU specs verified Jun 25, 2026 (see sources); every field is editable.

Demand: 5 requests/sec, 2,000 tokens generated per request (defaults)
Per-GPU capacity: 2,000 tok/s (default)
You need 5 GPUs to serve 10,000 aggregate tokens/second at 2,000 tok/s per GPU. With 30% peak headroom, plan for 7 GPUs.
Aggregate throughput: 10,000 tok/s
GPUs needed (base): 5
GPUs at peak (+30%): 7
Per-GPU throughput: 2,000 tok/s
GPUs needed as request rate grows (at 2,000 tok/req, 2,000 tok/s/GPU)

Requests/sec   Aggregate tok/s   GPUs needed
1              2,000             1
2              4,000             2
5              10,000            5
10             20,000            10
20             40,000            20
50             100,000           50

GPU count rounds up — partial cards do not exist — so the curve steps rather than slopes. The table re-renders as you change the request size or per-GPU throughput above.

How it works

Sizing an inference fleet starts with the right unit. It is tempting to think in requests per second, but a request is not a fixed amount of work: one that generates a 4,000-token essay asks for 20 times as many output tokens as one that generates a 200-token snippet. The honest unit of demand is tokens per second, specifically generated (output) tokens, since generation is typically the throughput bottleneck on an autoregressive model. So the first step is to convert your request rate into an aggregate token rate: multiply requests per second by the tokens each request produces. That product is the total work your fleet must deliver every second.

The second step is supply. A single GPU, running a given model on a given serving stack, sustains some number of tokens per second under realistic batching. Divide the demand by that per-GPU supply and you get the number of GPUs the workload requires. Because you cannot deploy a fraction of a card, you round up: 4.1 GPUs of demand is 5 physical GPUs. That rounding is not a rounding error — it is the reason fleet cost steps up in discrete jumps as traffic grows, and why a workload sitting just above a GPU boundary is paying for capacity it barely uses.

The third step, the one most plans skip, is headroom. The base calculation sizes for average load, but production traffic arrives in bursts, requests queue, and nodes occasionally fail. A fleet pinned at 100% utilization has no slack: the moment demand spikes, latency balloons and requests time out. Sizing for the peak rate — adding a margin on top of the average — buys the slack that keeps tail latency under control. The trade-off is direct: more headroom means more idle GPUs you still pay for, less headroom means more risk of degradation when traffic surges. This planner shows both the base and a headroom-adjusted figure so you can pick the point that matches your latency budget.

Formula.
Aggregate tokens/sec = requests/sec × tokens per request
GPUs needed = ⌈ aggregate tokens/sec ÷ tokens/sec per GPU ⌉  (round up)
GPUs at peak = ⌈ (requests/sec × (1 + headroom%)) × tokens per request ÷ tokens/sec per GPU ⌉
Throughput is generated (output) tokens; measure per-GPU throughput on your own stack.
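
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the planner's actual implementation: the function names are hypothetical, headroom is passed as a fraction (0.30 for 30%), and the rounding tolerance before the ceiling is an assumption added to keep floating-point noise from adding a phantom GPU.

```python
import math


def aggregate_tokens_per_sec(requests_per_sec: float, tokens_per_request: float) -> float:
    """Demand: output tokens the fleet must generate every second."""
    return requests_per_sec * tokens_per_request


def gpus_needed(aggregate_tok_s: float, tok_s_per_gpu: float) -> int:
    """Supply: divide demand by per-GPU throughput and round up to whole cards."""
    if tok_s_per_gpu <= 0:
        raise ValueError("per-GPU throughput must be positive")
    # Round to 9 decimals first so float noise (e.g. 5.000000000001) does not
    # add an extra GPU; this tolerance is an assumption of the sketch.
    return math.ceil(round(aggregate_tok_s / tok_s_per_gpu, 9))


def gpus_at_peak(requests_per_sec: float, tokens_per_request: float,
                 tok_s_per_gpu: float, headroom: float) -> int:
    """Same calculation with the request rate scaled by (1 + headroom)."""
    peak_rate = requests_per_sec * (1 + headroom)
    return gpus_needed(aggregate_tokens_per_sec(peak_rate, tokens_per_request), tok_s_per_gpu)


# Defaults from this page: 5 req/s, 2,000 tok/req, 2,000 tok/s per GPU, 30% headroom.
demand = aggregate_tokens_per_sec(5, 2_000)
print(demand, gpus_needed(demand, 2_000), gpus_at_peak(5, 2_000, 2_000, 0.30))
```

With the page defaults this reproduces the figures in the result panel: 10,000 tok/s of demand, 5 GPUs at base, and 7 GPUs at peak.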

A worked example

Using the defaults — 5 requests/second, 2,000 tokens generated per request, and GPUs that each sustain 2,000 tokens/second:

  • Aggregate demand: 5 × 2,000 = 10,000 tokens/second
  • GPUs needed: 10,000 ÷ 2,000 = 5.0, rounded up to 5 GPUs
  • With 30% peak headroom: 5 × 1.30 = 6.5 req/s → 13,000 tok/s → 7 GPUs
  • Double the per-request tokens to 4,000: demand doubles, GPUs needed rises to 10

The example makes the levers explicit. At the defaults, demand lands exactly on a GPU boundary (5.0), so there is no rounding waste. Nudge the request rate to 5.1, though, and demand becomes 10,200 tok/s, or 5.1 GPUs, which rounds up to a sixth card. That sensitivity is the whole reason to size in tokens/second and round up deliberately. Once you have a GPU count, turn it into a running cost with the throughput cost calculator using rates from the GPU pricing table, and check whether self-hosting beats the API at your volume on the token cost calculator. The full method is on the methodology page.
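
To see the boundary effect around the default rate, a short sweep can help. The sketch below is illustrative only: `size_fleet` is a hypothetical helper, the constants are the worked-example defaults, and "utilization" here simply means demand divided by the fleet's total per-GPU throughput.

```python
import math

TOKENS_PER_REQUEST = 2_000   # default from the worked example
TOK_S_PER_GPU = 2_000        # default from the worked example


def size_fleet(requests_per_sec: float) -> tuple[int, float]:
    """Return (GPU count, fleet utilization) for a given request rate."""
    demand = requests_per_sec * TOKENS_PER_REQUEST
    # round() guards against float noise before rounding up to whole cards;
    # max(1, ...) keeps at least one GPU so utilization is always defined.
    gpus = max(1, math.ceil(round(demand / TOK_S_PER_GPU, 9)))
    utilization = demand / (gpus * TOK_S_PER_GPU)
    return gpus, utilization


for rate in (4.9, 5.0, 5.1, 6.0):
    gpus, util = size_fleet(rate)
    print(f"{rate:>4} req/s -> {gpus} GPUs at {util:.0%} utilization")
```

Moving from 5.0 to 5.1 requests/second adds a whole card while utilization drops from 100% to 85%; the sixth GPU is only fully used once the rate reaches 6.0.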

Frequently asked questions

How many GPUs do I need to serve my request rate?

Divide the aggregate tokens per second you must produce by what one GPU sustains, then round up. With the defaults — 5 requests/second × 2,000 tokens/request = 10,000 tokens/second, served by GPUs that each do 2,000 tokens/second — you need 5 GPUs. Rounding up matters: you cannot run a fraction of a GPU, so 4.1 GPUs of demand means 5 physical cards.

What is "aggregate tokens per second" and why does it matter?

It is your total generation throughput requirement: request rate × tokens generated per request. This single number — 10,000 tok/s for the defaults — is the demand your fleet must meet. GPU sizing is then just demand ÷ per-GPU supply. Working in tokens/second (not requests/second) is essential because a request that generates 4,000 tokens stresses the fleet twice as hard as one generating 2,000, even at the same request rate.

Should I add headroom for traffic peaks?

Almost always. The base figure sizes for average load; real traffic is bursty, and a GPU running at 100% has no slack for spikes, queueing, or a node failing. A common practice is to size for the peak rate, not the mean — e.g. adding 30% headroom to the default rate raises the requirement to 7 GPUs. Decide your target by how much latency you are willing to let degrade at peak versus how much idle capacity you are willing to pay for.
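
One way to weigh that trade-off is to tabulate, for a few headroom levels, the GPU count and how busy the fleet would be at average load. This is a hedged sketch using the page defaults; the function names and the chosen headroom levels are assumptions made for illustration.

```python
import math


def gpus_for(requests_per_sec: float, tokens_per_request: float,
             tok_s_per_gpu: float, headroom: float = 0.0) -> int:
    """GPUs needed when the request rate is scaled by (1 + headroom)."""
    peak_demand = requests_per_sec * (1 + headroom) * tokens_per_request
    # round() absorbs float noise before rounding up to whole GPUs
    return math.ceil(round(peak_demand / tok_s_per_gpu, 9))


def headroom_table(requests_per_sec: float = 5, tokens_per_request: float = 2_000,
                   tok_s_per_gpu: float = 2_000,
                   headrooms: tuple = (0.0, 0.10, 0.30, 0.50)) -> list:
    """Rows of (headroom, GPU count, utilization at average load)."""
    avg_demand = requests_per_sec * tokens_per_request
    rows = []
    for h in headrooms:
        gpus = gpus_for(requests_per_sec, tokens_per_request, tok_s_per_gpu, h)
        avg_util = avg_demand / (gpus * tok_s_per_gpu)
        rows.append((h, gpus, avg_util))
    return rows


for h, gpus, util in headroom_table():
    print(f"headroom {h:.0%}: {gpus} GPUs, {util:.0%} busy at average load")
```

At the defaults, 30% headroom means 7 GPUs that run at roughly 71% of capacity on average; the remaining capacity is the slack you pay for to absorb bursts.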

What does "tokens per second per GPU" depend on?

On the model size, the GPU, the quantization, the batch size, and the sequence lengths. A small quantized model on a fast GPU with good batching can push many thousands of tokens/second; a large model at full precision pushes far fewer. The 2,000 tok/s default is a mid-range placeholder — measure your own serving stack (vLLM, TGI, TensorRT-LLM) under realistic batching and enter the real figure for an accurate plan.
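
If your load test reports total generated tokens and elapsed time, the per-GPU figure to enter can be derived as output tokens divided by GPU-seconds. The sketch below assumes a hypothetical record shape (`LoadTestWindow`) and uses made-up sample numbers purely for illustration; it does not call any serving stack's API.

```python
from dataclasses import dataclass


@dataclass
class LoadTestWindow:
    """One measurement window from a load test (hypothetical record shape)."""
    output_tokens: int    # tokens generated across all requests in the window
    wall_seconds: float   # elapsed wall-clock time of the window
    gpus: int             # GPUs serving traffic during the window


def tok_s_per_gpu(windows: list[LoadTestWindow]) -> float:
    """Sustained output tokens per second per GPU across all windows."""
    total_tokens = sum(w.output_tokens for w in windows)
    gpu_seconds = sum(w.wall_seconds * w.gpus for w in windows)
    if gpu_seconds <= 0:
        raise ValueError("need at least one window with positive GPU-seconds")
    return total_tokens / gpu_seconds


# Made-up sample numbers, not measurements from any real system.
windows = [
    LoadTestWindow(output_tokens=480_000, wall_seconds=60.0, gpus=4),
    LoadTestWindow(output_tokens=456_000, wall_seconds=60.0, gpus=4),
]
print(f"{tok_s_per_gpu(windows):,.0f} tok/s per GPU")
```

Averaging over several windows taken under realistic batching gives a steadier figure than a single short burst.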

Does this tell me the cost?

Not directly — it sizes the fleet in GPU count. Multiply the GPU count by an hourly rate to get the running cost: see current rates on the GPU pricing table and compare self-hosting against the API with the throughput cost calculator. Sizing first, then pricing, keeps the two decisions clean.
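
The multiplication is simple enough to sketch. The hourly rate below is a placeholder, not a quoted price, and 730 hours per month (8,760 hours per year divided by 12) is an assumption; substitute the rate you actually pay.

```python
def fleet_cost(gpus: int, hourly_rate_usd: float, hours: float = 730.0) -> float:
    """Running cost of a fleet: GPU count x hourly rate x hours of operation."""
    if gpus < 0 or hourly_rate_usd < 0 or hours < 0:
        raise ValueError("inputs must be non-negative")
    return gpus * hourly_rate_usd * hours


PLACEHOLDER_RATE_USD = 2.00  # hypothetical $/GPU-hour, for illustration only

base = fleet_cost(5, PLACEHOLDER_RATE_USD)   # base sizing from the defaults
peak = fleet_cost(7, PLACEHOLDER_RATE_USD)   # with 30% peak headroom
print(f"base: ${base:,.0f}/month, with headroom: ${peak:,.0f}/month")
```

Running both GPU counts through the same function shows what the headroom costs per month at your rate.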

How current are these figures?

The throughput and per-GPU defaults are illustrative planning values, not measured benchmarks — every field is editable so you can drop in numbers from your own load tests. The GPU specs and prices referenced on the site were verified on Jun 25, 2026. Always validate per-GPU throughput on your actual model and serving stack.

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).