Question 1

How many GPUs do I need to serve my request rate?

Accepted Answer

Divide the aggregate tokens per second you must produce by what one GPU sustains, then round up. With the defaults — 5 requests/second × 2,000 tokens/request = 10,000 tokens/second, served by GPUs that each do 2,000 tokens/second — you need 5 GPUs. Rounding up matters: you cannot run a fraction of a GPU, so 4.1 GPUs of demand means 5 physical cards.
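The division and round-up described above can be sketched in a few lines of Python. This is a minimal illustration, not the planner's actual implementation; the function name `gpus_needed` and its parameter names are assumptions made for this example.

```python
import math


def gpus_needed(requests_per_sec: float,
                tokens_per_request: float,
                tokens_per_sec_per_gpu: float) -> int:
    """Return the whole number of GPUs needed to meet the token demand.

    Hypothetical helper: the names are illustrative, not taken from the planner.
    """
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("per-GPU throughput must be positive")
    # Demand: aggregate tokens per second the fleet must produce.
    demand = requests_per_sec * tokens_per_request
    # Round up, because a fraction of a GPU cannot be provisioned.
    return math.ceil(demand / tokens_per_sec_per_gpu)


# Defaults from the answer: 5 req/s x 2,000 tokens/request on 2,000 tok/s GPUs.
print(gpus_needed(5, 2000, 2000))
# A demand of 4.1 GPUs' worth of tokens still needs 5 physical cards.
print(gpus_needed(4.1, 2000, 2000))
```

Both calls print 5: the first because demand divides evenly, the second because 4.1 is rounded up.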

Question 2

What is "aggregate tokens per second" and why does it matter?

Accepted Answer

It is your total generation throughput requirement: request rate × tokens generated per request. This single number — 10,000 tok/s for the defaults — is the demand your fleet must meet. GPU sizing is then just demand ÷ per-GPU supply. Working in tokens/second (not requests/second) is essential because a request that generates 4,000 tokens stresses the fleet twice as hard as one generating 2,000, even at the same request rate.
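To make the point about tokens versus requests concrete, here is a small sketch comparing two workloads with the same request rate but different output lengths. The function name `aggregate_tokens_per_sec` is an assumption for illustration only.

```python
def aggregate_tokens_per_sec(requests_per_sec: float,
                             tokens_per_request: float) -> float:
    """Total generation demand: request rate times tokens generated per request."""
    return requests_per_sec * tokens_per_request


# Same request rate, different output lengths.
short_outputs = aggregate_tokens_per_sec(5, 2000)
long_outputs = aggregate_tokens_per_sec(5, 4000)

print(short_outputs)                 # demand at 2,000 tokens/request
print(long_outputs)                  # demand at 4,000 tokens/request
print(long_outputs / short_outputs)  # ratio between the two demands
```

The request rate is identical in both cases, yet the second workload asks the fleet for twice the tokens per second, which is why sizing in requests/second alone would undercount it.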

Question 3

Should I add headroom for traffic peaks?

Accepted Answer

Almost always. The base figure sizes for average load; real traffic is bursty, and a GPU running at 100% has no slack for spikes, queueing, or a node failing. A common practice is to size for the peak rate, not the mean. For example, adding 30% headroom to the default rate gives 5 × 1.3 = 6.5 requests/second, or 13,000 tokens/second, which is 6.5 GPUs of demand and rounds up to 7 GPUs. Choose your target by weighing how much latency degradation you will accept at peak against how much idle capacity you are willing to pay for.
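The headroom arithmetic can be sketched as follows. This is an illustrative calculation under the defaults quoted above; `gpus_with_headroom` and its `headroom` parameter (a fraction, so 0.30 means 30%) are hypothetical names introduced here.

```python
import math


def gpus_with_headroom(requests_per_sec: float,
                       tokens_per_request: float,
                       tokens_per_sec_per_gpu: float,
                       headroom: float = 0.30) -> int:
    """Size the fleet for a rate inflated by `headroom` (0.30 means +30%)."""
    # Inflate the average request rate to the rate we want to survive.
    peak_rate = requests_per_sec * (1 + headroom)
    demand = peak_rate * tokens_per_request
    # Round up to whole GPUs.
    return math.ceil(demand / tokens_per_sec_per_gpu)


print(gpus_with_headroom(5, 2000, 2000, headroom=0.0))   # average load only
print(gpus_with_headroom(5, 2000, 2000, headroom=0.30))  # with 30% headroom
```

With no headroom the defaults need 5 GPUs; with 30% headroom the demand becomes 6.5 GPUs' worth of tokens, which rounds up to 7.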

Question 4

What does "tokens per second per GPU" depend on?

Accepted Answer

On the model size, the GPU, the quantization, the batch size, and the sequence lengths. A small quantized model on a fast GPU with good batching can push many thousands of tokens/second; a large model at full precision pushes far fewer. The 2,000 tok/s default is a mid-range placeholder — measure your own serving stack (vLLM, TGI, TensorRT-LLM) under realistic batching and enter the real figure for an accurate plan.
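One way to obtain your own per-GPU figure is to time a batch of generations and divide the tokens produced by the elapsed time. The sketch below assumes a hypothetical `generate` callable that you would write around your own serving stack; it does not use any real vLLM, TGI, or TensorRT-LLM API, and a single timed batch is only a rough stand-in for a proper load test.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_sec(generate: Callable[[Sequence[str]], Sequence[int]],
                           prompts: Sequence[str]) -> float:
    """Time one batch and return generated tokens per second.

    `generate` is a hypothetical wrapper you supply: it sends the prompts to
    your serving stack and returns the number of tokens generated per prompt.
    """
    start = time.perf_counter()
    token_counts = generate(prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return sum(token_counts) / elapsed


def fake_generate(prompts: Sequence[str]) -> Sequence[int]:
    """Stand-in for a real backend: sleeps briefly, then reports 100 tokens each."""
    time.sleep(0.05)
    return [100] * len(prompts)


rate = measure_tokens_per_sec(fake_generate, ["a", "b", "c", "d"])
print(f"{rate:.0f} tok/s")
```

Replacing `fake_generate` with a call into your real stack, using realistic prompt lengths and batch sizes, yields the number to enter in place of the 2,000 tok/s placeholder.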

Question 5

Does this tell me the cost?

Accepted Answer

Not directly — it sizes the fleet in GPU count. Multiply the GPU count by an hourly rate to get the running cost: see current rates on the GPU pricing table and compare self-hosting against the API with the throughput cost calculator. Sizing first, then pricing, keeps the two decisions clean.
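The cost step is a single multiplication once the fleet is sized. In the sketch below the hourly rate is a made-up placeholder, not a quoted price, and `hourly_fleet_cost` is a hypothetical helper name; substitute the current rate for your GPU.

```python
def hourly_fleet_cost(gpu_count: int, hourly_rate_per_gpu: float) -> float:
    """Running cost per hour: number of GPUs times the hourly rate of one GPU."""
    return gpu_count * hourly_rate_per_gpu


# PLACEHOLDER rate for illustration only; look up the real price for your GPU.
PLACEHOLDER_RATE_PER_GPU_HOUR = 2.00

gpu_count = 5  # fleet size for the default inputs
per_hour = hourly_fleet_cost(gpu_count, PLACEHOLDER_RATE_PER_GPU_HOUR)
per_month = per_hour * 24 * 30  # rough 30-day month, fleet running continuously

print(f"per hour:  {per_hour:.2f}")
print(f"per month: {per_month:.2f}")
```

Keeping the GPU count and the rate as separate inputs mirrors the advice above: size first, then price.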

Question 6

How current are these figures?

Accepted Answer

The throughput and per-GPU defaults are illustrative planning values, not measured benchmarks — every field is editable so you can drop in numbers from your own load tests. The GPU specs and prices referenced on the site were verified on Jun 25, 2026. Always validate per-GPU throughput on your actual model and serving stack.

GPUs needed at the default 2,000 tokens/request and 2,000 tok/s per GPU:

Requests / sec	Aggregate tok/s	GPUs needed
1	2,000	1
2	4,000	2
5	10,000	5
10	20,000	10
20	40,000	20
50	100,000	50
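The rows above follow directly from the default inputs, and a short script can regenerate them for other request rates. This is a sketch under the defaults stated in the answers (2,000 tokens/request, 2,000 tok/s per GPU); the constant and function names are assumptions for illustration.

```python
import math

# Defaults quoted in the answers above.
TOKENS_PER_REQUEST = 2000
TOKENS_PER_SEC_PER_GPU = 2000


def table_rows(request_rates):
    """Return (requests/sec, aggregate tok/s, GPUs needed) for each rate."""
    rows = []
    for rate in request_rates:
        demand = rate * TOKENS_PER_REQUEST
        gpus = math.ceil(demand / TOKENS_PER_SEC_PER_GPU)
        rows.append((rate, demand, gpus))
    return rows


print("Requests / sec\tAggregate tok/s\tGPUs needed")
for rate, demand, gpus in table_rows([1, 2, 5, 10, 20, 50]):
    print(f"{rate}\t{demand:,}\t{gpus}")
```

Changing either constant to a measured value regenerates the table for your own workload.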

LLM Throughput Planner (GPUs Needed)
