GPU sizing for LLM inference: the VRAM math
Before you can price a self-hosted model, you have to know whether it fits — and on how many GPUs. The arithmetic is short: parameters × bytes per parameter, plus about 20% headroom for the KV-cache and activations, compared against real GPU memory. This guide works through that math in five steps, explains where the overhead comes from, and shows how to split a model across multiple cards when it won't fit on one.
Step 1 — Get the parameter count
Everything starts with the parameter count, almost always stated in billions and often baked into the model's name: a "70B" model has 70 billion parameters, an "8B" has 8 billion. The model card is the authoritative source. Our open-weight model dataset lists popular models with their parameter counts so you don't have to hunt for them.
Step 2 — Pick a quantization
Parameters are stored at a chosen numeric precision, and precision sets the bytes per parameter:
- fp16 / bf16 — 2 bytes per parameter. The default for full-quality inference.
- int8 — 1 byte per parameter. Roughly half the VRAM, usually near-lossless.
- int4 — about 0.5 bytes per parameter. Roughly a quarter of fp16, with a small quality trade-off.
Quantization is the single biggest lever on VRAM. Dropping from fp16 to int4 cuts weight memory by 4×, which routinely turns a multi-GPU model into a single-GPU one. The cost is a modest, task-dependent quality drop — test it on your own workload.
Step 3 — Multiply parameters by bytes per parameter
Raw weight memory is just the product of the two numbers above:
Weights (GB) ≈ params(B) × bytes-per-param
70B fp16 = 70 × 2 = 140GB
70B int8 = 70 × 1 = 70GB
70B int4 = 70 × 0.5 = 35GB
This is the floor: the memory needed just to hold the model. It is not yet the number you size a GPU against, because serving needs working memory on top of the weights.
Step 4 — Add about 20% for KV-cache and activations
Two things consume VRAM beyond the weights. Activations are the intermediate tensors of a forward pass. The KV-cache stores the key/value tensors for every token already in each in-flight request, so generation doesn't recompute the whole context for every new token — it grows with context length and concurrency. A practical rule of thumb is to multiply weight memory by about 1.2 to cover both:
Required VRAM ≈ params(B) × bytes-per-param × 1.2
70B int4 = 35 × 1.2 ≈ 42GB → fits a single 48GB or 80GB GPU
70B fp16 = 140 × 1.2 ≈ 168GB → needs multiple GPUs
The 1.2 multiplier is a starting estimate for short prompts at modest concurrency. If you serve long contexts to many simultaneous users, the KV-cache can grow to rival or exceed the weights themselves, so budget it explicitly rather than trusting the flat 20%. Either way, the VRAM model-fit calculator applies this overhead for you and reports the requirement directly.
Step 5 — Pick a GPU, or split across several
Now compare the requirement to real GPU memory. If it fits on one card with headroom, you're done — and you should leave headroom rather than filling memory to the last gigabyte, because the KV-cache expands under load. If it doesn't fit, you split the model across GPUs:
GPUs = ceil(required VRAM ÷ per-GPU VRAM)
168GB across 80GB cards → ceil(168 ÷ 80) = 3 GPUs
42GB on a single 80GB card → 1 GPU
Multi-GPU serving uses tensor or pipeline parallelism to spread the weights and computation across cards, which adds inter-GPU communication overhead. So the goal is the smallest GPU count that fits comfortably — splitting more than necessary costs both money and throughput. Once you've chosen the GPU and count, the GPU pricing dataset gives the hourly rates to turn this into a monthly cost.
VRAM fit is necessary, not sufficient
- VRAM fit is a hard constraint: if the model plus its working memory doesn't fit, it won't run at all.
- Throughput is the separate question of how many tokens per second the GPU sustains, which sets your cost per token and your capacity ceiling.
Size for memory first using the steps above, then plan capacity and concurrency with the throughput planner. A model that barely fits in VRAM may have almost no room left for the KV-cache, throttling concurrency and throughput — so the two questions are linked, and you want headroom on both. Get the VRAM math right and the rest of the self-hosting analysis has a solid foundation to stand on.
How the KV-cache scales, in numbers
The flat 20% overhead is a convenience, not a law. The KV-cache size depends on the context length you serve and how many requests are in flight at once, and for long-context, high-concurrency workloads it can dwarf the weights. The cache stores a key and a value vector per token, per layer, across the model's attention heads, so it grows linearly with both the number of tokens held in context and the number of concurrent sequences.
- Short prompts, low concurrency — a chat assistant with a few hundred tokens of context per request. The KV-cache is small; the 1.2 multiplier is generous.
- Long context, high concurrency — a retrieval-augmented service stuffing tens of thousands of tokens into context for dozens of simultaneous users. The cache can equal or exceed the weights, and a flat 20% will badly under-size the GPU.
The practical takeaway: if your workload is long-context or highly concurrent, don't trust the rule of thumb. Estimate the cache from your real context length and concurrency, or measure peak VRAM under load, and leave explicit headroom on top. Under-sizing here doesn't just slow things down — it causes out-of-memory failures or forces the server to evict requests.
Choosing precision against your GPU budget
Because quantization scales weight memory linearly, it's usually the first knob to turn when a model doesn't fit. Walk the options against the hardware you can actually rent:
- fp16 when quality is paramount and you have the VRAM (or GPU count) to spare.
- int8 as the common middle ground — half the memory, usually near-lossless, often enough to drop from two GPUs to one.
- int4 when you need to squeeze a large model onto a single card and can tolerate a small quality cost — the lever that puts a 70B model on one 48GB or 80GB GPU.
Run the candidate model and quantization through the VRAM model-fit calculator to confirm the fit and the GPU count, cross-reference parameter counts in the open-weight model dataset, and price the resulting GPU choice with the GPU pricing dataset. Size, then fit, then price — in that order — and the self-hosting cost case rests on numbers you can defend.
A quick reference table in prose
To make the relationship concrete, here is how a few common model sizes land across precisions, using the params × bytes × 1.2 rule:
- 8B model: fp16 ≈ 19GB, int8 ≈ 10GB, int4 ≈ 5GB — fits a single mid-range GPU at any precision.
- 13B model: fp16 ≈ 31GB, int8 ≈ 16GB, int4 ≈ 8GB — fp16 wants a 40GB+ card; int8 and int4 fit comfortably on one.
- 70B model: fp16 ≈ 168GB (three 80GB GPUs), int8 ≈ 84GB (two), int4 ≈ 42GB (one).
These are weight-plus-overhead floors at modest context and concurrency; long-context, high-concurrency serving pushes them higher via the KV-cache. Treat the table as where sizing starts, then confirm against your real workload.
Frequently asked questions
How much VRAM does a 70B model need?
In fp16 the weights alone are 70 × 2 = 140GB, which needs two 80GB GPUs. Quantize to int4 and the weights drop to about 35GB; with the ~20% overhead for KV-cache and activations that lands near 42GB, which fits on a single 48GB or 80GB card. Quantization is often the difference between needing one GPU and needing several.
What is the KV-cache and why does it grow VRAM?
The KV-cache stores the key and value tensors for every token already in a request's context, so the model doesn't recompute them on each new token. It grows with context length and with the number of concurrent requests, and it lives in VRAM alongside the weights. For short prompts at low concurrency the ~20% overhead rule is fine; for long-context, high-concurrency serving the KV-cache can rival or exceed the weights, so size it explicitly.
Does quantization hurt quality?
Some, but less than people expect. int8 is usually near-lossless for inference; int4 typically costs a small, often acceptable, drop in quality in exchange for roughly a 4× VRAM reduction versus fp16. The right precision is a trade-off between fitting your GPU budget and the quality your application needs — test on your own task rather than trusting a benchmark headline.
When do I need more than one GPU?
Whenever the required VRAM (weights × 1.2, plus your KV-cache budget) exceeds a single card's memory. GPUs needed = ceil(required VRAM ÷ per-GPU VRAM). Multi-GPU serving uses tensor or pipeline parallelism, which adds communication overhead, so prefer the smallest GPU count that fits comfortably with headroom rather than splitting unnecessarily.
Is fitting in VRAM the same as being fast enough?
No. VRAM fit is a hard yes/no constraint — if the model doesn't fit, it won't run. Throughput is a separate question: how many tokens per second the GPU sustains under batched load determines your cost per token and whether you can meet demand. Size for VRAM first, then plan capacity with the throughput planner.
Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).