VRAM & Model-Fit Calculator

Q: What is the difference between "weights only" and "with overhead"?

Weights only is just the model parameters in memory — params × bytes-per-param. It is the theoretical floor and a useful headline number (70B fp16 = 140 GB). With overhead multiplies that by 1.2× to reserve space for the attention KV-cache and intermediate activations, which grow with context length and batch size. Always size your GPU to the with-overhead figure; the weights-only number will not actually run.

🛠️ Free toolUpdated Jun 25, 2026By Francesco Zinghinì

Will an open-weight model fit on your GPU? Enter the parameter count and pick a precision, and this checker shows the VRAM needed at fp16, int8 and int4 — both the raw weights and the realistic figure with KV-cache and activation headroom — then lists exactly which GPUs can hold it and how many you would need. Numbers update as you type. Specs verified Jun 25, 2026 — sources; every field is editable.

int4 (selected) needs42 GB

Weights only35 GB

Headroom factor1.2×

VRAM required by precision (weights only vs with 1.2× runtime headroom)
Precision	Bytes/param	Weights only	With overhead
fp16	2.0	140 GB	168 GB
int8	1.0	70 GB	84 GB
int4	0.5	35 GB	42 GB

Which GPUs fit the model at the selected precision (42 GB needed)
GPU	VRAM	Cards needed	Fits on one?
NVIDIA A100 80GB	80 GB	1×	✅ yes
NVIDIA H100 80GB	80 GB	1×	✅ yes
NVIDIA L40S 48GB	48 GB	1×	✅ yes
NVIDIA RTX 4090 24GB	24 GB	2×	❌ no (shard)

"Weights only" will not run. The raw-weight figure (e.g. 70B fp16 = 140 GB) is the theoretical floor. In practice the KV-cache and activations need room too, and they grow with context length and batch size. Always provision to the with-overhead column. Long contexts or large batches can push real usage even past the 1.2× factor used here.

How it works

The dominant consumer of GPU memory is the model weights, and their size is delightfully predictable: it is just the number of parameters multiplied by how many bytes each parameter occupies. At full fp16 precision that is two bytes per parameter, so a 70-billion-parameter model is exactly 140 GB of weights. Quantization shrinks the bytes-per-parameter — one byte for int8, half a byte for int4 — and the weight footprint shrinks proportionally. This is the headline "weights only" number, and it is the right figure to quote when comparing model sizes.

But weights are not the only thing that lives in VRAM at inference time. The attention mechanism caches keys and values for every token in the context (the KV-cache), and the forward pass produces intermediate activations; both scale with how long your prompts are and how many requests you batch together. To keep the sizing simple and safe, this checker multiplies the weight footprint by a fixed 1.2× headroom factor — a reasonable default for moderate context and batch. The result is the "with overhead" figure, and it is the number you should actually size a GPU against. If you run very long contexts or aggressive batching, budget even more.

Once we know the with-overhead requirement, fitting is arithmetic: for each GPU, divide the requirement by the card's memory and round up to get the number of cards needed. One card with enough memory is the happy path. When a model is too big for a single GPU you must shard it across several, which is entirely workable but adds inter-GPU communication and orchestration overhead — a reason to prefer a more aggressive quantization or a larger card if one is available. The fit table makes the trade-off concrete for every GPU in the dataset.

Formulas.
Weights only (GB) = parameters (B) × bytes-per-param (fp16=2, int8=1, int4=0.5)
With overhead (GB) = weights only × 1.2 (KV-cache + activations headroom)
Cards needed = ⌈ with-overhead VRAM ÷ GPU VRAM ⌉

A worked example

Take a 70B model (the default, Llama-3 70B):

fp16 weights: 70 × 2 = 140 GB; with 1.2× overhead ≈ 168 GB
int8 weights: 70 × 1 = 70 GB; with overhead ≈ 84 GB
int4 weights: 70 × 0.5 = 35 GB; with overhead ≈ 42 GB

So at int4 the model needs about 42 GB to serve — which fits on a single 80 GB A100 or H100 with room to spare, but would need two 24 GB RTX 4090s. At fp16 the same model needs roughly 168 GB, comfortably beyond any single card in the dataset, so you would shard it across two 80 GB GPUs or quantize. This is the everyday calculus of self-hosting: quantization is often the difference between one cheap card and a multi-GPU rig.

Once a model fits, the next questions are speed and cost. Turn throughput into a price with the throughput cost calculator, size the all-in cost of owning the card with the GPU TCO calculator or renting it with the cloud GPU cost calculator, and decide whether to self-host at all with the API vs self-hosting comparator. Browse cards in the GPU pricing dataset, models in the open-weight model dataset, and the full sizing derivation in the methodology.

Frequently asked questions

How much VRAM does a 70B model need?

The raw weights of a 70-billion-parameter model are 70 × bytes-per-parameter: 140 GB at fp16 (2 bytes), 70 GB at int8 (1 byte), and 35 GB at int4 (0.5 byte). On top of that you must reserve room for the KV-cache and activations — we add a 1.2× headroom factor — so to actually serve it you should provision about 168 / 84 / 42 GB respectively.

What is the difference between "weights only" and "with overhead"?

Weights only is just the model parameters in memory — params × bytes-per-param. It is the theoretical floor and a useful headline number (70B fp16 = 140 GB). With overhead multiplies that by 1.2× to reserve space for the attention KV-cache and intermediate activations, which grow with context length and batch size. Always size your GPU to the with-overhead figure; the weights-only number will not actually run.

How does quantization reduce VRAM?

Quantization stores each weight in fewer bits. fp16 uses 2 bytes per parameter, int8 uses 1, int4 uses 0.5 — so int4 needs a quarter of the memory of fp16. That is what lets a 70B model that needs 168 GB at fp16 squeeze into about 42 GB at int4, often onto a single GPU. The trade-off is a usually-small quality loss that you should validate for your task.

Which GPUs can run this model?

Any GPU (or combination) whose total VRAM meets the with-overhead requirement. The fit table below lists each card in our dataset and how many you would need to hold the model at the selected quantization. A model that needs more memory than one card has must be sharded across several — which adds inter-GPU communication overhead and complexity.

Does fitting the model guarantee good performance?

No. Fitting is necessary but not sufficient. Once the model fits, throughput depends on the GPU's compute and memory bandwidth, your batch size, and the framework. Use the throughput cost calculator to turn measured tokens-per-second into a cost per token, and the throughput planner to size capacity for a request rate.

Where do the model sizes and GPU specs come from?

Parameter counts come from the published model cards and GPU memory from the manufacturer specs, verified on Jun 25, 2026 with sources linked. They are convenience defaults — params and quantization are editable — so the checker stays correct as new models and cards appear. See the open-weight model dataset and GPU dataset.

Sources & pricing references

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).