Open-Weight LLM Models Table
Popular open-weight LLMs with their parameter count, license, and the GPU memory (VRAM) the weights need at three precisions — full fp16, int8 and int4. The VRAM figures are weights-only (they exclude KV-cache and runtime overhead), so they show the floor you must clear before a model will load at all. Prices and specs as of Jun 25, 2026.
| Model | Params (B) | License | VRAM fp16 | VRAM int8 | VRAM int4 | Source |
|---|---|---|---|---|---|---|
| Llama-3 70B | 70 | Llama Community | 140 GB | 70 GB | 35 GB | model ↗ |
| Mixtral 8x7B | 47 | Apache-2.0 | 94 GB | 47 GB | 24 GB | model ↗ |
| Llama-3 8B | 8 | Llama Community | 16 GB | 8 GB | 4 GB | model ↗ |
| Qwen2 7B | 7 | Apache-2.0 | 14 GB | 7 GB | 4 GB | model ↗ |
Need to know which GPU actually fits, including KV-cache and headroom? Use the VRAM model-fit calculator, then price the hardware with the GPU cloud cost tool.
From parameters to VRAM
The dominant memory cost of a model is its weights, and weights memory is almost purely arithmetic: VRAM ≈ parameters × bytes-per-parameter. At fp16 each parameter takes 2 bytes, so a 7-billion-parameter model needs about 14 GB just to hold the weights; a 70B model needs about 140 GB, which already exceeds a single 80 GB card. The numbers in this table use that formula directly (weights only), which is why they scale linearly with the parameter count.
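As a minimal sketch of that rule, the snippet below recomputes the table's VRAM columns from the parameter counts. The names `BYTES_PER_PARAM`, `weights_vram_gb`, and `MODELS` are illustrative choices, and 1 GB is taken as 10^9 bytes, which is what the table's figures imply.

```python
# Bytes per parameter at each precision, as stated in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_vram_gb(params_billions: float, precision: str) -> float:
    """Weights-only VRAM in GB (1 GB = 10**9 bytes); no KV cache or overhead."""
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * BYTES_PER_PARAM[precision]


# Parameter counts (in billions) copied from the table above.
MODELS = {"Llama-3 70B": 70, "Mixtral 8x7B": 47, "Llama-3 8B": 8, "Qwen2 7B": 7}

for name, params in MODELS.items():
    row = {p: weights_vram_gb(params, p) for p in BYTES_PER_PARAM}
    print(
        f"{name:<13} fp16={row['fp16']:.1f} GB  "
        f"int8={row['int8']:.1f} GB  int4={row['int4']:.1f} GB"
    )
```

The unrounded int4 values for Mixtral 8x7B (23.5 GB) and Qwen2 7B (3.5 GB) appear in the table rounded up to 24 GB and 4 GB.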
The quantization trade-off
Quantization stores each weight in fewer bits, which is the single biggest lever for fitting a model on smaller hardware. int8 halves the fp16 footprint (1 byte per parameter); int4 halves it again (0.5 bytes). That is how a 70B model that needs ~140 GB at fp16 drops to ~35 GB at int4, the difference between a multi-GPU server and a single high-end card. The cost is quality: lower precision introduces small rounding errors that can slightly degrade output, more noticeably below 4 bits. For most inference workloads int8 is nearly lossless and int4 gives up little quality for a 4x memory saving over fp16; go lower only after testing on your own prompts.
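To make the rounding error concrete, here is a toy sketch of symmetric absmax quantization in plain Python. It is an assumption-laden illustration, not the method of any particular library: production quantizers typically use per-group scales and other refinements, and the `weights` values are hypothetical.

```python
def quantize_roundtrip(values, bits):
    """Symmetric absmax quantization to signed `bits`-bit integers and back."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax  # size of one quantization step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weight values, chosen only for illustration.
weights = [0.82, -0.41, 0.13, -0.07, 0.56, -0.95, 0.02, 0.33]

for bits in (8, 4):
    restored, scale = quantize_roundtrip(weights, bits)
    err = max_abs_error(weights, restored)
    print(f"int{bits}: step={scale:.4f}  max rounding error={err:.4f}")
```

Each restored value is off by at most half a step, and the step at 4 bits is roughly 18 times larger than at 8 bits (127 levels versus 7 on each side of zero), which is why degradation becomes easier to notice as the bit width drops.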
Where the data comes from
Parameter counts and licenses come from each model's official card or repository (linked in the Source column). The VRAM columns are computed, not measured: they apply the bytes-per-parameter rule above with no runtime overhead, so they are lower bounds, not provisioning targets. Real deployments also need memory for the KV cache (which grows with context length and batch size), activations, and the inference framework itself, typically another 15–30% on top. Plan for that headroom rather than provisioning to the exact figure here.
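The headroom rule can be sketched as a small helper that turns a weights-only figure into a planning range. The function name `provisioning_range_gb` is hypothetical, and the 15% and 30% defaults are simply the rule-of-thumb bounds quoted above, not measured values.

```python
def provisioning_range_gb(weights_gb: float, low: float = 0.15, high: float = 0.30):
    """Return (low, high) total-VRAM estimates after adding fractional overhead."""
    return weights_gb * (1 + low), weights_gb * (1 + high)


# Weights-only figures taken from the table above.
examples = [("Llama-3 8B fp16", 16), ("Llama-3 70B int4", 35), ("Llama-3 70B int8", 70)]

for label, weights_gb in examples:
    lo, hi = provisioning_range_gb(weights_gb)
    print(f"{label}: weights {weights_gb} GB -> plan for {lo:.1f}-{hi:.1f} GB")
```

Under this rule, 70 GB of int8 weights already lands at about 80.5 GB at the low end, slightly above an 80 GB card, so a weights-only fit does not guarantee a usable deployment.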
Frequently asked questions
Do these VRAM numbers include the KV cache?
No. The table is weights-only so it scales cleanly with parameters. Add roughly 15–30% for the KV cache, activations and framework overhead — and more for long contexts or large batch sizes. The VRAM model-fit calculator adds that headroom for you.
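For a rough sense of how the KV cache scales, the sketch below uses the standard per-token estimate for transformer attention: two tensors (keys and values) per layer, each of size KV heads × head dimension. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumptions meant to resemble Llama-3 8B; check the model's own configuration before relying on them.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values for every layer, token and sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch


# Assumed architecture values (not from the table); fp16 cache = 2 bytes per element.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for batch in (1, 8):
    size_gb = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch=batch) / 1e9
    print(f"context=8192 batch={batch}: KV cache ~ {size_gb:.2f} GB")
```

With these assumed values, one 8,192-token sequence needs about 1.07 GB, while a batch of eight needs about 8.6 GB. That is well over 30% of the 16 GB of fp16 weights, which is why long contexts and large batches call for more than the default headroom.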
Will int4 hurt my output quality?
Usually only slightly. int8 is close to lossless for most tasks; int4 is a strong default that many production deployments use. Below 4 bits the degradation becomes more visible. Always benchmark a quantized model on your own prompts before committing.
Why does a 70B model need more than one 80GB GPU at fp16?
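140 GB of weights does not fit in 80 GB. At fp16 you would shard across at least two cards (or pick a bigger-memory configuration). Quantizing to int4 (~35 GB) fits on a single card with room to spare. int8 (~70 GB) also loads on one card, but the ~10 GB left over is less than the 15–30% headroom recommended above, so it is tight for anything beyond short contexts.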
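The card count follows from a ceiling division, sketched below. The helper `min_cards` is hypothetical, and it assumes memory can be split evenly across cards, which real sharding schemes only approximate.

```python
import math


def min_cards(weights_gb: float, card_gb: float = 80.0, overhead: float = 0.0) -> int:
    """Smallest number of cards whose combined memory covers weights plus overhead."""
    needed_gb = weights_gb * (1 + overhead)
    return math.ceil(needed_gb / card_gb)


# Llama-3 70B weights-only figures from the table above.
for precision, weights_gb in [("fp16", 140), ("int8", 70), ("int4", 35)]:
    floor = min_cards(weights_gb)                      # weights only
    planned = min_cards(weights_gb, overhead=0.15)     # with 15% headroom
    print(f"{precision}: {floor} card(s) weights-only, {planned} with 15% headroom")
```

Even the low end of the headroom range changes the answer for fp16 and int8, because both sit close to a multiple of 80 GB; int4 stays on a single card either way.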
Are mixture-of-experts models sized the same way?
The VRAM here is based on total parameters, which is the right figure for memory: an MoE model must hold all expert weights in VRAM even though only some are active per token. So memory tracks total params, while compute tracks the smaller active-parameter count.
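A small sketch of that distinction, using Mixtral 8x7B as the example. The total parameter count comes from the table; the active-parameter figure of roughly 13B is an assumed approximate value that is not in the table, and the `MoEModel` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    name: str
    total_params_b: float   # all experts, in billions; drives memory
    active_params_b: float  # parameters used per token, in billions; drives compute

    def weights_vram_gb(self, bytes_per_param: float = 2.0) -> float:
        """Weights-only VRAM: every expert must be resident, so use the total."""
        return self.total_params_b * bytes_per_param

    def active_fraction(self) -> float:
        """Share of the weights that participates in each token's forward pass."""
        return self.active_params_b / self.total_params_b


# total_params_b is from the table; active_params_b (~13B) is an assumption.
mixtral = MoEModel("Mixtral 8x7B", total_params_b=47, active_params_b=13)

print(f"{mixtral.name}: {mixtral.weights_vram_gb():.0f} GB at fp16")
print(f"active share per token: {mixtral.active_fraction():.0%}")
```

Under these figures the model occupies memory like a 47B model while doing per-token work closer to that of a 13B one.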
Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).