Quantization and cost: int8/int4 economics

Francesco Zinghinì · Updated Jun 25, 2026 · 8 min read

Quantization is the single most effective cost lever in self-hosted inference, and it works by attacking the one resource that decides which GPU you have to rent: VRAM. By storing each model weight in fewer bits, quantization can shrink a model enough to drop it onto a cheaper GPU tier, free up a second card, and raise throughput at the same time — all of which lower the cost per token. This guide stays strictly on the economics: the bytes-per-parameter math, how it reshapes GPU choice and throughput, and when the savings are large enough to flip the API-vs-self-hosting verdict. Quality is a caveat we will flag, not the subject.

Why VRAM is the cost that matters

For self-hosted inference, the GPU you must rent is dictated first and foremost by whether the model fits in its memory. A model that needs 140GB will not run on an 80GB card no matter how cheap that card is per hour — you are forced up to multiple cards or a larger instance, and the fixed monthly cost jumps accordingly. Conversely, anything that shrinks the model's memory footprint can move you down a GPU tier, and GPU tiers are where the big money lives. Quantization is the most direct way to do that shrinking.

The bytes-per-parameter math

A model's memory footprint is its weight memory (parameter count multiplied by the bytes each parameter takes) plus headroom for activations, the KV cache, and runtime overhead. The estimate is:

VRAM estimate.
VRAM ≈ parameters (in billions) × bytes per parameter × 1.2 (overhead)

Bytes per parameter: fp16 = 2 | int8 = 1 | int4 = 0.5

70B model, fp16: 70 × 2 × 1.2 ≈ 168GB (≈140GB raw weights)
70B model, int8: 70 × 1 × 1.2 ≈ 84GB
70B model, int4: 70 × 0.5 × 1.2 ≈ 42GB

The pattern is linear and easy to reason about: int8 halves the footprint versus fp16, and int4 halves it again. A 70B model carries roughly 140GB of raw fp16 weights; quantize to int4 and the whole thing fits in about 42GB once overhead is included. The VRAM & model-fit calculator runs this for any parameter count and tells you which GPUs in the dataset can hold the result.
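The estimate is simple enough to script. The sketch below is a minimal illustration of the formula above, not the calculator's actual implementation; the function name `vram_estimate_gb` and the lookup table are hypothetical names chosen for this example.

```python
# Bytes per parameter, as listed in the estimate above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
OVERHEAD = 1.2  # headroom for activations, KV cache, and runtime overhead


def vram_estimate_gb(params_billions: float, precision: str,
                     overhead: float = OVERHEAD) -> float:
    """Estimated VRAM in GB: parameters (billions) x bytes per parameter x overhead."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead


# Reproduce the 70B rows, showing raw weights next to the padded estimate.
for precision in BYTES_PER_PARAM:
    raw = vram_estimate_gb(70, precision, overhead=1.0)
    padded = vram_estimate_gb(70, precision)
    print(f"70B {precision}: {raw:.0f}GB raw weights, {padded:.0f}GB with overhead")
```

Passing `overhead=1.0` returns raw weight memory, which makes it easier to keep the 140GB (raw) and 168GB (with headroom) figures from getting mixed up.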

How shrinking VRAM moves you down the GPU price ladder

The footprint number only matters because it changes which GPU you rent — and that is where the cost actually shifts. Walk the 70B example down the precision ladder:

fp16 (~140–168GB): exceeds a single 80GB card. The raw weights fit across two 80GB GPUs (160GB), but the full 168GB estimate does not, so plan on two cards with a very tight KV-cache budget or a larger multi-GPU instance. Either way you pay for at least two cards' worth of hourly cost and take on the complexity and overhead of multi-GPU serving.
int8 (~84GB): still just over 80GB, so typically still two cards or one larger card. The savings on weight memory and bandwidth are real, but the GPU count may not drop.
int4 (~42GB): fits comfortably on a single 80GB card, with room to spare for a healthy KV cache. One GPU instead of two roughly halves the fixed monthly cost.
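The ladder above can be checked with a short card-count sketch. It assumes identical 80GB cards whose memory pools perfectly, which ignores the per-card overhead of real multi-GPU serving; `cards_needed` is a hypothetical helper name, not part of any tool mentioned here.

```python
import math


def cards_needed(footprint_gb: float, card_gb: float = 80.0) -> int:
    """Smallest number of identical cards whose combined memory holds the footprint.

    Assumes memory pools perfectly across cards (an idealisation).
    """
    return math.ceil(footprint_gb / card_gb)


# 70B footprints in GB, taken from the precision ladder above.
ladder = {
    "fp16, raw weights": 140.0,
    "fp16, 1.2x headroom": 168.0,
    "int8, 1.2x headroom": 84.0,
    "int4, 1.2x headroom": 42.0,
}

for label, gb in ladder.items():
    print(f"{label}: {gb:.0f}GB -> {cards_needed(gb)} x 80GB")
```

Under these assumptions the fp16 case needs two cards at the raw-weights end of its range and a third once the full 1.2× headroom is counted, so the two-card figure is best read as a floor.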

Halving the fixed cost is enormous, because in the self-hosting model that fixed cost is the numerator of everything. Recall the structure: self-host monthly = $/hour × 730 × utilization, and effective $/1M = ($/hour ÷ 3600) ÷ (tok_s × util) × 1,000,000. Cutting the GPU count from two to one halves the $/hour term, which halves the monthly cost and halves the effective cost per token before any throughput effect is counted. Compare the candidate GPUs and their hourly rates in the GPU pricing dataset to see how large a tier-jump is in dollars.
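As a rough sketch of the monthly-cost formula just stated, the snippet below compares a two-card and a one-card setup. The `gpu_count` parameter is an addition for illustration (the formula itself only has a single $/hour term), and the $1.50/hour rate and 30% utilization are example inputs rather than quoted prices.

```python
HOURS_PER_MONTH = 730


def self_host_monthly(dollars_per_hour: float, utilization: float,
                      gpu_count: int = 1) -> float:
    """Self-host monthly cost = $/hour x 730 x utilization, summed over cards."""
    return dollars_per_hour * gpu_count * HOURS_PER_MONTH * utilization


# Example inputs (assumed): $1.50/hour per card at 30% utilization.
two_cards = self_host_monthly(1.50, 0.30, gpu_count=2)
one_card = self_host_monthly(1.50, 0.30, gpu_count=1)

print(f"two cards: ${two_cards:.2f}/month")
print(f"one card:  ${one_card:.2f}/month")
print(f"ratio:     {one_card / two_cards:.2f}")
```

Because the card count multiplies the whole expression, dropping from two cards to one halves the result regardless of the utilization you plug in.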

The throughput effect: cheaper tokens, faster

Quantization does not only change which GPU you rent; it usually makes that GPU produce more tokens per second. LLM token generation is largely memory-bandwidth bound: for each token, the GPU streams the model's weights through its compute units, so the volume of weight bytes moved per token sets the pace. Halve the bytes per weight and each token needs about half the memory traffic, so at the same bandwidth tokens-per-second can rise by up to roughly 2×.
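A back-of-envelope way to see the bandwidth argument: if every generated token requires streaming all weights once, memory bandwidth divided by weight bytes gives a ceiling on the decode rate of a single sequence. The sketch below assumes exactly that idealised model. The 2,000 GB/s bandwidth figure is a hypothetical round number, not a spec for any particular GPU, and batched serving reaches far higher aggregate throughput than this per-sequence ceiling.

```python
def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Idealised per-sequence ceiling: one full pass over the weights per token."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 2000.0  # hypothetical memory bandwidth, for illustration only

# Raw 70B weight sizes in GB (70 x bytes per parameter, no overhead).
weights_gb = {"fp16": 140.0, "int8": 70.0, "int4": 35.0}

baseline = decode_ceiling_tok_s(weights_gb["fp16"], BANDWIDTH_GB_S)
for precision, gb in weights_gb.items():
    ceiling = decode_ceiling_tok_s(gb, BANDWIDTH_GB_S)
    print(f"{precision}: ceiling {ceiling:.1f} tok/s per sequence "
          f"({ceiling / baseline:.1f}x fp16)")
```

The absolute numbers depend entirely on the assumed bandwidth; the useful part is the ratio, which doubles each time the bytes per weight are halved.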

That matters because throughput is in the denominator of the cost-per-token formula:

Self-host $/1M = ($/hour ÷ 3600) ÷ (tokens_per_sec × utilization) × 1,000,000

$1.50/hour, 2,000 tok/s, 100% util → ≈ $0.21 / 1M
Same GPU, quantized to ~3,000 tok/s → ≈ $0.14 / 1M

So quantization can hit the cost from two directions at once: a smaller, cheaper GPU (lower numerator) and higher throughput on it (larger denominator). The throughput cost calculator lets you plug in the tokens-per-second you measure after quantizing and read off the real $/1M. The gains depend on hardware and kernel support — int4 throughput especially varies — so measure on your actual setup rather than assuming the textbook 2×.
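The cost-per-token formula above translates directly into a few lines of code. This is a minimal sketch of that formula, not the calculator's own code; `cost_per_million` is a hypothetical function name, and the inputs are the example figures from this section.

```python
def cost_per_million(dollars_per_hour: float, tokens_per_sec: float,
                     utilization: float = 1.0) -> float:
    """Self-host $/1M = ($/hour / 3600) / (tokens_per_sec x utilization) x 1,000,000."""
    return (dollars_per_hour / 3600) / (tokens_per_sec * utilization) * 1_000_000


before = cost_per_million(1.50, 2000)  # unquantized example
after = cost_per_million(1.50, 3000)   # same GPU after quantizing

print(f"2,000 tok/s: ${before:.2f} / 1M")
print(f"3,000 tok/s: ${after:.2f} / 1M")
```

Replace the tokens-per-second arguments with figures measured on your own hardware; the 2,000 and 3,000 values are illustrative only.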

When quantization flips the verdict

Put the two effects together and quantization can change a self-hosting decision outright. Recall that the break-even volume is self-host monthly ÷ blended API price × 1,000,000. Anything that lowers the self-hosting monthly cost lowers the break-even, meaning self-hosting starts winning at a smaller volume — possibly one you already exceed.

A concrete sketch: a 70B model in fp16 forces two GPUs at, say, $1.50/hour each, so $3.00/hour combined → $657/month even at just 30% utilization. Against a $3.00/1M blended API price, break-even sits around 219M tokens/month — a high bar. Quantize to int4, fit one GPU at $1.50/hour → $328.50/month, and break-even drops to roughly 109.5M tokens/month. A team running, say, 150M tokens a month was on the wrong side of break-even in fp16 and on the right side in int4. The verdict flipped without changing the model, the workload, or the API price — only the precision.
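The arithmetic in this sketch can be reproduced as follows. It is an illustration of the break-even formula using the example figures above ($657 and $328.50 per month, a $3.00/1M blended API price, 150M tokens a month); the function name `break_even_tokens` is hypothetical.

```python
def break_even_tokens(monthly_cost: float, api_price_per_million: float) -> float:
    """Break-even volume = self-host monthly / blended API price x 1,000,000."""
    return monthly_cost / api_price_per_million * 1_000_000


API_PRICE = 3.00               # $ per 1M tokens, blended (example figure)
MONTHLY_VOLUME = 150_000_000   # tokens per month (example figure)

scenarios = {"fp16, two GPUs": 657.00, "int4, one GPU": 328.50}

for label, monthly in scenarios.items():
    threshold = break_even_tokens(monthly, API_PRICE)
    verdict = "self-hosting wins" if MONTHLY_VOLUME > threshold else "stay on the API"
    print(f"{label}: break-even {threshold / 1e6:.1f}M tokens/month -> {verdict}")
```

The same volume sits below the fp16 threshold and above the int4 one, which is the flip described above.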

The one caveat: quality is not free. Lower precision can degrade output, and the risk grows as you go more aggressive — int8 is frequently near-lossless, int4 is more variable and task-dependent, and below int4 the loss becomes hard to ignore. This guide measures only dollars; it does not benchmark accuracy. A cheaper token that returns a worse answer is not a saving. Before you bank any of the cost reductions above, validate the quantized model's output on your task — and weigh any quality drop against the savings, not in isolation.

Putting it to work

The workflow for using quantization as a cost lever is short:

Compute the footprint at fp16, int8 and int4 with the bytes-per-parameter math, and check which GPUs each fits with the VRAM & model-fit calculator.
Price the GPU tier change: the savings are almost entirely about whether you drop a card or a tier. Consult the GPU pricing dataset.
Measure throughput on the quantized model and read the real cost per token from the throughput cost calculator.
Recompute break-even with the new, lower monthly cost — and only then decide, after validating that quality holds for your task.

Used this way, quantization is the rare lever that improves nearly every term in the self-hosting cost equation at once. The discipline is to capture the savings in the math and the quality risk in your testing — never one without the other.

Frequently asked questions

How does quantization reduce GPU cost?

Quantization stores each model weight in fewer bits, which shrinks the VRAM the model needs. Less VRAM means the model fits on a smaller, cheaper GPU, or fits on one GPU where it previously needed two, and that directly lowers the fixed hourly cost you divide across your tokens. A 70B model needs ~168GB in fp16 but only ~42GB in int4 (both with 1.2× overhead), the difference between multiple high-end GPUs and a single mid-range one.

How much VRAM does a quantized model need?

Roughly parameters (in billions) × bytes per parameter × 1.2 for overhead. Bytes per parameter are 2 for fp16, 1 for int8, and 0.5 for int4. So a 70B model has about 140GB of raw weights in fp16, 70GB in int8, and 35GB in int4; with the 1.2× headroom included that becomes roughly 168GB, 84GB, and 42GB. The VRAM & model-fit calculator works this out for any model and GPU.

Does quantization make inference faster or slower?

It usually raises throughput. LLM generation is largely memory-bandwidth bound, so moving fewer bytes per weight per token lets the GPU produce more tokens per second — which lowers the cost per token. The gain depends on hardware and kernel support, and very aggressive quantization can occasionally bottleneck elsewhere, but the typical direction is faster and cheaper.

When does quantization flip the API-vs-self-hosting verdict?

When it drops a model onto a cheaper GPU tier or frees up a second GPU, the self-hosting fixed cost can fall by half or more. Combined with higher throughput, the effective $/1M can drop enough to push the break-even volume below your actual usage — turning a "stay on the API" verdict into a "self-hosting wins" one.

What is the catch with quantization?

Quality. Lower precision can degrade output, and the loss grows as you go more aggressive — int8 is often nearly lossless, int4 more variable, and below that increasingly risky. This guide treats quantization as an economic lever; always validate output quality on your own task before banking the savings, because a cheaper token that gives wrong answers is not a saving.

Sources & pricing references

Disclaimer. LLMTCO provides cost estimates and planning tools for informational purposes only. AI API and GPU prices change frequently; bundled defaults reflect publicly listed prices as of the verification date shown (Jun 25, 2026) and may be out of date — always confirm current pricing with the provider. These figures are estimates, not financial, tax, or procurement advice, and do not capture every real-world factor (latency, reliability, compliance, data privacy, engineering time).