Guides
The reasoning behind the numbers — how inference cost works, and how to read the calculators.
How LLM inference cost is calculated
Token math, input vs output pricing, and the $/1M-tokens model that underpins every LLM cost estimate.

API vs self-hosting: how break-even actually works
Fixed vs marginal cost, the crossover volume, and why the break-even point is all about utilization.

The hidden costs of self-hosting LLMs
Idle GPU time, DevOps and engineering hours, reliability and redundancy: why the hourly rate is not the real cost.

Quantization and cost: int8/int4 economics
How quantization shrinks VRAM and shifts throughput, and when it changes the API-vs-self-hosting verdict.

Batching, caching & throughput: cutting $/token
How batch processing, prompt caching and higher throughput lower the effective cost per token.

When does self-hosting actually pay off?
A decision framework that weighs cost break-even against latency, compliance, privacy and engineering time.

GPU sizing for LLM inference: the VRAM math
Parameters times bytes per parameter plus KV-cache headroom: how to size GPUs for any model.
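
As a rough illustration of two formulas named in the guide list above, the $/1M-tokens pricing model and the VRAM sizing rule, here is a minimal Python sketch. The function names, the example prices, and the treatment of KV-cache headroom as a fixed fraction of weight memory are assumptions made for this example; the guides themselves may define these quantities differently.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Cost of one request when prices are quoted in USD per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def estimate_vram_gb(params_billions, bytes_per_param, kv_cache_headroom=0.2):
    """Parameters times bytes per parameter, plus KV-cache headroom.

    Assumption: headroom is modeled as a flat fraction of weight memory.
    Real KV-cache size depends on context length, batch size and the
    model architecture.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + kv_cache_headroom)


if __name__ == "__main__":
    # Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
    print(f"request cost: ${request_cost_usd(2000, 500, 3.0, 15.0):.4f}")
    # A 7B-parameter model at fp16 (2 bytes) vs int4 (0.5 bytes).
    print(f"fp16: {estimate_vram_gb(7, 2.0):.1f} GB")
    print(f"int4: {estimate_vram_gb(7, 0.5):.1f} GB")
```

Keeping input and output prices as separate arguments reflects the input vs output pricing split mentioned above, and comparing 2 bytes against 0.5 bytes per parameter shows why quantization can change the sizing result.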