Quantization trades numerical precision for efficiency. Weights stored at lower bit depths consume less memory and move through caches faster, often reducing inference latency. If a model has \(N\) billion parameters at \(b_o\) bits each, its approximate memory requirement in gigabytes is \(M_o = N \times b_o / 8\).
Quantizing to \(b_q\) bits changes the memory to \(M_q = N \times b_q / 8\). The savings are \(S = M_o - M_q\), and the percent reduction is \((S / M_o) \times 100\%\), which for a fixed parameter count simplifies to \((1 - b_q / b_o) \times 100\%\).
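These memory formulas are easy to sanity-check in a few lines. The sketch below is a minimal illustration, assuming the parameter count is given in billions and a gigabyte is treated as \(10^9\) bytes; the function and variable names are placeholders, not part of any library.

```python
def quantized_memory_savings(n_billion_params: float, b_orig: int, b_quant: int):
    """Estimate memory footprints (GB) before/after quantization and the savings."""
    m_orig = n_billion_params * b_orig / 8    # original weights, GB
    m_quant = n_billion_params * b_quant / 8  # quantized weights, GB
    savings = m_orig - m_quant                # absolute savings, GB
    percent = savings / m_orig * 100          # percent reduction
    return m_orig, m_quant, savings, percent

# Example: a 7B-parameter model quantized from 16-bit to 8-bit weights
print(quantized_memory_savings(7, 16, 8))     # (14.0, 7.0, 7.0, 50.0)
```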
Latency improvements scale roughly with the bit-width ratio, so the quantized latency \(L_q\) becomes \(L_q = L_o \times b_q / b_o\). When inference cost is dominated by compute, generating a thousand tokens takes \(L_q \times 1000\) milliseconds; converting that to hours and pricing it gives a cost per thousand tokens of \(C_{1k} = \frac{L_q \times 1000}{3{,}600{,}000} \times K\),
where \(K\) is the hourly hardware cost. These equations simplify reality—calibration data, dequantization, and layer-specific behavior also matter—but they provide a quick north star for sizing hardware or budgeting inference workloads.
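A companion sketch for the latency and cost estimates, with the same caveat that these are rough scaling rules. The function name is a placeholder, and the $2.50/hour rate is an illustrative assumption chosen so the output lines up with the example table below.

```python
def quantized_latency_and_cost(latency_ms_orig: float, b_orig: int, b_quant: int,
                               hourly_cost_usd: float):
    """Estimate per-token latency after quantization and cost per 1,000 tokens."""
    latency_ms_quant = latency_ms_orig * b_quant / b_orig  # bit-width ratio scaling
    ms_per_1k = latency_ms_quant * 1000                    # time to emit 1,000 tokens
    cost_per_1k = ms_per_1k / 3_600_000 * hourly_cost_usd  # ms -> hours, then price
    return latency_ms_quant, cost_per_1k

# Example: 30 ms/token at 16-bit, quantized to 8-bit, on a ~$2.50/hour instance
lat, cost = quantized_latency_and_cost(30.0, 16, 8, 2.50)
print(f"{lat:.1f} ms/token, ${cost:.3f} per 1k tokens")  # 15.0 ms/token, $0.010 per 1k tokens
```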
| Metric | Original | Quantized |
|---|---|---|
| Memory (GB) | 14.00 | 7.00 |
| Latency (ms/token) | 30.0 | 15.0 |
| Cost per 1k tokens ($) | 0.021 | 0.010 |
Dive deeper by pairing this calculator with the Cloud GPU Rental Cost Calculator, the Model Ensemble Inference Cost Calculator, and the Batch Inference Throughput & Latency Calculator to evaluate hardware provisioning, ensemble strategies, and workload batching in tandem with quantization.