Quantization and Performance Trade-offs in LLM Serving
How model quantization (INT8, INT4, and similar) trades accuracy for latency, throughput, and memory savings, and how to evaluate the trade-off.
Quantization reduces the numerical precision used to store and compute a model’s weights (and sometimes activations), from the original 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower. It trades some accuracy for a substantially smaller memory footprint and for higher throughput and lower latency.
Why quantization helps performance, mechanically
Smaller weight representations mean less data to move from memory to the compute units per operation. Because LLM decode is typically memory-bandwidth-bound, this directly addresses the bottleneck covered in this site’s GPU utilization article. A model quantized to 4-bit weights moves roughly a quarter of the weight data per decode step compared with 16-bit weights, which can translate into meaningfully higher achievable throughput on the same hardware. It also lets larger models fit on smaller, cheaper GPUs that could not hold the full-precision version in memory at all.
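As a rough illustration of that arithmetic, the sketch below estimates the weight footprint and a bandwidth-only ceiling on single-sequence decode speed at different bit-widths. The 7B parameter count and 1 TB/s memory bandwidth are hypothetical placeholder values, and the model ignores KV cache reads, activations, quantization scales, and compute limits, so the numbers are an upper bound rather than a prediction.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight / 8


def decode_tokens_per_sec_ceiling(
    num_params: float,
    bits_per_weight: float,
    mem_bandwidth_bytes_per_sec: float,
) -> float:
    """Bandwidth-only ceiling on decode speed for one sequence.

    Assumes every weight is read from memory once per generated token
    and that nothing else (compute, KV cache, scheduling) is limiting.
    """
    return mem_bandwidth_bytes_per_sec / weight_bytes(num_params, bits_per_weight)


PARAMS = 7e9       # hypothetical 7B-parameter model
BANDWIDTH = 1e12   # hypothetical 1 TB/s of memory bandwidth

for bits in (16, 8, 4):
    gigabytes = weight_bytes(PARAMS, bits) / 1e9
    ceiling = decode_tokens_per_sec_ceiling(PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: {gigabytes:5.1f} GB of weights, "
          f"at most {ceiling:6.1f} tokens/s per sequence")
```

Under these assumptions the ceiling scales inversely with bit-width, so 4-bit weights allow up to four times the 16-bit figure. Measured gains are usually smaller.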
The accuracy cost, and why it’s not always proportional to bit-width
Lower bit-width quantization generally degrades output quality, but the relationship is not simply linear. Well-implemented techniques such as GPTQ and AWQ, which use calibration data and per-channel or per-group scaling, can keep quality degradation surprisingly small even at aggressive bit-widths (4-bit and sometimes lower). Naive uniform quantization at the same bit-width can degrade quality noticeably more. The quantization method, not just the bit-width, largely determines the accuracy-performance trade-off you actually get.
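To make the effect of scaling granularity concrete, here is a minimal sketch comparing one scale for a whole tensor against one scale per group of 128 values, both at 4-bit. It uses plain symmetric round-to-nearest on synthetic weights with a few injected outliers; it is not GPTQ or AWQ, which add calibration-driven steps on top, and the weight distribution and group size are assumptions chosen for illustration.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with a single scale for all values.

    Returns the values after a quantize -> dequantize round trip.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_dequantize_grouped(values, group_size, bits=4):
    """Same scheme, but each consecutive group gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic weights (an assumption): mostly small values plus rare outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
for i in range(0, len(weights), 512):
    weights[i] = 1.0

mse_tensor = mean_squared_error(weights, quantize_dequantize(weights))
mse_group = mean_squared_error(
    weights, quantize_dequantize_grouped(weights, group_size=128)
)
print(f"per-tensor MSE: {mse_tensor:.2e}")
print(f"per-group  MSE: {mse_group:.2e}")
```

With a single scale, the outliers stretch the quantization grid so far that most small weights collapse to zero. Per-group scales confine that damage to the groups that actually contain an outlier.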
Where quantization tends to hurt accuracy most
Quality degradation from quantization is often unevenly distributed across tasks. Some studies and practical experience suggest that tasks requiring precise numerical reasoning or very long-context coherence tend to be more sensitive to aggressive quantization than general conversational or summarization tasks. Evaluate quantized models against the quality bar of your specific downstream task rather than generic benchmark scores alone, since generic benchmarks may not reflect your particular sensitivity profile.
Weight-only vs weight-and-activation quantization
Some approaches quantize only the stored weights, dequantizing them on the fly and computing the forward pass in higher precision. Others also quantize activations for further throughput gains. Weight-only quantization is generally safer for accuracy, and most production LLM quantization in practice is weight-only or close to it. Quantizing activations as well pushes further on performance, typically at greater accuracy risk.
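The weight-only idea can be sketched in a few lines: weights are stored as small integer codes plus a scale, and are converted back to floating point only at the moment they are multiplied with full-precision activations. This is a toy pure-Python illustration with a per-row (per-channel) scale, not how any particular serving kernel is implemented; the function names and example matrix are hypothetical.

```python
def quantize_row(row, bits=4):
    """Quantize one weight row to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def weight_only_matvec(code_rows, scales, activations):
    """Compute y = W x with W stored as integer codes and per-row scales.

    Each weight is dequantized on the fly (code * scale); the
    activations are never quantized and stay in floating point.
    """
    out = []
    for codes, scale in zip(code_rows, scales):
        acc = 0.0
        for code, x in zip(codes, activations):
            acc += (code * scale) * x
        out.append(acc)
    return out


# Hypothetical 2x3 weight matrix and activation vector.
W = [[0.5, -0.25, 0.125],
     [-1.0, 0.75, 0.0]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row(row) for row in W]
code_rows = [codes for codes, _ in quantized]
scales = [scale for _, scale in quantized]

y_quant = weight_only_matvec(code_rows, scales, x)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print("full precision:", y_ref)
print("weight-only 4-bit:", y_quant)
```

Only the stored codes shrink here. A weight-and-activation scheme would also map `x` to integers so the inner product could run in integer arithmetic, which is where both the extra speed and the extra accuracy risk come from.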
Quantization-aware serving support
Not every inference server and hardware combination supports every quantization format efficiently. Before committing, confirm that your serving stack (vLLM, TGI, or others) has good kernel support for the specific quantization method you’re considering. A format with poor kernel support can end up slower than full precision despite its smaller memory footprint, because of dequantization overhead during compute.
A practical evaluation approach
- Establish a quality baseline on full precision against your actual task/eval set, not just a generic benchmark.
- Quantize using a well-supported, calibration-based method (not naive uniform quantization) appropriate to your serving stack.
- Re-run the same quality evaluation against the quantized model and compare directly.
- Measure actual throughput and latency gains on your real serving hardware. A theoretical bit-width reduction doesn’t always translate 1:1 into measured gains; the result depends on kernel support and on whether the workload was memory-bandwidth-bound to begin with.
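The steps above can be sketched as a small comparison harness. Everything here is an assumption for illustration: `generate` is a hypothetical callable that takes a prompt and returns the answer text plus the number of output tokens, scoring is plain exact match, the 2% accuracy budget is arbitrary, and the two stub models stand in for real calls to your serving stack.

```python
import time
from collections import defaultdict


def evaluate(generate, eval_set):
    """Return (per-task accuracy, tokens per second) for one model.

    `generate` is a hypothetical callable: prompt -> (text, num_tokens).
    """
    correct, total = defaultdict(int), defaultdict(int)
    tokens = 0
    start = time.perf_counter()
    for example in eval_set:
        answer, n_tokens = generate(example["prompt"])
        tokens += n_tokens
        total[example["task"]] += 1
        correct[example["task"]] += int(answer.strip() == example["expected"])
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    accuracy = {task: correct[task] / total[task] for task in total}
    return accuracy, tokens / elapsed


def compare(baseline, quantized, eval_set, max_drop=0.02):
    """Per-task accuracy drop plus measured speedup of quantized vs baseline."""
    base_acc, base_tps = evaluate(baseline, eval_set)
    quant_acc, quant_tps = evaluate(quantized, eval_set)
    report = {}
    for task, acc in base_acc.items():
        drop = acc - quant_acc[task]
        report[task] = {"baseline": acc, "quantized": quant_acc[task],
                        "drop": drop, "within_budget": drop <= max_drop}
    return report, quant_tps / base_tps


# Tiny illustrative eval set and stub models (placeholders, not real data).
EVAL_SET = [
    {"task": "arithmetic", "prompt": "2+2", "expected": "4"},
    {"task": "arithmetic", "prompt": "17*3", "expected": "51"},
    {"task": "summarization", "prompt": "cats sleep a lot", "expected": "cats sleep"},
]
BASE_ANSWERS = {"2+2": "4", "17*3": "51", "cats sleep a lot": "cats sleep"}
QUANT_ANSWERS = {"2+2": "4", "17*3": "50", "cats sleep a lot": "cats sleep"}


def stub_baseline(prompt):
    return BASE_ANSWERS[prompt], 1


def stub_quantized(prompt):
    return QUANT_ANSWERS[prompt], 1


report, speedup = compare(stub_baseline, stub_quantized, EVAL_SET)
for task, row in report.items():
    print(task, row)
```

Reporting the drop per task rather than as one aggregate keeps a regression in a sensitive task, such as the arithmetic stub here, from being averaged away by tasks that are unaffected. The speedup from stubs is meaningless; it only becomes informative when both callables hit real models on the target hardware.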
Takeaway: quantization is one of the highest-leverage cost/performance levers in LLM serving, but its accuracy cost depends heavily on the specific method and on your specific task. Evaluate empirically against your own use case rather than assuming that the commonly cited “minimal accuracy loss” claim for a given bit-width transfers directly to your situation.