arXiv cs.LG
5/11/2026
RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory


Short summary

RateQuant targets a major LLM serving bottleneck by optimizing mixed-precision KV cache quantization: it allocates different bit-widths to attention heads using rate-distortion theory, with a distortion model matched to each quantizer. On Qwen3-8B at an average of 2.5 bits, it lowers perplexity from 49.3 to 14.9, a reduction of roughly 70%, and improves on QuaRot by 6.6 PPL; KIVI is among the prior methods compared. Full calibration takes 1.6 seconds on a single GPU and adds no inference-time overhead.

  • Addresses the distortion-model mismatch problem in mixed-precision KV cache quantization using rate-distortion theory
  • 70% perplexity reduction on Qwen3-8B (49.3 → 14.9 PPL) at 2.5 average bits; 6.6 PPL improvement over QuaRot
  • Single-GPU calibration in 1.6 seconds with zero inference-time overhead
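
To make the idea of rate-distortion bit allocation concrete, here is a minimal Python sketch of a generic greedy allocator that spreads a fixed average bit budget across attention heads. It is not RateQuant's algorithm: the function name `allocate_bits`, the per-head `sensitivities`, the candidate bit-widths, and the textbook distortion model D(b) = s · 2^(-2b) are all assumptions made for illustration. The summary above indicates that RateQuant instead fits a distortion model to each quantizer, which this sketch does not attempt.

```python
import heapq


def allocate_bits(sensitivities, avg_bits, choices=(2, 3, 4)):
    """Greedy bit allocation under an average-bit budget (illustrative only).

    Assumes the textbook distortion model D_i(b) = s_i * 2**(-2*b), where
    s_i is a hypothetical per-head sensitivity. This is NOT the per-quantizer
    distortion model used by RateQuant.
    """
    levels = sorted(choices)
    n = len(sensitivities)
    budget = int(avg_bits * n + 1e-9)   # total bits available, rounded down
    idx = [0] * n                       # every head starts at the lowest width
    used = levels[0] * n
    if used > budget:
        raise ValueError("budget is below the minimum bit-width")

    def distortion(i, bits):
        return sensitivities[i] * 2.0 ** (-2 * bits)

    def gain(i):
        # Distortion removed per extra bit if head i moves up one level.
        cur, nxt = levels[idx[i]], levels[idx[i] + 1]
        return (distortion(i, cur) - distortion(i, nxt)) / (nxt - cur)

    # Max-heap (via negated gains) of the next possible upgrade for each head.
    heap = [(-gain(i), i) for i in range(n)] if len(levels) > 1 else []
    heapq.heapify(heap)

    while heap:
        _, i = heapq.heappop(heap)
        cost = levels[idx[i] + 1] - levels[idx[i]]
        if used + cost > budget:
            continue                    # cannot afford this upgrade; drop it
        idx[i] += 1
        used += cost
        if idx[i] + 1 < len(levels):
            heapq.heappush(heap, (-gain(i), i))

    return [levels[k] for k in idx]


# Four hypothetical heads sharing an average of 2.5 bits.
bits = allocate_bits([8.0, 4.0, 1.0, 0.5], avg_bits=2.5)
print(bits, sum(bits) / len(bits))
```

In this toy setting the two most sensitive heads receive the extra bits while the average stays at the budget. Because the assumed distortion curve is convex in the bit-width and each upgrade costs one bit, always taking the upgrade with the largest marginal gain is a reasonable strategy; a different distortion model would change which heads are upgraded.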

