arXiv cs.LG
5/11/2026

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
Short summary
LKV replaces heuristic KV cache compression with end-to-end learning of head-wise budgets and token selection, retaining only 15% of the KV cache with near-lossless performance on long-context benchmarks. The approach aligns compression directly with task objectives rather than relying on static allocation rules. Product teams optimizing LLM inference efficiency should review this methodology.
- Learned budgeting outperforms hand-crafted heuristics for KV cache allocation
- Retains only 15% of the KV cache with minimal performance degradation on LongBench and RULER
- Differentiable optimization framework enables task-aware compression strategies
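
To make the idea of head-wise budgets plus token selection concrete, here is a minimal NumPy sketch. It is not LKV's implementation: the summary does not describe the paper's architecture or training procedure, so the per-head logits, the random token scores, and the function names `allocate_head_budgets` and `select_tokens` are all hypothetical. It also uses a hard top-k, which is not differentiable, so it only illustrates what applying a learned budget at inference time might look like.

```python
import numpy as np

def allocate_head_budgets(head_logits, total_budget):
    """Split a total token budget across heads in proportion to softmax(head_logits)."""
    logits = np.asarray(head_logits, dtype=np.float64)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    # Floor each head's share, then give leftover slots to the largest remainders
    raw = weights * total_budget
    budgets = np.floor(raw).astype(int)
    leftover = int(total_budget - budgets.sum())
    order = np.argsort(-(raw - budgets))
    budgets[order[:leftover]] += 1
    return budgets

def select_tokens(token_scores, budgets):
    """For each head, keep the indices of its highest-scoring tokens, in sequence order."""
    kept = []
    for scores, k in zip(token_scores, budgets):
        k = min(int(k), len(scores))  # a head cannot keep more tokens than exist
        top = np.argsort(-scores, kind="stable")[:k]
        kept.append(np.sort(top))
    return kept

# Hypothetical setup: 4 heads, 100 cached tokens each, 15% overall retention
num_heads, seq_len, retention = 4, 100, 0.15
total_budget = int(round(retention * num_heads * seq_len))

rng = np.random.default_rng(0)
head_logits = np.array([0.0, 1.0, -1.0, 0.5])    # stand-in for learned per-head parameters
token_scores = rng.random((num_heads, seq_len))  # stand-in for learned token importance

budgets = allocate_head_budgets(head_logits, total_budget)
kept = select_tokens(token_scores, budgets)
```

In this sketch, heads with larger logits keep more tokens while the overall retention stays at 15%. A static heuristic would instead fix the same budget for every head.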
