tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

Short summary

tierKV intercepts KV blocks as they are evicted from the GPU cache, quantizes them to INT8, stores them in RAM on LAN-adjacent machines, and later injects them directly into the inference buffers. Because this bypasses attention recomputation, restoration is about 20× faster than cold prefill (0.52 s vs 10.75 s on 30k tokens). It integrates with vLLM and EXO through their plugin APIs, with zero source changes. The projected speedup reaches 35× on 128k-token contexts, since restoration is O(n) while prefill scales as O(n²).
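The summary does not say which INT8 scheme tierKV uses, so the following is only a minimal sketch of one common choice: symmetric quantization with one scale per token row. The function names, block shape, and use of NumPy are illustrative assumptions, not tierKV's API.

```python
import numpy as np

def quantize_int8(block):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    block = np.asarray(block, dtype=np.float32)
    max_abs = np.max(np.abs(block), axis=-1, keepdims=True)
    # Guard all-zero rows so the division below is always defined.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float32 block from INT8 values and row scales."""
    return q.astype(np.float32) * scale

# Hypothetical KV block: 16 tokens x 128 head dimensions.
rng = np.random.default_rng(0)
kv_block = rng.standard_normal((16, 128)).astype(np.float32)

q, scale = quantize_int8(kv_block)
restored = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(restored - kv_block)))
```

Each restored element differs from the original by at most half a quantization step for its row. The INT8 payload is a quarter the size of float32 data and half the size of the 16-bit formats KV caches typically use, which is what makes shipping blocks over a LAN cheaper.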

• Quantized KV blocks stored in LAN RAM restore 20× faster than cold prefill and even outpace GPU cache hits by bypassing attention recomputation
• Zero-modification integration with vLLM and EXO via standard plugin APIs; 5-step setup with tierkv.toml configuration
• Speedup gap widens at longer contexts (35× projected at 128k tokens) since restoration is O(n) + network while prefill is O(n²)
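The scaling argument can be checked with a few lines of arithmetic. This sketch uses only the two timings quoted in the summary. It also uses a deliberately idealized model (an assumption, not the article's projection method) in which prefill is purely quadratic and restoration purely linear, with no fixed network cost.

```python
def speedup(prefill_s, restore_s):
    """Ratio of cold-prefill time to restore time."""
    return prefill_s / restore_s

def idealized_speedup(measured, n_measured, n_target):
    """Scale a measured speedup assuming pure O(n^2) prefill over pure O(n) restore.

    Under that assumption the ratio grows linearly with context length.
    """
    return measured * (n_target / n_measured)

# Timings quoted in the summary for a 30k-token context.
measured = speedup(10.75, 0.52)

# Idealized ceiling at 128k tokens (30k and 128k taken as 30_000 and 128_000).
ceiling = idealized_speedup(measured, 30_000, 128_000)

print(f"measured at 30k: {measured:.1f}x, idealized ceiling at 128k: {ceiling:.1f}x")
```

The idealized figure comes out well above the 35× projection. That is what one would expect if the projection also accounts for the network term and for the parts of prefill that scale linearly with context length.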

Generated with AI, which can make mistakes.

#ai-tools #open-source #research-breakthrough

Read full article at Dev.to
