"video compression" by analogy only, what this claims to actually do is delta encode the values in each token from the previous token.
Interesting idea, but the results seem almost suspicious? even accounting for the extra bits used to store the 16-bit start value for each block - ~5% for k=64
The code does funky things, like the encoder updates the reference value for each encoded token, using the non-quantized value! [1]
But the decoder just ignored all that. [2] how can this work?
This is really cool research, but I'm wondering how much it slows down inference. The readme says that it's "...distinguished by zero overhead (no learned components, no entropy coding)" but does that really mean that this is a "free win"?
This is cool. It makes storage of the KV cache much smaller, making it possible to keep more of it in fast memory.
Bandwidth-wise it is worse (more bytes accessed) to generate and do random recall on than the vanilla approach, and significantly worse than a quantized approach. That’s because the reference needs to be accessed.
I guess implied is that since the KV cache is smaller, the probability is higher that the parts it that are needed are in fast memory, and that bandwidth requirements of slow links is reduced, and performance goes up.
Would be interesting with a discussion about benefits/drawbacks of the approach. Ideally backed by data.
Nice, although perhaps slightly academic given that good KV cache compression algorithms already exist. Probably the frontier labs were using them for a long time already. Nice to have it in llama.cpp though.
I'm curious who "we" refers to. I can't see any authorship information or a paper and this is the user's only repository. Maybe it doesn't need one. Also interesting that it was developed and tested on AMD hardware.
The main utility of this beyond just saving money for model servers would be deliberately prefilling very long contexts and then saving them to fast flash so you can then later quickly load and query them. I think only Anthropic's API would give enough control to do this today, maybe Google's, OpenAI's makes caching fully implicit. Like one or two prompts per codebase or something like that, so you can then query the entire codebase in parallel with questions without needing grepping or RAG. Modern serving pipelines all use disaggregated prefill as far as I know so there are inter-machine transfers anyway, and it directly saves on GPU cost.
"video compression" by analogy only, what this claims to actually do is delta encode the values in each token from the previous token.
Interesting idea, but the results seem almost suspicious, even accounting for the extra bits used to store the 16-bit start value for each block (~5% for k=64).
The code does funky things: the encoder updates the reference value for each encoded token using the non-quantized value! [1] But the decoder just ignores all that. [2] How can this work?
[1] https://github.com/cenconq25/delta-compress-llm/commit/f185f...
[2] https://github.com/cenconq25/delta-compress-llm/commit/f185f...
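A toy sketch of that mismatch (hypothetical scheme and numbers; the `encode`/`decode` functions, the 0.05 step, and the sample values are all made up for illustration, not the repo's actual code):

```python
# Deltas are quantized to a fixed step; the encoder measures each delta
# against the TRUE previous value, while the decoder can only sum the
# quantized deltas from the block's start value (modeled with a plain
# float here rather than a real 16-bit value).

STEP = 0.05  # assumed quantization step for deltas

def encode(values):
    start = values[0]
    ref = values[0]                      # encoder reference: unquantized
    deltas = []
    for v in values[1:]:
        deltas.append(round((v - ref) / STEP))
        ref = v                          # updated with the non-quantized value
    return start, deltas

def decode(start, deltas):
    out = [start]
    for q in deltas:
        out.append(out[-1] + q * STEP)   # decoder ignores the encoder's refs
    return out

values = [0.10, 0.18, 0.33, 0.41]
start, deltas = encode(values)
decoded = decode(start, deltas)
# decoded ≈ [0.10, 0.20, 0.35, 0.45] vs the true [0.10, 0.18, 0.33, 0.41]:
# each step's quantization error carries forward, so reconstruction drift
# can grow with block length.  Measuring each delta against the decoder's
# (quantized) reference instead would keep the error bounded by one step.
```

In this toy version the open-loop reference lets errors accumulate, which is presumably why the results look suspicious at first glance.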
This is really cool research, but I'm wondering how much it slows down inference. The readme says it's "...distinguished by zero overhead (no learned components, no entropy coding)", but does that really mean this is a "free win"?
This is cool. It makes storage of the KV cache much smaller, making it possible to keep more of it in fast memory.
Bandwidth-wise it is worse (more bytes accessed) for generation and random recall than the vanilla approach, and significantly worse than a quantized approach, because the reference needs to be accessed as well.
I guess the implied argument is that, since the KV cache is smaller, the parts of it that are needed are more likely to be in fast memory, the bandwidth demand on slow links drops, and performance goes up.
Would be interesting to see a discussion of the benefits and drawbacks of the approach, ideally backed by data.
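The reference-access cost of random recall can be illustrated with back-of-envelope arithmetic. Every size below (128-dim vectors, 16-bit starts, 8-bit deltas, blocks of k=64) is an assumption for illustration, not a number from the project:

```python
# With per-block delta coding, reading one token mid-block means reading the
# block's 16-bit start vector plus every preceding delta in the block, since
# each value is reconstructed as a running sum.

HEAD_DIM = 128   # values per cached vector (assumed)
K = 64           # tokens per delta block (the k=64 from the thread)

def vanilla_bytes():
    # fp16 cache: one token is one direct read
    return HEAD_DIM * 2

def delta_recall_bytes(pos_in_block):
    start_bytes = HEAD_DIM * 2                 # 16-bit start vector
    delta_bytes = pos_in_block * HEAD_DIM * 1  # 8-bit deltas up to this token
    return start_bytes + delta_bytes

# Average cost of recalling a uniformly random token within a block:
avg_delta = sum(delta_recall_bytes(p) for p in range(K)) / K
# vanilla_bytes() -> 256 bytes; avg_delta -> 4288.0 bytes
```

Under these assumptions a random single-token recall touches over an order of magnitude more bytes than the vanilla cache, even though the stored cache is smaller; sequential full-cache scans during decode fare much better because the running sum is already in hand.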
Nice, although perhaps slightly academic, given that good KV cache compression algorithms already exist; the frontier labs have probably been using them for a long time. Nice to have it in llama.cpp, though.
I'm curious who "we" refers to. I can't see any authorship information or a paper and this is the user's only repository. Maybe it doesn't need one. Also interesting that it was developed and tested on AMD hardware.
The main utility of this, beyond just saving money for model servers, would be deliberately prefilling very long contexts and saving them to fast flash, so you can later quickly load and query them. I think only Anthropic's API gives enough control to do this today, maybe Google's; OpenAI's makes caching fully implicit. Think one or two prompts per codebase, so you can then query the entire codebase in parallel with questions, without needing grepping or RAG. Modern serving pipelines all use disaggregated prefill as far as I know, so there are inter-machine transfers anyway, and it directly saves on GPU cost.