Skipping 90% of KV dequant work speeds up LLM decode by 22%

(github.com)

1 point | by pidtom 6 hours ago

1 comment

pidtom 6 hours ago

I’ve been working on KV cache compression and ran into a dequant bottleneck at long context.
I first tried optimizing the dequant kernel directly and tested ~14 approaches, but none beat the baseline on Apple Silicon.
What ended up working was skipping value dequant for positions with negligible attention weight.
Flash attention computes weights before V accumulation, so you already know which positions won’t contribute.
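To make the idea concrete, here is a minimal pure-Python sketch of that skip, not the actual kernel. Everything in it is an assumption for illustration: the toy 8-bit quantizer, the `threshold` parameter and its value, the function names, and the synthetic cache. It also computes a full softmax up front, whereas a real flash attention kernel works tile by tile with a running max.

```python
import math

def quantize_row(row, levels=127):
    """Toy symmetric 8-bit quantization of one V row: returns (ints, scale)."""
    scale = max(abs(x) for x in row) / levels or 1.0  # fall back to 1.0 for all-zero rows
    return [round(x / scale) for x in row], scale

def dequantize_row(qrow, scale):
    """The per-position work that the skip avoids."""
    return [v * scale for v in qrow]

def decode_attention(query, keys, quantized_values, threshold):
    """Single-query attention over a quantized V cache.

    Positions whose softmax weight is below `threshold` (a hypothetical
    tuning knob) are skipped before their V row is dequantized.
    Returns (output_vector, rows_dequantized).
    """
    inv_sqrt_d = 1.0 / math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) * inv_sqrt_d for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)

    out = [0.0] * len(quantized_values[0][0])
    rows_dequantized = 0
    for e, (qrow, scale) in zip(exps, quantized_values):
        weight = e / total
        if weight < threshold:
            continue  # weight is known before V is touched, so skip the dequant
        rows_dequantized += 1
        for i, v in enumerate(dequantize_row(qrow, scale)):
            out[i] += weight * v
    return out, rows_dequantized

# Synthetic cache: 4 positions align strongly with the query, the rest do not.
n_pos, dim = 256, 16
query = [1.0] * dim
keys = [[3.0] * dim if p < 4 else [math.sin(31 * p + 17 * j) for j in range(dim)]
        for p in range(n_pos)]
values = [[math.cos(p + j) for j in range(dim)] for p in range(n_pos)]
quantized_values = [quantize_row(v) for v in values]

dense_out, dense_rows = decode_attention(query, keys, quantized_values, threshold=0.0)
sparse_out, sparse_rows = decode_attention(query, keys, quantized_values, threshold=1e-3)
max_err = max(abs(a - b) for a, b in zip(dense_out, sparse_out))
print(f"dequantized {sparse_rows}/{dense_rows} rows, max abs error {max_err:.2e}")
```

The error from skipping is bounded by the total skipped weight times the largest V magnitude, so the threshold trades accuracy for dequant work. The synthetic data here is deliberately peaked and says nothing about real skip rates.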
At 32K context:
- ~90% of positions can be skipped
- 22.8% faster decode (turbo3 KV)
- ~5% faster even with q8_0 KV
- no change in perplexity (PPL)
- needle-in-a-haystack (NIAH) retrieval improved (less quantization noise in the accumulation)
Also validated on an M2 Pro; CUDA testing is in progress.
Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/pap...