
TurboQuant Explained: The Paper That Shrunk AI Memory 6x

Google just compressed the KV cache by 6x with ZERO accuracy loss and made attention 8x faster on H100 GPUs. No retraining. No calibration. Just a 2-page math proof and a 1-bit residual trick that's been hiding since 1963.

I'm breaking down TurboQuant (ICLR 2026) from Google Research, end to end: what the KV cache actually is, why it's the real bottleneck in modern LLM inference, and exactly how PolarQuant + QJL combine to get within 2.7x of Shannon's information-theoretic lower bound.
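To see why the cache, not the weights, is the bottleneck, here's a back-of-envelope calculation. The configuration is a stand-in I picked for illustration (a Llama-2-7B-style model: 32 layers, 32 KV heads of dimension 128, FP16), not numbers from the paper or the video:

```python
# Back-of-envelope KV cache sizing for a hypothetical Llama-2-7B-style model.
# Assumed config (mine, illustrative): 32 layers, 32 KV heads, head_dim 128, FP16.
layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2

per_token = 2 * layers * heads * head_dim * bytes_fp16   # factor 2: one K and one V
print(per_token / 2**20, "MiB per token")                # 0.5 MiB

ctx = 32_768
fp16_cache = per_token * ctx
print(fp16_cache / 2**30, "GiB at 32k context")          # 16.0 GiB

print(fp16_cache / 6 / 2**30, "GiB after 6x compression")  # ~2.67 GiB
```

Put another way, 6x compression of FP16 works out to roughly 16 / 6 ≈ 2.7 bits per cached value.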

Paper: https://arxiv.org/abs/2504.19874
Google research blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
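If you want a feel for the two-stage recipe before watching, here is a minimal numpy sketch in the same spirit: rotate with a random orthogonal matrix, coarsely quantize, then keep only the sign of the residual plus its norm. This is my illustration of the general idea, not the paper's algorithm: PolarQuant's polar-coordinate Lloyd-Max codebooks are replaced by a crude uniform scalar quantizer, and the QJL residual step is approximated by per-coordinate signs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Random rotation shared by encoder and decoder. The paper uses fast structured
# rotations; a dense QR factor is enough for a sketch.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def encode(x):
    z = Q @ x                                    # rotated coordinates look Gaussian
    scale = np.abs(z).max() / 3.0                # crude per-vector scale
    coarse = np.clip(np.round(z / scale), -3, 3) # ~3-bit uniform scalar quantizer
    residual = z - coarse * scale
    signs = residual >= 0                        # 1 bit per coordinate
    return coarse.astype(np.int8), scale, signs, np.linalg.norm(residual)

def decode(coarse, scale, signs, res_norm):
    # Rebuild the residual from its signs: sign(r)/sqrt(d), rescaled to the
    # stored norm, is a decent direction estimate for a rotated residual.
    res_hat = np.where(signs, 1.0, -1.0) / np.sqrt(d) * res_norm
    return Q.T @ (coarse * scale + res_hat)      # undo the rotation

x = rng.standard_normal(d)
x_hat = decode(*encode(x))
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Even in this toy version, the storage accounting shows where the win comes from: ~3 bits of coarse code plus 1 sign bit per coordinate, plus two scalars per vector, versus 16 bits per coordinate in FP16.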

Timestamps:
00:00 The 6x / 0% headline
00:38 What is the KV cache
01:30 Parameters plateaued, but the cache didn't
02:17 Why every previous method (SnapKV, KIVI, PQ) failed
03:16 The big idea: random rotation
04:03 Stage 1: PolarQuant + Lloyd-Max quantizers
05:07 Stage 2: The 1-bit QJL residual trick
06:13 Within 2.7x of Shannon's limit
07:16 The benchmarks: LongBench parity + Needle-in-Haystack
08:22 Production impact: vLLM, llama.cpp, and the chip stocks
09:10 My take + what's next
10:07 Outro

Video from the Sebastian Buzdugan channel.