
Day-1 TurboQuant in llama.cpp: 6X Smaller KV Cache After Reading the Actual Paper

I extended the first CUDA implementation of TurboQuant in llama.cpp from 2-bit to 3-bit quantization, added V cache compression and flash attention support, and achieved 4.57x KV cache compression.
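One way to sanity-check the 4.57x figure: assuming the 3-bit values are stored in 32-element blocks with a single fp16 scale per block (an assumed layout for illustration, not necessarily the exact format in the repo), the effective footprint is 3.5 bits per element, and 16 / 3.5 ≈ 4.57x versus an fp16 cache. A minimal sketch of that arithmetic:

```cpp
// Sanity check on the compression ratio. The block layout here (32 values per
// block, one fp16 scale) is an assumption for illustration, not the repo's
// exact on-disk format.
#include <cstdio>

int main() {
    const int block_size = 32;  // assumed elements per quantization block
    const int value_bits = 3;   // 3-bit quantized values
    const int scale_bits = 16;  // assumed: one fp16 scale per block

    const double bits_per_element =
        (block_size * value_bits + scale_bits) / (double) block_size;  // 3.5
    const double compression_vs_fp16 = 16.0 / bits_per_element;        // ~4.57

    printf("bits/element: %.2f  compression vs fp16: %.2fx\n",
           bits_per_element, compression_vs_fp16);
    return 0;
}
```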

I wrote the first paper documenting it and published the repo.

And here is a video walking through what I did.

This video breaks down what TurboQuant is, why it matters, and how I implemented it — explained for builders, not just researchers.

What's covered:
- Why large language models run out of memory (the KV cache problem)
- How quantization compresses the model's "short-term memory"
- The 3 phases: K cache → V cache → flash attention integration
- Real benchmarks: 72K context on dual RTX 3090s (normally impossible; rough memory math in the sketch after this list)
- The one constant that broke everything (and how I found it)
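For the 72K-context point, the KV cache footprint can be estimated as 2 (K and V) x n_layers x n_kv_heads x head_dim x n_ctx x bytes_per_element. The sketch below plugs in placeholder dimensions for a 13B-class model without grouped-query attention (not necessarily the model actually benchmarked) to show the scale of an fp16 cache at 72K tokens versus a ~3.5-bit quantized one:

```cpp
// Rough KV cache sizing at 72K context. The model dimensions are placeholders
// (roughly a 13B-class MHA model), not the config used in the benchmarks.
#include <cstdio>

int main() {
    const long long n_layers   = 40;        // placeholder
    const long long n_kv_heads = 40;        // placeholder (no GQA)
    const long long head_dim   = 128;       // placeholder
    const long long n_ctx      = 72 * 1024; // 72K tokens

    // K and V, each n_layers * n_kv_heads * head_dim * n_ctx elements.
    const long long elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx;

    const double fp16_bytes      = 2.0;        // 16-bit baseline
    const double quantized_bytes = 3.5 / 8.0;  // ~3.5 bits/element (see above)

    const double gib = 1024.0 * 1024.0 * 1024.0;
    printf("fp16 KV cache:      %.1f GiB\n", elems * fp16_bytes / gib);
    printf("quantized KV cache: %.1f GiB\n", elems * quantized_bytes / gib);
    return 0;
}
```

With these placeholder dimensions the fp16 cache alone comes out to roughly 56 GiB, beyond the 48 GB of two RTX 3090s, while at ~3.5 bits per element it drops to roughly 12 GiB.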

Paper: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html

Repo: https://github.com/animehacker/llama-turboquant

Original TurboQuant paper: https://arxiv.org/abs/2504.19874

Video from the channel Oliver Church - AI News & Insights.