Day-1 TurboQuant in llama.cpp: 6X Smaller KV Cache After Reading the Actual Paper
I extended the first CUDA implementation of TurboQuant in llama.cpp from 2-bit to 3-bit quantization, added V cache compression and flash attention support, and achieved 4.57x KV cache compression.
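As a rough sanity check on what a ratio like that means in memory terms, here is a minimal sizing sketch. The model dimensions (layers, KV heads, head dimension) are hypothetical placeholders, not numbers from the video or the paper; the point is only how an effective bits-per-element reduction maps onto KV cache size at a 72K context.

```cpp
// Rough KV-cache sizing sketch (illustrative only; model dims are made up).
#include <cstdio>

int main() {
    const long n_layers   = 32;        // hypothetical transformer depth
    const long n_kv_heads = 8;         // hypothetical KV heads (GQA)
    const long head_dim   = 128;       // hypothetical head dimension
    const long n_ctx      = 72 * 1024; // context length from the benchmark claim

    // elements per token = 2 (K and V) * layers * kv_heads * head_dim
    const long   elems_per_token = 2 * n_layers * n_kv_heads * head_dim;
    const double total_elems     = (double)elems_per_token * n_ctx;

    const double fp16_bits  = 16.0;
    const double quant_bits = fp16_bits / 4.57; // effective bits at the claimed 4.57x ratio

    printf("FP16 KV cache    : %.2f GiB\n", total_elems * fp16_bits  / 8 / (1024.0 * 1024 * 1024));
    printf("Quantized cache  : %.2f GiB\n", total_elems * quant_bits / 8 / (1024.0 * 1024 * 1024));
    return 0;
}
```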
I wrote the first paper documenting it and published the repo.
This video walks through what I did: it breaks down what TurboQuant is, why it matters, and how I implemented it, explained for builders, not just researchers.
What's covered:
- Why large language models run out of memory (the KV cache problem)
- How quantization compresses the model's "short-term memory" (a simple sketch of this idea follows the list)
- The 3 phases: K cache → V cache → flash attention integration
- Real benchmarks: 72K context on dual RTX 3090s (normally impossible)
- The one constant that broke everything (and how I found it)
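Following up on the quantization item above: here is a minimal per-block absmax quantization sketch in C++. It is a generic illustration of storing values at roughly 3 bits plus a per-block scale, not the TurboQuant algorithm from the paper; the block size, rounding, and packing (one code per byte) are assumptions chosen to keep the example short.

```cpp
// Generic per-block absmax quantization to 3 bits.
// Illustrative only: NOT the TurboQuant algorithm, just the basic
// quantize/dequantize shape of a low-precision KV cache entry.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Block3 {
    float   scale; // per-block scale (absmax / max quant level)
    uint8_t q[32]; // 3-bit codes, stored one per byte for simplicity
};

Block3 quantize_block(const float *x) {
    Block3 b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    b.scale = amax / 3.0f; // signed levels -3..+3 (7 levels fit in 3 bits)
    const float inv = b.scale > 0 ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int v = (int)std::lround(x[i] * inv); // round to nearest level
        if (v < -3) v = -3;
        if (v >  3) v =  3;
        b.q[i] = (uint8_t)(v + 3);            // shift to unsigned 0..6
    }
    return b;
}

void dequantize_block(const Block3 &b, float *out) {
    for (int i = 0; i < 32; ++i) out[i] = ((int)b.q[i] - 3) * b.scale;
}

int main() {
    std::vector<float> x(32), y(32);
    for (int i = 0; i < 32; ++i) x[i] = std::sin(0.3f * i); // toy data
    Block3 b = quantize_block(x.data());
    dequantize_block(b, y.data());
    printf("original x[5] = %.4f, reconstructed = %.4f\n", x[5], y[5]);
    return 0;
}
```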
Paper: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html
Repo: https://github.com/animehacker/llama-turboquant
Original TurboQuant paper: https://arxiv.org/abs/2504.19874
Video "Day-1 TurboQuant in llama.cpp: 6X Smaller KV Cache After Reading the Actual Paper" from the channel Oliver Church - AI News & Insights
Video information
Published: April 1, 2026, 22:04:57
Duration: 00:12:26