- Популярные видео
- Авто
- Видео-блоги
- ДТП, аварии
- Для маленьких
- Еда, напитки
- Животные
- Закон и право
- Знаменитости
- Игры
- Искусство
- Комедии
- Красота, мода
- Кулинария, рецепты
- Люди
- Мото
- Музыка
- Мультфильмы
- Наука, технологии
- Новости
- Образование
- Политика
- Праздники
- Приколы
- Природа
- Происшествия
- Путешествия
- Развлечения
- Ржач
- Семья
- Сериалы
- Спорт
- Стиль жизни
- ТВ передачи
- Танцы
- Технологии
- Товары
- Ужасы
- Фильмы
- Шоу-бизнес
- Юмор
PagedAttention: how vLLM packs 4x more chats on a GPU
Most LLM servers waste 60% of GPU memory. PagedAttention fixes it.
Why the waste? Every reply builds a KV cache — the model's memory of past tokens. Reply length is unknown: 20 tokens, or 2000. So the server reserves the worst case for every request. Fill 20 slots, leave 1980 empty. Stack a few users and the GPU is full of nothing.
PagedAttention borrows from your OS. Your operating system never gives a program one giant chunk of RAM — it hands out small fixed-size pages and tracks who owns which. PagedAttention does the same for the KV cache: each request grabs 16-token pages, only as it needs them. And when two chats share the same system prompt? They point at the same pages — identical bytes, stored once.
Fragmentation drops from 60% to under 4%. Throughput climbs 2-4x on the same hardware. vLLM, TGI, SGLang, TensorRT-LLM all ship PagedAttention by default. One paper rewrote how every LLM gets served.
The KV cache is virtual memory. Pages beat slabs.
Music: Markvard - Time [NCS Release] (NoCopyrightSounds)
https://ncs.io
#ai #vllm #pagedattention #llm #shorts
Видео PagedAttention: how vLLM packs 4x more chats on a GPU канала ProCode
Why the waste? Every reply builds a KV cache — the model's memory of past tokens. Reply length is unknown: 20 tokens, or 2000. So the server reserves the worst case for every request. Fill 20 slots, leave 1980 empty. Stack a few users and the GPU is full of nothing.
PagedAttention borrows from your OS. Your operating system never gives a program one giant chunk of RAM — it hands out small fixed-size pages and tracks who owns which. PagedAttention does the same for the KV cache: each request grabs 16-token pages, only as it needs them. And when two chats share the same system prompt? They point at the same pages — identical bytes, stored once.
Fragmentation drops from 60% to under 4%. Throughput climbs 2-4x on the same hardware. vLLM, TGI, SGLang, TensorRT-LLM all ship PagedAttention by default. One paper rewrote how every LLM gets served.
The KV cache is virtual memory. Pages beat slabs.
Music: Markvard - Time [NCS Release] (NoCopyrightSounds)
https://ncs.io
#ai #vllm #pagedattention #llm #shorts
Видео PagedAttention: how vLLM packs 4x more chats on a GPU канала ProCode
Комментарии отсутствуют
Информация о видео
30 мая 2026 г. 17:26:29
00:01:16
Другие видео канала













![Why does [] == false return true in JavaScript?](https://i.ytimg.com/vi/ZjhuTLV4ixc/default.jpg)







