PagedAttention: how vLLM packs 4x more chats on a GPU

Naive LLM servers waste 60% or more of their KV-cache memory. PagedAttention fixes it.

Why the waste? Every reply builds a KV cache: the attention keys and values for every token so far. Reply length is unknown in advance: 20 tokens, or 2000. So the server reserves one contiguous worst-case chunk for every request. Fill 20 slots, leave 1980 empty. Stack a few users and the GPU is full of nothing.

PagedAttention borrows from your OS. Your operating system never gives a program one giant chunk of RAM — it hands out small fixed-size pages and tracks who owns which. PagedAttention does the same for the KV cache: each request grabs 16-token pages, only as it needs them. And when two chats share the same system prompt? They point at the same pages — identical bytes, stored once.
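The paging idea above can be sketched in a few lines of Python. This is a toy model, not vLLM's code: `PagePool`, `Sequence`, `fork` and the reference-counting scheme are hypothetical names chosen for illustration, and a "page" here is just an integer ID rather than real GPU memory. Only the 16-token page size comes from the text.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per page, as in the text


class PagePool:
    """Hands out fixed-size pages and reference-counts them (hypothetical)."""

    def __init__(self, num_pages: int) -> None:
        self.free = list(range(num_pages))
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        page = self.free.pop()
        self.refcount[page] = 1
        return page

    def share(self, page: int) -> int:
        self.refcount[page] += 1
        return page

    def release(self, page: int) -> None:
        self.refcount[page] -= 1
        if self.refcount[page] == 0:
            del self.refcount[page]
            self.free.append(page)


@dataclass
class Sequence:
    """One chat: a page table mapping logical pages to physical page IDs."""

    pool: PagePool
    page_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % PAGE_SIZE == 0:
            # Last page is full (or there is none yet): grab a fresh page.
            self.page_table.append(self.pool.alloc())
        elif self.pool.refcount[self.page_table[-1]] > 1:
            # Last page is shared and partly filled: copy-on-write.
            shared = self.page_table[-1]
            self.page_table[-1] = self.pool.alloc()
            self.pool.release(shared)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # A second chat with the same prefix points at the same pages.
        shared = [self.pool.share(p) for p in self.page_table]
        return Sequence(self.pool, shared, self.num_tokens)


pool = PagePool(num_pages=8)
chat_a = Sequence(pool)
for _ in range(32):          # a 32-token system prompt fills 2 pages
    chat_a.append_token()
chat_b = chat_a.fork()       # same prompt, no new pages
chat_a.append_token()        # each chat now grows on its own page
chat_b.append_token()

print(chat_a.page_table[:2] == chat_b.page_table[:2])  # True
print(len(pool.refcount))                              # 4
```

Two chats of 33 tokens each occupy 4 pages instead of 6, because the two prompt pages are stored once and only the pages that diverge are private.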

KV-cache waste drops from 60% or more to under 4%. Throughput climbs 2-4x on the same hardware. vLLM, TGI, SGLang and TensorRT-LLM all ship some form of paged KV cache. One paper changed how LLMs get served.
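Why paging cuts waste so sharply: with worst-case reservation every request pays for the full 2000 slots, while with 16-token pages a request can only waste part of its last page, at most 15 slots. A rough sketch of that arithmetic follows. The three reply lengths are made up for illustration, and the result is not meant to reproduce the measured figures above.

```python
import math

PAGE_SIZE = 16      # tokens per page, from the text
MAX_TOKENS = 2000   # worst-case reservation, from the text's example


def reserved_slots(lengths):
    """Slots held when every request reserves the worst case up front."""
    return MAX_TOKENS * len(lengths)


def paged_slots(lengths):
    """Slots held when each request takes only whole pages it has touched."""
    return sum(math.ceil(n / PAGE_SIZE) * PAGE_SIZE for n in lengths)


def waste(used, allocated):
    """Fraction of allocated slots that hold no token."""
    return 1 - used / allocated


lengths = [20, 500, 2000]   # hypothetical reply lengths
used = sum(lengths)

print(f"{waste(used, reserved_slots(lengths)):.1%}")  # 58.0%
print(f"{waste(used, paged_slots(lengths)):.1%}")     # 0.9%
```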

The KV cache is virtual memory. Pages beat slabs.

Music: Markvard - Time [NCS Release] (NoCopyrightSounds)
https://ncs.io

#ai #vllm #pagedattention #llm #shorts

Video "PagedAttention: how vLLM packs 4x more chats on a GPU" from the ProCode channel