Prefill/Decode Disaggregation — AMD ATOM + ATOMesh (ROCm serving)

Prefill/decode disaggregation splits LLM inference into two phases — a compute-bound prefill and a memory-bound decode — and runs each on its own pool of GPUs.

Prefill reads the whole prompt in one parallel, compute-heavy pass; decode then emits one token at a time, bottlenecked by memory bandwidth. On a single worker the two phases collide: a long prefill stalls the decode queue, while memory-bound decodes leave the compute units idle. Disaggregation runs each phase on hardware tuned for its bottleneck and hands the KV cache from the prefill pool to the decode pool across the interconnect. AMD's new ATOM + ATOMesh stack brings the same prefill/decode split, along with KV-aware scheduling and an OpenAI-compatible API, to ROCm and Instinct GPUs.
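
To make the compute-bound versus memory-bound contrast concrete, here is a rough back-of-the-envelope sketch. It is not taken from ATOM or ATOMesh. It assumes a dense transformer, the common rule of thumb of about 2 FLOPs per parameter per token, 16-bit weights and KV values, and that every weight is read once per forward pass. The function names and the model dimensions (32 layers, 8 KV heads, head size 128) are hypothetical, chosen only for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Rule-of-thumb assumptions: ~2 FLOPs per parameter per token, each
    weight read once per pass. Attention FLOPs and KV cache reads are
    ignored, so treat the result as an order-of-magnitude estimate.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache that prefill hands to decode for one request.

    The leading factor of 2 covers keys and values.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Prefill processes the whole prompt in one pass; decode processes one token.
prefill_ai = arithmetic_intensity(tokens_per_pass=2048)
decode_ai = arithmetic_intensity(tokens_per_pass=1)

# Hypothetical model dimensions, for illustration only.
handoff = kv_cache_bytes(tokens=2048, layers=32, kv_heads=8, head_dim=128)

print(f"prefill: ~{prefill_ai:.0f} FLOPs per weight byte")
print(f"decode:  ~{decode_ai:.0f} FLOPs per weight byte")
print(f"KV cache to transfer: {handoff / 2**20:.0f} MiB")
```

Under these assumptions a 2048-token prefill performs roughly 2048 times more arithmetic per byte of weights than a single decode step, which is why the two phases favor different hardware balance points. The last figure is the per-request data the interconnect has to carry when the phases run on separate pools.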

Full explainer (interactive): https://learnaivisually.com/g/amd-atom-prefill-decode-disaggregation
Source: https://rocm.blogs.amd.com/software-tools-optimization/atomesh-inference/README.html

Learn AI & GPUs visually — free interactive courses at learnaivisually.com

#PrefillDecodeDisaggregation #LLM #AI #AMD

Video from the Learn AI Visually channel.