
Prefill/Decode Disaggregation — AMD ATOM + ATOMesh (ROCm serving)

Prefill/decode disaggregation splits LLM inference into two phases — a compute-bound prefill and a memory-bound decode — and runs each on its own pool of GPUs.
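One way to see why the two phases land on opposite sides of the compute/memory divide is to estimate arithmetic intensity (FLOPs per byte moved) for a single linear layer. The sketch below is a back-of-the-envelope model, not a measurement: the layer size, the 2-byte values, the prompt length, and the `RIDGE_POINT` threshold are all illustrative assumptions, and it ignores attention, KV-cache reads, and batching (batching many decode requests together raises decode intensity).

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one d_model x d_model linear layer applied to n_tokens."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_model * d_model
    # Weights are read once per pass, no matter how many tokens share them.
    weight_bytes = bytes_per_value * d_model * d_model
    # Activations: read the input and write the output for every token.
    activation_bytes = bytes_per_value * 2 * n_tokens * d_model
    return flops / (weight_bytes + activation_bytes)


def classify(intensity: float, ridge_point: float) -> str:
    """Roofline-style label: above the ridge point compute is the limit, below it memory is."""
    return "compute-bound" if intensity >= ridge_point else "memory-bound"


# Hypothetical hardware balance in FLOPs per byte (an assumption, not a spec for any GPU).
RIDGE_POINT = 200.0

prefill_intensity = arithmetic_intensity(n_tokens=2048, d_model=4096)  # whole prompt at once
decode_intensity = arithmetic_intensity(n_tokens=1, d_model=4096)      # one new token

print(f"prefill: {prefill_intensity:.1f} FLOPs/byte, {classify(prefill_intensity, RIDGE_POINT)}")
print(f"decode:  {decode_intensity:.3f} FLOPs/byte, {classify(decode_intensity, RIDGE_POINT)}")
```

Under these assumptions prefill reuses each weight across 2048 tokens, while decode reads the full weight matrix to produce a single token, so its intensity stays near 1 FLOP per byte.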

Prefill reads your whole prompt in one parallel, compute-heavy pass; decode then emits one token at a time, bottlenecked by memory bandwidth. On a single worker the two phases collide: a long prefill stalls the decode queue, while memory-bound decodes leave the compute units idle. Disaggregation runs each phase on hardware tuned for its bottleneck and hands the KV cache across the interconnect between them. AMD's new ATOM + ATOMesh stack brings this same prefill/decode split, KV-aware scheduling, and an OpenAI-compatible API to ROCm and Instinct GPUs.
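The collision described above can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions and does not use or represent the ATOM or ATOMesh API: all function names and time values are hypothetical, the co-located worker is assumed to run an arriving prefill to completion before its next decode step, and the KV-cache handoff is modeled as a fixed delay.

```python
def colocated_token_times(n_steps, step_time, prefill_arrival, prefill_time):
    """Token timestamps for a running request on one worker that also serves a new prefill."""
    clock = 0.0
    times = []
    prefill_done = False
    for _ in range(n_steps):
        if not prefill_done and clock >= prefill_arrival:
            clock += prefill_time  # assumed policy: prefill runs first, decode waits
            prefill_done = True
        clock += step_time
        times.append(clock)
    return times


def disaggregated_token_times(n_steps, step_time):
    """Token timestamps when prefill runs on a separate pool and never interrupts decode."""
    return [step_time * (i + 1) for i in range(n_steps)]


def max_gap(times):
    """Worst inter-token latency seen by the running request."""
    worst, prev = 0.0, 0.0
    for t in times:
        worst = max(worst, t - prev)
        prev = t
    return worst


# Hypothetical time units, chosen only to make the effect visible.
STEP_TIME, PREFILL_ARRIVAL, PREFILL_TIME, KV_TRANSFER = 1.0, 2.0, 8.0, 0.5

colocated = colocated_token_times(6, STEP_TIME, PREFILL_ARRIVAL, PREFILL_TIME)
disaggregated = disaggregated_token_times(6, STEP_TIME)

# The new request pays for prefill, the KV-cache handoff, and one decode step.
new_request_first_token = PREFILL_ARRIVAL + PREFILL_TIME + KV_TRANSFER + STEP_TIME

print("co-located worst gap:   ", max_gap(colocated))
print("disaggregated worst gap:", max_gap(disaggregated))
print("new request first token:", new_request_first_token)
```

In this model the running request sees a 9-unit stall on the shared worker and a steady 1-unit cadence once prefill moves to its own pool; the price is the extra KV-transfer delay on the new request's first token.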

Full explainer (interactive): https://learnaivisually.com/g/amd-atom-prefill-decode-disaggregation
Source: https://rocm.blogs.amd.com/software-tools-optimization/atomesh-inference/README.html

Learn AI & GPUs visually — free interactive courses at learnaivisually.com

#PrefillDecodeDisaggregation #LLM #AI #AMD

Video from the channel Learn AI Visually. Posted: yesterday, 20:54:15. Duration: 00:01:01. Tags: AI, LLM, on-device.


