The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026
From OpenXdata 2026 (Track 2): The Physics of LLM Inference at Scale by Suman Debnath.
🎤 SPEAKER
Suman Debnath, Technical Lead (ML), Anyscale
📝 ABSTRACT
While many developers can download a model from Hugging Face, few grasp why latency spikes the moment concurrent users hit an endpoint. This talk explores the "physics" of LLM inference, emphasizing that engineering a production-grade service requires understanding hardware constraints rather than just adding more GPUs. We will analyze the fundamental split between the prefill phase, which is compute-bound (GEMM) due to parallel prompt processing, and the decode phase, which is severely memory-bound (GEMV). Central to this is the KV Cache Crisis. To solve these bottlenecks, we will examine how continuous batching (iteration-level scheduling) and PagedAttention (block-based memory allocation) saturate GPUs. We bridge these single-GPU "physics" to distributed systems using Ray Serve. The session concludes with a live demo featuring a production-ready deployment of Ray Serve and vLLM.
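The compute-bound versus memory-bound split described above can be illustrated with rough arithmetic. The following is a minimal sketch, not material from the talk: it counts FLOPs and bytes moved for one weight-matrix multiply, and estimates KV cache size per sequence. The function names and the model dimensions (hidden size 4096, 32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical values chosen for illustration, and the model ignores on-chip caching, attention math, and batching.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) multiply.

    Assumes each operand is read once and the output is written once.
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # input activations
        + d_in * d_out     # weight matrix
        + tokens * d_out   # output activations
    )
    return flops / bytes_moved


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    prefill = arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096)
    decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)
    print(f"prefill (GEMM) intensity: {prefill:.1f} FLOPs/byte")
    print(f"decode  (GEMV) intensity: {decode:.2f} FLOPs/byte")

    cache = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"KV cache for one 4096-token sequence: {cache / 2**30:.1f} GiB")
```

Under these assumptions a 2048-token prefill performs about a thousand times more arithmetic per byte of memory traffic than a single-token decode step, which is why decode throughput tends to depend on memory bandwidth and on batching many sequences together, and why per-sequence KV cache size limits how many sequences fit on one GPU.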
📌 TOPICS
• LLM inference
• vLLM
• Ray Serve
• GPU optimization
• PagedAttention
• KV cache
⏱️ CHAPTERS
00:00 Intro: The Physics of LLM Inference
00:30 Life of a Token: How LLM Generation Works
03:00 Prefill vs Decode: Two Phases of Inference
06:30 Three Layers of Inference Engineering
08:30 KV Cache: Reducing Computation Costs
12:00 Speculative Decoding with Draft Models
18:00 Key Takeaways and Resources
18:30 Q&A and Closing
📺 FULL PLAYLIST
Watch all OpenXdata 2026 sessions: https://www.youtube.com/playlist?list=PLpRf3DoCIOgkrtZrihdnOaPjSqJz22Rqa
🌐 ABOUT OPENXDATA 2026
OpenXdata is a virtual conference exploring open data architectures, the lakehouse, and the infrastructure powering modern data + AI systems. Sessions span Apache Hudi, Iceberg, Spark, Polaris, agent infrastructure, and more. Learn more: https://www.openxdata.ai/
#LlmInference #Vllm #RayServe #GpuOptimization #OpenXdata2026
Video: The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026, from the OnehouseHQ channel
Video information
Published: May 12, 2026, 23:30:10
Duration: 00:21:28