
The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026

From OpenXdata 2026 (Track 2): The Physics of LLM Inference at Scale by Suman Debnath.

🎤 SPEAKER
Suman Debnath, Technical Lead (ML), Anyscale

📝 ABSTRACT
While many developers can download a model from Hugging Face, few grasp why latency spikes the moment concurrent users hit an endpoint. This talk explores the "physics" of LLM inference, emphasizing that engineering a production-grade service requires understanding hardware constraints rather than just adding more GPUs. We will analyze the fundamental split between the prefill phase, which is compute-bound (GEMM) due to parallel prompt processing, and the decode phase, which is severely memory-bound (GEMV). Central to this is the KV Cache Crisis. To solve these bottlenecks, we will examine how continuous batching (iteration-level scheduling) and PagedAttention (block-based memory allocation) saturate GPUs. We bridge these single-GPU "physics" to distributed systems using Ray Serve. The session concludes with a live demo featuring a production-ready deployment of Ray Serve and vLLM.
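As a rough illustration of the prefill/decode split described in the abstract, the sketch below estimates arithmetic intensity (FLOPs per byte moved) for a single weight-matrix multiply. It is a back-of-the-envelope model, not material from the talk: the hidden size of 4096, the prompt length of 2048, fp16 storage (2 bytes per element), and the function name `arithmetic_intensity` are all assumptions chosen for illustration, and attention and cache effects are ignored.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) multiply.

    Simplified model: each operand is read once and the result is written
    once; fp16 (2 bytes per element) is assumed by default.
    """
    # One multiply and one add per (token, input, output) triple.
    flops = 2 * tokens * d_in * d_out
    # Activations in + weights + activations out.
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved


D = 4096  # hypothetical hidden size, not taken from the talk

# Prefill: all prompt tokens go through the weights together (GEMM).
prefill = arithmetic_intensity(tokens=2048, d_in=D, d_out=D)

# Decode: one new token per step for a single request (GEMV).
decode = arithmetic_intensity(tokens=1, d_in=D, d_out=D)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.3f} FLOPs/byte")
```

Under these assumptions prefill does on the order of a thousand FLOPs per byte moved while single-request decode does roughly one, which is why decode tends to be limited by memory bandwidth rather than compute. Batching decode steps across requests raises the `tokens` dimension and therefore the intensity, which is the intuition behind continuous batching.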

📌 TOPICS
• LLM inference
• vLLM
• Ray Serve
• GPU optimization
• PagedAttention
• KV cache

⏱️ CHAPTERS
00:00 Intro: The Physics of LLM Inference
00:30 Life of a Token: How LLM Generation Works
03:00 Prefill vs Decode: Two Phases of Inference
06:30 Three Layers of Inference Engineering
08:30 KV Cache: Reducing Computation Costs
12:00 Speculative Decoding with Draft Models
18:00 Key Takeaways and Resources
18:30 Q&A and Closing

📺 FULL PLAYLIST
Watch all OpenXdata 2026 sessions: https://www.youtube.com/playlist?list=PLpRf3DoCIOgkrtZrihdnOaPjSqJz22Rqa

🌐 ABOUT OPENXDATA 2026
OpenXdata is a virtual conference exploring open data architectures, the lakehouse, and the infrastructure powering modern data + AI systems. Sessions span Apache Hudi, Iceberg, Spark, Polaris, agent infrastructure, and more. Learn more: https://www.openxdata.ai/
#LlmInference #Vllm #RayServe #GpuOptimization #OpenXdata2026

Video "The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026" from the OnehouseHQ channel