The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026

From OpenXdata 2026 (Track 2): The Physics of LLM Inference at Scale by Suman Debnath.

🎤 SPEAKER
Suman Debnath, Technical Lead (ML), Anyscale

📝 ABSTRACT
While many developers can download a model from Hugging Face, few grasp why latency spikes the moment concurrent users hit an endpoint. This talk explores the "physics" of LLM inference, emphasizing that engineering a production-grade service requires understanding hardware constraints rather than just adding more GPUs. We will analyze the fundamental split between the prefill phase, which is compute-bound (GEMM) due to parallel prompt processing, and the decode phase, which is severely memory-bound (GEMV). Central to this is the KV Cache Crisis: the key-value cache grows with every token and every concurrent request, and it competes with the model weights for GPU memory. To address these bottlenecks, we will examine how continuous batching (iteration-level scheduling) and PagedAttention (block-based memory allocation) keep GPUs saturated. We bridge these single-GPU "physics" to distributed systems using Ray Serve. The session concludes with a live demo featuring a production-ready deployment of Ray Serve and vLLM.
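
The prefill/decode split and the KV cache pressure described in the abstract can be illustrated with some back-of-envelope arithmetic. The following is a minimal sketch, not material from the talk: the function names and the model dimensions (hidden size 4096, 32 layers, 32 KV heads, head dimension 128, fp16) are assumptions chosen only for illustration, and the cost model ignores on-chip caching and kernel fusion.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul.

    Simplified model: every operand is read once and the result written once.
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # activations read
        + d_in * d_out     # weights read
        + tokens * d_out   # outputs written
    )
    return flops / bytes_moved


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer, token, and request."""
    # Leading factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical dimensions, assumed for illustration only.
HIDDEN = 4096

# Prefill: a 2048-token prompt is processed in parallel (matrix-matrix, GEMM).
prefill_intensity = arithmetic_intensity(2048, HIDDEN, HIDDEN)

# Decode: one new token per step (matrix-vector, GEMV).
decode_intensity = arithmetic_intensity(1, HIDDEN, HIDDEN)

# KV cache for a single 4096-token sequence in fp16.
cache_one_request = kv_cache_bytes(32, 32, 128, 4096, 1)

print(f"prefill FLOPs/byte: {prefill_intensity:.1f}")
print(f"decode  FLOPs/byte: {decode_intensity:.3f}")
print(f"KV cache per request: {cache_one_request / 2**30:.1f} GiB")
```

Under these assumptions the prefill matmul performs roughly a thousand times more arithmetic per byte moved than the decode step, which is why decode tends to be limited by memory bandwidth rather than compute. The cache size scales linearly with both sequence length and batch size, so concurrent requests multiply the per-request figure.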

📌 TOPICS
• LLM inference
• vLLM
• Ray Serve
• GPU optimization
• PagedAttention
• KV cache

⏱️ CHAPTERS
00:00 Intro: The Physics of LLM Inference
00:30 Life of a Token: How LLM Generation Works
03:00 Prefill vs Decode: Two Phases of Inference
06:30 Three Layers of Inference Engineering
08:30 KV Cache: Reducing Computation Costs
12:00 Speculative Decoding with Draft Models
18:00 Key Takeaways and Resources
18:30 Q&A and Closing

📺 FULL PLAYLIST
Watch all OpenXdata 2026 sessions: https://www.youtube.com/playlist?list=PLpRf3DoCIOgkrtZrihdnOaPjSqJz22Rqa

🌐 ABOUT OPENXDATA 2026
OpenXdata is a virtual conference exploring open data architectures, the lakehouse, and the infrastructure powering modern data + AI systems. Sessions span Apache Hudi, Iceberg, Spark, Polaris, agent infrastructure, and more. Learn more: https://www.openxdata.ai/
#LlmInference #Vllm #RayServe #GpuOptimization #OpenXdata2026

Video: The Physics of LLM Inference at Scale | Suman Debnath (Anyscale) | OpenXdata 2026, from the OnehouseHQ channel

Video information

Published: May 12, 2026, 23:30:10

Duration: 00:21:28

Channel: OnehouseHQ
