Lightning Talk: Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving - Maroon Ayoub, IBM Research & Hyunkyun Moon, Moreh

Agentic AI workloads - tree-of-thought exploration, ReAct loops, hierarchical swarms - expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. We are currently evicting high-value state to preserve throwaway tokens.

In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.

We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch, applying differentiated policies to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax" - evolving the PyTorch serving stack from request-centric to workload-aware.
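The differentiated policies described above can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the llm-d or vLLM implementation: the names `BlockType`, `Block`, and `TypedKVCache` are hypothetical, cross-node replication of tool definitions is not modeled (they are simply evicted last among unpinned blocks), and the ordering "completed branches first, then live branches, then tools" is one plausible reading of the abstract.

```python
from collections import OrderedDict
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class BlockType(IntEnum):
    """Hypothetical block tags; lower values are evicted earlier."""
    REASONING_BRANCH = 0   # ephemeral scratchpad
    TOOL_DEFINITION = 1    # shared across agents, kept as long as possible
    SYSTEM_PROMPT = 2      # durable, pinned (never evicted here)


@dataclass
class Block:
    block_id: str
    kind: BlockType
    completed: bool = False  # set once a reasoning branch has finished


class TypedKVCache:
    """Toy typed block cache; illustrative only, not a real serving API."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Insertion/access order doubles as recency: oldest entry first.
        self.blocks: "OrderedDict[str, Block]" = OrderedDict()

    def touch(self, block_id: str) -> None:
        """Mark a block as most recently used."""
        self.blocks.move_to_end(block_id)

    def complete_branch(self, block_id: str) -> None:
        """Flag a reasoning branch as finished so it is evicted eagerly."""
        self.blocks[block_id].completed = True

    def _pick_victim(self) -> str:
        best_key, best_id = None, None
        for age_rank, block in enumerate(self.blocks.values()):
            if block.kind is BlockType.SYSTEM_PROMPT:
                continue  # pinned
            # Completed blocks sort first, then by type, then oldest first.
            key = (not block.completed, block.kind, age_rank)
            if best_key is None or key < best_key:
                best_key, best_id = key, block.block_id
        if best_id is None:
            raise MemoryError("cache holds only pinned blocks")
        return best_id

    def insert(self, block: Block) -> List[str]:
        """Add a block, evicting by type-aware policy; return evicted ids."""
        evicted: List[str] = []
        while len(self.blocks) >= self.capacity:
            victim = self._pick_victim()
            del self.blocks[victim]
            evicted.append(victim)
        self.blocks[block.block_id] = block
        return evicted


cache = TypedKVCache(capacity=3)
cache.insert(Block("sys", BlockType.SYSTEM_PROMPT))
cache.insert(Block("tools", BlockType.TOOL_DEFINITION))
cache.insert(Block("branch-1", BlockType.REASONING_BRANCH))
cache.complete_branch("branch-1")
evicted = cache.insert(Block("branch-2", BlockType.REASONING_BRANCH))
print(evicted)  # ['branch-1']
```

Under plain LRU, the same sequence would evict `sys`, the oldest block, which is exactly the high-value state the talk argues should be kept. In this sketch the completed branch goes first, even though it is the most recently inserted.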

Video information

April 21, 2026, 1:21:43

00:10:27

PyTorch
