Inside MAX Serve: From Prompt to Response

MAX Serve is Modular's open-source inference server. In this interview, AI Performance Engineer Kyle Caverly walks through what happens from the moment a request arrives to the moment text streams back to the client.

You can explore Kyle's diagram here:
https://drive.google.com/file/d/1Zigsjtq37lUqp9_YmUU_T53CIgB-Duz_/view?usp=drive_link

All of the code discussed is open source. Start with the MAX Serve repository: https://github.com/modular/modular/tree/main/max/python/max/serve

0:00 Intro
0:35 MAX Serve architecture
3:27 API server receives request
5:48 Server creates TextContext object
9:23 Request reaches the model worker
12:25 Construct the batch via TextBatchConstructor
15:48 Prefix caching and chunked prefill
21:23 Pipeline execution
25:24 Consuming completed tokens
27:43 Post-process output and prepare response
29:39 Client receives response
30:51 Multimodality
33:28 Open source code
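
The batch construction and chunked prefill steps named in the chapters above can be illustrated with a small toy model. This is a sketch under stated assumptions, not MAX Serve's code: `ToyContext`, `build_batch`, `run_step`, and the `token_budget` parameter are hypothetical names invented for illustration, and the real `TextContext` and `TextBatchConstructor` may work quite differently. The idea shown is only the general technique: each scheduling step has a fixed token budget, decoding requests consume one token each, and a long prompt is prefilled in chunks that fit whatever budget remains.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContext:
    """Hypothetical per-request state; NOT the real TextContext."""
    request_id: str
    prompt_tokens: list[int]
    processed: int = 0  # number of prompt tokens already prefilled
    generated: list[int] = field(default_factory=list)

    @property
    def in_prefill(self) -> bool:
        # A request is still in prefill until its whole prompt is processed.
        return self.processed < len(self.prompt_tokens)


def build_batch(contexts, token_budget):
    """Return (context, num_tokens) pairs that fit within token_budget.

    Assumption: requests are visited in list order. A request in prefill
    takes as large a chunk of its prompt as the remaining budget allows
    (chunked prefill); a decoding request costs exactly one token.
    """
    batch = []
    remaining = token_budget
    for ctx in contexts:
        if remaining == 0:
            break
        if ctx.in_prefill:
            n = min(remaining, len(ctx.prompt_tokens) - ctx.processed)
        else:
            n = 1
        batch.append((ctx, n))
        remaining -= n
    return batch


def run_step(batch):
    """Stand-in for pipeline execution: advance every scheduled context."""
    for ctx, n in batch:
        if ctx.in_prefill:
            ctx.processed += n
            if not ctx.in_prefill:
                # Prefill just finished, so the first output token appears.
                ctx.generated.append(0)  # 0 is a placeholder token id
        else:
            ctx.generated.append(0)


if __name__ == "__main__":
    long_req = ToyContext("long", list(range(10)))
    short_req = ToyContext("short", list(range(3)))
    for step in range(3):
        batch = build_batch([long_req, short_req], token_budget=8)
        print(step, [(c.request_id, n) for c, n in batch])
        run_step(batch)
```

With a budget of 8 tokens, the 10-token prompt is split across two steps, and the short request joins the second step using the leftover budget. A real scheduler would also account for cache memory, priorities, and reuse of cached prefixes, none of which are modeled here.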

Video "Inside MAX Serve: From Prompt to Response" from the Modular channel

Video information

April 14, 2026, 21:06:10

00:34:38
