Inside MAX Serve: From Prompt to Response

MAX Serve is Modular's open-source inference server. In this interview, AI Performance Engineer Kyle Caverly walks through what happens from the moment a request arrives to the moment text streams back to the client.

You can explore Kyle's diagram here:
https://drive.google.com/file/d/1Zigsjtq37lUqp9_YmUU_T53CIgB-Duz_/view?usp=drive_link

All of the code discussed is open source. Start with the MAX Serve repository: https://github.com/modular/modular/tree/main/max/python/max/serve

0:00 Intro
0:35 MAX Serve architecture
3:27 API server receives request
5:48 Server creates TextContext object
9:23 Request reaches the model worker
12:25 Construct the batch via TextBatchConstructor
15:48 Prefix caching and chunked prefill
21:23 Pipeline execution
25:24 Consuming completed tokens
27:43 Post-process output and prepare response
29:39 Client receives response
30:51 Multimodality
33:28 Open source code

Video "Inside MAX Serve: From Prompt to Response" from the Modular channel